LocoMimic | Vyvaswath Kalva

Introduction to Robot Learning course project | Carnegie Mellon University | Spring 2026
My focus: off-policy RL

On a single environment, SAC matched and even beat PPO for the same number of environment interactions, but scaling to thousands of parallel environments made it collapse mid-training. With a combination of fixes, MeanSAC eliminated the collapses and raised mean return from 13.3 to 30.0.

Teaching a robot to walk is hard. Teaching it to walk like a human is harder. Most locomotion and whole-body control approaches rely on hand-crafted reward functions that tell the robot what good walking looks like: stay upright, move forward, don’t fall. We take a different approach. Instead of defining walking, we show it. The robot learns to imitate the motion directly from a clip.

We train a policy using reinforcement learning on the Unitree G1 humanoid in MuJoCo. The reward function is built on ideas from DeepMimic [1] and BeyondMimic [2]. At each timestep the robot is rewarded for how closely its body positions, orientations, and velocities match a reference walking motion from the LAFAN1 dataset [3].
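As a rough illustration of this kind of tracking reward, here is a minimal sketch in the DeepMimic style: each term is an exponentiated negative squared error, and the terms are combined as a weighted sum. The function names, weights, and scales are placeholders chosen for the example, not the values used in this project, and orientations are treated as plain vectors for simplicity (a real implementation would compare quaternions or rotation matrices).

```python
import math


def tracking_term(reference, actual, scale):
    """Return exp(-scale * squared error): 1.0 for a perfect match, decaying toward 0."""
    sq_err = sum((r - a) ** 2 for r, a in zip(reference, actual))
    return math.exp(-scale * sq_err)


def imitation_reward(ref, robot, weights=None, scales=None):
    """Weighted sum of position, orientation, and velocity tracking terms."""
    # Placeholder weights and scales (assumptions, not the project's tuned settings).
    weights = weights or {"pos": 0.4, "rot": 0.4, "vel": 0.2}
    scales = scales or {"pos": 10.0, "rot": 5.0, "vel": 0.1}
    return sum(
        weights[key] * tracking_term(ref[key], robot[key], scales[key])
        for key in weights
    )


# One reference frame and one slightly-off robot state (made-up numbers).
ref_frame = {"pos": [0.0, 0.0, 0.8], "rot": [0.0, 0.0, 0.0], "vel": [1.0, 0.0, 0.0]}
robot_state = {"pos": [0.0, 0.1, 0.7], "rot": [0.1, 0.0, 0.0], "vel": [0.5, 0.0, 0.0]}

perfect_reward = imitation_reward(ref_frame, ref_frame)
off_reward = imitation_reward(ref_frame, robot_state)
```

With these placeholder weights a perfect match scores 1.0, and any deviation from the reference frame lowers the reward smoothly rather than abruptly.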

We compare two algorithms, PPO [4] and SAC [5], but my focus was on getting SAC to work. SAC is an off-policy method that learns from stored past experience rather than discarding it after each update, making it substantially more sample efficient than on-policy methods like PPO. The catch is that this efficiency comes with fragility: the critic can overestimate Q-values in ways that quietly poison training.
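The overestimation problem can be seen in a toy setting that is separate from the actual training code. In the sketch below, every action has a true value of zero, but the critic's estimates are noisy; picking the best-looking action then yields an estimate that is biased upward. The discrete action set, noise level, and trial count are assumptions made for the illustration.

```python
import random

random.seed(0)


def max_of_noisy_estimates(num_actions=10, noise_std=1.0):
    """Return the max over noisy estimates when every true Q-value is 0."""
    estimates = [random.gauss(0.0, noise_std) for _ in range(num_actions)]
    return max(estimates)


trials = 10_000
# The true maximum is 0, yet the average of the noisy maxima is clearly positive.
avg_max = sum(max_of_noisy_estimates() for _ in range(trials)) / trials
```

Because the policy is trained to favor actions the critic rates highly, and those ratings feed back into the critic's own targets, this kind of upward bias can compound over training.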

I started small, with a single environment in MuJoCo. At that scale it worked: it matched and even beat PPO for the same number of environment interactions. The problem showed up when I scaled it up. Training fast means running thousands of environments in parallel to collect experience, and at that scale vanilla SAC broke.

Every SAC run followed the same pattern: reward peaks, collapses, and never recovers. Diagnosing it took a mix of architectural changes, training procedure adjustments, and a closer look at how the environment interacts with off-policy learning. The result is MeanSAC, which replaces the standard min-Q target with a mean-Q target and adds LayerNorm to the critic, following the recent FastSAC [6] and FastTD3 [7] recipes for fast, stable off-policy RL in humanoid control. Evaluated after 800 million training steps, these changes raised mean return from 13.3 to 30.0 and mean episode length from 237 to 490 steps.
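To make the two changes concrete, here is a minimal sketch of the soft TD target with either aggregation, plus a bare LayerNorm. It is written in plain Python on scalars for readability and is not the project's training code. The function names and the values of `gamma` and `alpha` are assumptions, and the LayerNorm omits the learnable gain and bias that a deep learning library would normally include.

```python
import math


def soft_td_target(reward, done, q1_next, q2_next, log_prob_next,
                   gamma=0.99, alpha=0.2, aggregate="min"):
    """Soft Bellman target from two critics' next-state estimates."""
    if aggregate == "min":
        # Standard SAC: pessimistic minimum over the two critics.
        q_next = min(q1_next, q2_next)
    elif aggregate == "mean":
        # Mean-Q variant: average the two critics instead.
        q_next = 0.5 * (q1_next + q2_next)
    else:
        raise ValueError(f"unknown aggregate: {aggregate}")
    # Entropy-regularized value of the next state.
    soft_value = q_next - alpha * log_prob_next
    return reward + gamma * (1.0 - done) * soft_value


def layer_norm(features, eps=1e-5):
    """Normalize a feature vector to zero mean and (near) unit variance."""
    mean = sum(features) / len(features)
    var = sum((v - mean) ** 2 for v in features) / len(features)
    return [(v - mean) / math.sqrt(var + eps) for v in features]


# Made-up transition: the two critics disagree about the next state.
min_target = soft_td_target(1.0, 0.0, 10.0, 12.0, -1.0, aggregate="min")
mean_target = soft_td_target(1.0, 0.0, 10.0, 12.0, -1.0, aggregate="mean")
normalized = layer_norm([1.0, 2.0, 3.0])
```

On this transition the min-Q target is about 11.10 and the mean-Q target about 12.09. The mean is less pessimistic, so in this sketch the job of keeping critic values bounded falls to normalization rather than to the minimum.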

PPO still evaluated higher after the same 800M steps, but the reward was tuned for PPO from the start. At single-environment scale SAC was actually the stronger of the two; the point of MeanSAC was to keep that working once I scaled up.

Vanilla SAC (red) collapses mid-training around 200M steps and never fully recovers. MeanSAC (purple) holds a higher, stable return and episode length across 800M steps.

For training plots and full results, see the project website. The full report is available here.

References

[1] X. B. Peng, P. Abbeel, S. Levine, M. van de Panne. “DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills.” ACM Transactions on Graphics, 2018. doi:10.1145/3197517.3201311
[2] Q. Liao et al. “BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion.” 2025. arXiv:2508.08241
[3] Lvhaidong. “LAFAN1 Retargeting Dataset.” Hugging Face, 2024. link
[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov. “Proximal Policy Optimization Algorithms.” 2017. arXiv:1707.06347
[5] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” 2018. arXiv:1801.01290
[6] Y. Seo, C. Sferrazza, J. Chen, G. Shi, R. Duan, P. Abbeel. “Learning Sim-to-Real Humanoid Locomotion in 15 Minutes.” 2025. arXiv:2512.01996
[7] Y. Seo, C. Sferrazza, H. Geng, M. Nauman, Z.-H. Yin, P. Abbeel. “FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control.” 2025. arXiv:2505.22642