ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

He, Xialin; Xu, Sirui; Li, Xinyao; Dong, Runpei; Bian, Liuyu; Wang, Yu-Xiong; Gui, Liang-Yan

Xialin He^*, Sirui Xu^*, Xinyao Li, Runpei Dong,
Liuyu Bian, Yu-Xiong Wang^†, Liang-Yan Gui^†

University of Illinois Urbana-Champaign
^*Equal Contribution ^†Equal Advising

ULTRA is an all-in-one controller for humanoid loco-manipulation: track when references exist; act from egocentric perception and sparse intent when they don't.

Abstract

Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.

General Interaction Tracking × Mocap

Long-Horizon Goal Following × Mocap

Fine-Grained Keyboard Control × Mocap

Long-Horizon Goal Following × Egocentric Depth

Sim2Sim Transfer × Egocentric Depth

Blue points represent the egocentric point cloud, green points denote the noise-perturbed point cloud, and yellow points indicate the object goal positions.
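As a rough illustration of what a noise-perturbed point cloud could look like, the sketch below jitters each point with Gaussian noise and randomly drops some points. The function name `perturb_point_cloud`, the noise model, and the values of `sigma` and `drop_prob` are all assumptions made for this example; the page does not state how the perturbation is actually generated.

```python
import random


def perturb_point_cloud(points, sigma=0.01, drop_prob=0.1, seed=0):
    """Return a noisy copy of `points`, a list of (x, y, z) tuples in meters.

    Hypothetical noise model (not taken from the paper): independent
    Gaussian jitter per coordinate plus random point dropout.
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible
    noisy = []
    for x, y, z in points:
        if rng.random() < drop_prob:
            continue  # simulate a missing depth return
        noisy.append((
            x + rng.gauss(0.0, sigma),
            y + rng.gauss(0.0, sigma),
            z + rng.gauss(0.0, sigma),
        ))
    return noisy


# Toy "egocentric" cloud: 100 points along a line, 0.5 m above the ground.
clean = [(0.1 * i, 0.0, 0.5) for i in range(100)]
noisy = perturb_point_cloud(clean)
```

Setting `sigma=0.0` and `drop_prob=0.0` returns the input unchanged, which is a convenient sanity check when comparing a policy's behavior on clean and perturbed inputs.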

Our Data Engine: Physics-Driven Neural Retargeting

Retargeted Motion Library

Diverse loco-manipulation corpus generated by one retargeting policy that transfers human motion capture to the humanoid embodiment.

Scalable Retargeting

Neural retargeting generalizes to unseen scales for both objects and trajectories. Once trained, the policy can produce additional retargeted data at new scales without further training.
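To make the idea of varying object and trajectory scale concrete, here is a minimal sketch of how scaled inputs might be prepared before being handed to a retargeting policy. The helpers `scale_about`, `centroid`, and `make_scaled_variant` are hypothetical names for this example, and the choice of pivots (object centroid, first waypoint) is an assumption; the retargeting policy itself is not shown.

```python
def scale_about(points, pivot, factor):
    """Scale each 3D point about `pivot` by `factor`."""
    return [
        tuple(p + factor * (c - p) for c, p in zip(point, pivot))
        for point in points
    ]


def centroid(points):
    """Mean position of a non-empty list of 3D points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def make_scaled_variant(object_vertices, trajectory, object_scale, trajectory_scale):
    """Build one scaled (object, trajectory) pair.

    Assumed convention for this sketch: the object is scaled about its
    centroid, and the trajectory is scaled about its first waypoint so
    the start position stays fixed.
    """
    scaled_object = scale_about(object_vertices, centroid(object_vertices), object_scale)
    scaled_trajectory = scale_about(trajectory, trajectory[0], trajectory_scale)
    return scaled_object, scaled_trajectory


# Toy inputs: a 0.2 m square "object" and a three-waypoint path.
box = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.2, 0.2, 0.0), (0.0, 0.2, 0.0)]
path = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
big_box, long_path = make_scaled_variant(box, path, object_scale=1.5, trajectory_scale=2.0)
```

Each scaled pair would then serve as a new input to the trained retargeting policy, which is responsible for producing a physically plausible humanoid motion for it.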

BibTeX

@article{he2026ultra,
      title={ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation},
      author={He, Xialin and Xu, Sirui and Li, Xinyao and Dong, Runpei and Bian, Liuyu and Wang, Yu-Xiong and Gui, Liang-Yan},
      journal={arXiv preprint arXiv:2603.03279},
      year={2026}
}