NaVILA: Legged Robot Vision-Language-Action Model for Navigation

UC San Diego, USC, NVIDIA
*Equal Contribution, ordered alphabetically
Equal Advising

TL;DR

NaVILA is a two-level framework that combines VLAs with locomotion skills for navigation. The VLA generates high-level, language-based navigation commands, while a real-time locomotion policy executes them and handles obstacle avoidance.
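A minimal sketch of this two-level loop is shown below. The class and method names (planner.act, policy.step, parse_action, robot.get_images, etc.) are illustrative placeholders under assumed interfaces, not the released NaVILA API.

import time

def parse_action(action_text):
    """Map a language action to a (vx, vy, yaw_rate) velocity command.
    Placeholder mapping for illustration only."""
    if "turn right" in action_text:
        return (0.0, 0.0, -0.5)
    if "turn left" in action_text:
        return (0.0, 0.0, 0.5)
    if "move forward" in action_text:
        return (0.5, 0.0, 0.0)
    return (0.0, 0.0, 0.0)  # e.g. "stop"

class TwoLevelNavigator:
    """High-level VLA re-plans at a low rate; low-level policy runs in real time."""
    def __init__(self, planner, locomotion_policy, replan_period_s=1.0):
        self.planner = planner           # VLA: camera frames + instruction -> language action
        self.policy = locomotion_policy  # real-time locomotion controller
        self.replan_period_s = replan_period_s

    def run(self, instruction, robot):
        last_plan_time = 0.0
        velocity_cmd = (0.0, 0.0, 0.0)
        while not robot.task_done():
            now = time.time()
            # High level: occasionally query the VLA with onboard images.
            if now - last_plan_time > self.replan_period_s:
                action_text = self.planner.act(robot.get_images(), instruction)
                velocity_cmd = parse_action(action_text)
                last_plan_time = now
            # Low level: track the velocity command while avoiding obstacles.
            self.policy.step(robot.get_observations(), velocity_cmd)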

🔥 Highlights

  1. Strongly Generalizable VLA Model. We adapt a VLM into a VLA model and train it on diverse datasets, including real-world human video data, simulated indoor navigation data, and question-answering tasks. NaVILA outperforms all methods that do not rely on simulator-pretrained waypoint predictors, even when those methods leverage additional sensors.
  2. Vision-Based Legged Robot Policy. We introduce the first end-to-end vision-based locomotion policy trained without teacher-student distillation. By directly interacting with the environment using LiDAR during training, our approach significantly reduces the sim-to-real gap. Our policy ensures safety in challenging environments, such as near transparent surfaces, and excels at traversing rough terrain (see the interface sketch after this list).
  3. VLN-CE-Isaac Benchmark. We introduce a high-fidelity, physics-realistic benchmark for low-level robotic control. It provides a safe, scalable platform to evaluate navigation across diverse robots and scenarios, reducing real-world testing costs and risks.
  4. Real-World Deployment. NaVILA demonstrates strong performance in challenging real-world environments with quadruped and humanoid robots, showcasing its generalization capabilities and robustness.
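
As referenced in highlight 2, the sketch below shows what the observation/action interface of a LiDAR-driven locomotion policy could look like. The field names and shapes are assumptions for illustration, not the released configuration.

from dataclasses import dataclass
import numpy as np

@dataclass
class LocomotionObservation:
    height_scan: np.ndarray       # e.g. elevation samples derived from LiDAR (assumed)
    proprioception: np.ndarray    # joint positions/velocities, base orientation, etc.
    velocity_command: np.ndarray  # (vx, vy, yaw rate) from the high-level VLA action

@dataclass
class LocomotionAction:
    joint_targets: np.ndarray     # desired joint positions tracked by a PD controller

def policy_step(policy_net, obs: LocomotionObservation) -> LocomotionAction:
    """One control step: concatenate observations and query the policy network."""
    x = np.concatenate([obs.height_scan, obs.proprioception, obs.velocity_command])
    return LocomotionAction(joint_targets=policy_net(x))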

Real-world Results: Unitree Go2

Real-world Results: Unitree G1

Learning from YouTube Human Touring Videos


Results: VLN-CE-Isaac NaVILA-Go2-Vision

Results: VLN-CE-Isaac NaVILA-H1-Vision

Results: VLN-CE-Isaac Vision vs Blind Policy

The vision-based policy achieves a higher Success Rate than the blind policy because it avoids obstacles more effectively.

Results: R2R-CE (Habitat)

Results: RxR-CE (Habitat)

Citation


@article{cheng2024navila,
    title   = {{NaVILA: Legged Robot Vision-Language-Action Model for Navigation}},
    author  = {Cheng, An-Chieh and Ji, Yandong and Yang, Zhaojing and Zou, Xueyan and Kautz, Jan and Biyik, Erdem and Yin, Hongxu and Liu, Sifei and Wang, Xiaolong},
    journal = {arXiv preprint arXiv:2412.04453},
    year    = {2024},
}
  

Acknowledgement

We sincerely thank Chengjing Yuan for their support with hardware setup and 3D modeling. We also thank Xuxin Cheng and Jialong Li for their help in setting up the G1 robot, as well as Jiazhao Zhang and Yukang Chen for their valuable discussions.