Foundation Models
Foundation models in robotics are large, general-purpose models trained on broad datasets spanning multiple robots, tasks, and modalities. When applied to humanoids, they often operate end-to-end: mapping raw inputs such as vision, proprioception, and language directly to control actions, without task-specific modules or manually engineered pipelines.
This architecture enables flexible behavior across diverse environments, with a single model handling perception, planning, and actuation. NVIDIA's GR00T N1, introduced in 2025, and Google DeepMind's RT-2, released in 2023, are examples of vision-language-action foundation models trained end-to-end. These systems map sensory inputs to robot actions, enabling generalist performance across tasks and platforms, with GR00T N1 pushing the approach toward humanoid-scale deployment.
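The end-to-end idea described above — one model mapping multimodal observations straight to an action — can be caricatured as a single forward pass. The sketch below is purely illustrative: the class name, feature dimensions, and random linear encoders are assumptions for demonstration and do not reflect the actual GR00T N1 or RT-2 architectures, which use large pretrained vision and language backbones.

```python
import numpy as np

class ToyVLAPolicy:
    """Toy vision-language-action policy (illustrative only).

    Stands in for an end-to-end model: vision, proprioception, and
    language features are fused into one embedding, which is decoded
    into a continuous action. All shapes and weights are hypothetical.
    """

    def __init__(self, image_dim=64, proprio_dim=12, text_dim=16,
                 action_dim=7, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        # Random linear maps stand in for learned encoder towers.
        self.w_img = rng.normal(size=(image_dim, hidden_dim))
        self.w_pro = rng.normal(size=(proprio_dim, hidden_dim))
        self.w_txt = rng.normal(size=(text_dim, hidden_dim))
        self.w_out = rng.normal(size=(hidden_dim, action_dim))

    def act(self, image_feat, proprio, text_feat):
        # One differentiable path from observation to action:
        # no separate perception, planning, or control modules.
        h = (image_feat @ self.w_img
             + proprio @ self.w_pro
             + text_feat @ self.w_txt)
        h = np.tanh(h)
        # Bounded output, e.g. a normalized joint-command vector.
        return np.tanh(h @ self.w_out)

policy = ToyVLAPolicy()
action = policy.act(np.zeros(64), np.zeros(12), np.zeros(16))
print(action.shape)  # (7,)
```

The point of the sketch is the interface, not the internals: a single `act` call consumes every modality at once, which is what distinguishes this design from a pipeline of hand-built perception, planning, and control stages.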
In humanoids, foundation models are now a core strategy for scaling capability: they compress what once required dozens of hand-built modules into a single adaptive model that learns from interaction.