The race toward general-purpose embodied intelligence is no longer theoretical. As humanoid robots begin stepping out of research labs and into real-world deployment scenarios, the critical enabler has emerged clearly: data—real-world, high-fidelity, multimodal interaction data. And the most urgent question is: who will be the first to collect one million hours of such data?
Two frontrunners are shaping this high-stakes race: Tesla, with its widely publicized humanoid robot Optimus, and Figure AI, a venture-backed robotics company led by Brett Adcock, building the Helix platform. Both aim to build robots that can move, perceive, reason, and act autonomously. But beyond sleek demonstrations and social media snippets, the core challenge is data at scale—and the infrastructure to generate, process, and learn from it.
[Image: Tesla Optimus Gen 2]
[Image: Figure 02]
Figure’s approach is built around high-quality, task-aligned data. In its logistics use cases, Helix has been shown completing real-world tasks such as picking and sorting packages, opening doors, and fetching items. The impressive part: it learned these not from massive simulation runs, but from just 500 hours of teleoperated, annotated demonstrations.
This focused strategy—grounded, multi-modal data with language labels, visual context, and action traces—has enabled Figure’s Helix robot to scale from learning individual tasks to generalizing across object types and goals, a key marker of VLA (Vision-Language-Action) intelligence.
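To make "grounded, multi-modal data" concrete, here is a minimal sketch of what a single vision-language-action training sample could look like. The class and field names (VLASample, instruction, frames, action_trace, outcome) are illustrative assumptions for this article, not Figure's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of one grounded VLA training sample.
# Field names are assumptions for this article, not Figure's actual schema.

@dataclass
class VLASample:
    instruction: str                 # language label, e.g. "place the package in bin 3"
    frames: List[bytes]              # visual context: encoded camera frames
    action_trace: List[List[float]]  # joint/end-effector commands over time
    outcome: str                     # success/failure signal used as supervision

sample = VLASample(
    instruction="pick the small box and place it on the conveyor",
    frames=[b"<jpeg bytes>"],
    action_trace=[[0.12, -0.33, 0.95, 0.0], [0.10, -0.30, 0.97, 1.0]],
    outcome="success",
)
```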
The company now plans to deploy over 100,000 robots across industrial settings in the next four years. With every robot continuously collecting and streaming real-world multimodal data, the path to a million interaction hours becomes not just possible, but inevitable. Furthermore, Figure has recently shown Helix performing long-horizon autonomous tasks, such as handling 60-minute logistics workflows with minimal human intervention, proving that its model is already well into generalization territory.
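A quick back-of-the-envelope calculation shows why fleet scale makes the milestone feel inevitable. The fleet sizes and duty cycle below are assumptions for illustration, not Figure's published figures:

```python
# Back-of-the-envelope: how long does a fleet take to log 1M interaction hours?
# Fleet sizes and hours/day are illustrative assumptions, not published numbers.

TARGET_HOURS = 1_000_000

def days_to_target(num_robots: int, hours_per_robot_per_day: float) -> float:
    return TARGET_HOURS / (num_robots * hours_per_robot_per_day)

for fleet in (500, 5_000, 100_000):
    print(f"{fleet:>7} robots at 8 h/day -> {days_to_target(fleet, 8):7.1f} days")

# Output:
#     500 robots at 8 h/day ->   250.0 days
#    5000 robots at 8 h/day ->    25.0 days
#  100000 robots at 8 h/day ->     1.2 days
```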
Tesla’s strategy with Optimus is characteristically ambitious: leveraging its vertically integrated factory ecosystem and motion-capture infrastructure. The company has hired over 50 full-time mocap operators, recording thousands of hours of human motions to train Optimus in locomotion, grasping, and manipulation.
Optimus Gen 2, Tesla’s latest prototype, shows meaningful improvements—11 degrees of freedom in each hand, improved balance, lighter weight, and faster walking speeds. Elon Musk claims Optimus will be deployed in Tesla’s Gigafactories as soon as late 2025.
But progress has been slowed by engineering and production issues. Reports indicate delays due to overheating joints and actuator reliability problems. Moreover, while the volume of mocap data is high, its semantic richness is low: human motion without context lacks the supervision needed for generalizable models. For instance, a grasping motion lacks meaning unless paired with object labels, goals, and feedback signals.
Tesla may solve this by combining mocap with synthetic data or vision-language alignment, but for now its approach remains primarily kinematic imitation, not cognitive learning.
Reaching a million hours of useful data requires more than just scale:
Diversity of tasks: Real-world data must span multiple domains—logistics, household chores, assistance, and human collaboration.
Contextual alignment: It must include visual frames, language instructions or queries, action sequences, and outcomes.
Temporal richness: Long-horizon task execution, failure-recovery, decision branching, and memory are all crucial.
Scalable annotation: Manual annotation is not sustainable; models must self-label or utilize weak supervision.
Edge deployment: Robots in the field must be able to learn online, sync experiences, and refine shared models (a minimal sketch of such a fleet-wide sync loop follows this list).
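That last point is, in spirit, a federated-learning problem. The sketch below is a simplified illustration under that assumption, not any vendor's actual pipeline: each robot refines a local copy of the shared policy weights on its own experience, and a periodic sync averages those copies back into the shared model.

```python
import numpy as np

# Simplified federated-averaging loop: each robot updates a local copy of the
# shared policy weights from its own experience, and the server averages the
# copies back into the shared model. Purely illustrative, not a real robot stack.

def local_update(weights: np.ndarray, gradient: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """One on-robot learning step from locally collected experience."""
    return weights - lr * gradient

def federated_average(local_weights: list) -> np.ndarray:
    """Server-side sync: merge the fleet's local models into one shared model."""
    return np.mean(local_weights, axis=0)

shared = np.zeros(4)          # shared policy weights
rng = np.random.default_rng(0)

for sync_round in range(3):
    # Each robot computes an update from its own (here: random stand-in) experience.
    fleet_updates = [local_update(shared, rng.normal(size=4)) for _ in range(10)]
    shared = federated_average(fleet_updates)
    print(f"round {sync_round}: shared weights = {np.round(shared, 3)}")
```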
Figure’s strategy aligns more closely with this framework, while Tesla may eventually catch up if it enhances its supervision layers.
Based on current progress:
Figure AI is on track to achieve the first million hours of structured, grounded, VLA-compatible data within 2–3 years. Its robots are already generalizing, its data loop is tight, and its deployments are beginning to scale.
Tesla has unparalleled scale potential, but its approach is bottlenecked by hardware readiness and the lack of integrated semantic context in the data.
Unless Tesla dramatically shifts toward a vision-language-action training loop—incorporating language supervision, reward functions, and richer real-world deployments—Figure may reach the GPT moment of robotics first.
Just as GPT’s ascent was catalyzed by trillions of internet tokens, the first company to gather and train on millions of real-world interaction hours will have a first-mover advantage in:
Building truly generalist robot assistants
Creating a proprietary data moat
Setting platform standards for robot APIs and tasks
Licensing AI brain models to other hardware OEMs
In other words, data is destiny. And in the new era of embodied AI, whoever owns the most meaningful data wins.
In the race for embodied intelligence, the winner won't be the one who collects the most footage—but the one who understands the world best through experience, feedback, and adaptation.
2025/06/30