D2E Framework: Unlocking Embodied AI with Scalable Desktop Game Pretraining
In a breakthrough for robotics and embodied artificial intelligence, researchers have introduced the D2E (Desktop to Embodied AI) framework, demonstrating that pretraining on large-scale desktop game interactions can dramatically accelerate learning for real-world physical tasks. This novel approach overcomes the fundamental bottleneck in embodied AI—the prohibitive cost and difficulty of collecting physical robot data—by leveraging the rich, structured sensorimotor experiences available in digital environments. The research, detailed in a new paper, shows that models pretrained on this "desktop" data can match or exceed the performance of models up to seven times larger on standard robotic manipulation and navigation benchmarks.
The Scalability Problem in Embodied AI
While large language models thrive on internet-scale text data, embodied AI systems—which require learning through physical interaction—have been severely constrained. Collecting real-world robotic trajectory data is slow, expensive, and lacks diversity. Prior attempts to use virtual environments, like VPT for Minecraft or DeepMind's proprietary SIMA project, have either remained confined to their original game domain or kept their data and methods closed-source. The D2E framework directly addresses this gap by establishing an open, complete pipeline from scalable desktop data collection to verified transfer in embodied domains.
Core Components of the D2E Framework
The framework's success hinges on three interconnected components designed to harness and transfer knowledge from digital to physical spaces.
1. The OWA Toolkit: Unifying and Compressing Desktop Data
The first challenge is standardizing the chaotic input from diverse games and applications. The OWA (Observation, Windows, Action) Toolkit solves this by unifying desktop interactions—including mouse movements, keyboard presses, and screen pixels—into a consistent, structured format. Critically, it achieves a 152x compression ratio on the raw data, making internet-scale collection and processing computationally feasible.
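To make the idea concrete, here is a minimal sketch of what a unified desktop-event record might look like. The field names and schema are illustrative assumptions, not the OWA Toolkit's actual format; the point is that heterogeneous inputs (mouse, keyboard, screen) collapse into one timestamped, compressible stream.

```python
# Hypothetical unified desktop-event schema (illustrative, not OWA's real one).
from dataclasses import dataclass, asdict
from typing import Literal, Optional, Tuple
import json

@dataclass
class DesktopEvent:
    t_ns: int                              # event timestamp in nanoseconds
    kind: Literal["mouse_move", "mouse_click", "key", "screen"]
    pos: Optional[Tuple[int, int]] = None  # cursor position, if applicable
    key: Optional[str] = None              # key symbol, if applicable
    frame_ref: Optional[str] = None        # reference into a compressed video stream

def encode(events):
    """Serialize an event stream to compact JSON lines."""
    return "\n".join(json.dumps(asdict(e)) for e in events)

events = [
    DesktopEvent(t_ns=1_000_000, kind="mouse_move", pos=(640, 360)),
    DesktopEvent(t_ns=1_016_000, kind="key", key="w"),
]
print(encode(events))
```

Storing sparse events plus references into an already-compressed video stream, rather than raw per-frame pixels and input state, is one plausible route to the large compression ratios the paper reports.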
2. Generalist-IDM: Enabling Zero-Shot Game Mastery
At the heart of the learning system is the Generalist Inverse Dynamics Model (Generalist-IDM). This model is trained to predict the action between two sequential screen observations. Its key innovation is timestamp-based event prediction, which allows it to achieve strong zero-shot generalization to completely unseen games. This capability is the engine for scalable, automated pseudo-labeling, where the model can generate its own training data from vast amounts of unlabeled gameplay footage.
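The pseudo-labeling loop described above can be sketched as follows. The `predict_action` interface and the `DummyIDM` stand-in are assumptions for illustration; the real Generalist-IDM predicts timestamped input events from screen observations.

```python
# Sketch of IDM-driven pseudo-labeling (interface names are illustrative).
def pseudo_label(idm, frames, timestamps):
    """Turn an unlabeled frame sequence into (observation, action, timestamp) tuples."""
    labeled = []
    for i in range(len(frames) - 1):
        # The IDM infers which action occurred between consecutive observations.
        action = idm.predict_action(frames[i], frames[i + 1])
        labeled.append((frames[i], action, timestamps[i + 1]))
    return labeled

class DummyIDM:
    """Placeholder: a real IDM would infer mouse/keyboard events from pixels."""
    def predict_action(self, obs_a, obs_b):
        return "noop" if obs_a == obs_b else "move"

frames = ["f0", "f0", "f1"]
labels = pseudo_label(DummyIDM(), frames, [0, 16, 33])
print(labels)  # [('f0', 'noop', 16), ('f0', 'move', 33)]
```

Run over vast unlabeled gameplay footage, this loop is what converts raw video into training data at scale, supplying the bulk of the 1,300-hour corpus described below.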
3. VAPT: Transferring Skills to the Physical World
The final piece is the Visual-Action Pretraining and Transfer (VAPT) method. This component is responsible for successfully adapting the visual and motor primitives learned from screen pixels and mouse clicks to control physical robot arms (manipulation) and agents in 3D spaces (navigation). It validates the core hypothesis that foundational sensorimotor skills are transferable across domains.
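A common recipe for this kind of transfer, and a plausible reading of VAPT, is to keep the pretrained visual-motor backbone and swap the desktop action head for one matching the robot's action space. The sketch below is a conceptual illustration under that assumption; class and method names are invented, not the paper's API.

```python
# Conceptual pretrain-then-transfer sketch (all names are illustrative).
class Backbone:
    """Stands in for the visual-motor encoder pretrained on desktop data."""
    def encode(self, observation):
        return [float(len(observation))]   # toy feature vector

class ActionHead:
    """Task-specific output layer sized to the target action space."""
    def __init__(self, action_dim):
        self.action_dim = action_dim
    def __call__(self, features):
        return [0.0] * self.action_dim     # toy prediction

def transfer(pretrained_backbone, robot_action_dim):
    """Reuse pretrained features; only the action head is retrained per task."""
    head = ActionHead(robot_action_dim)
    def policy(observation):
        return head(pretrained_backbone.encode(observation))
    return policy

policy = transfer(Backbone(), robot_action_dim=7)  # e.g. a 7-DoF arm
print(policy("camera frame"))
```

The design point is that the expensive part, the backbone's sensorimotor representation, is learned once from cheap desktop data, while only the lightweight head must be adapted with scarce robot data.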
Breakthrough Performance with a Leaner Model
The researchers trained a 1-billion-parameter model using a dataset of over 1,300 hours of desktop interaction. This comprised 259 hours of human demonstrations and over 1,000 hours of gameplay pseudo-labeled by the Generalist-IDM. The results were striking. On the LIBERO benchmark for long-horizon manipulation, the D2E model achieved a 96.6% success rate. On the CANVAS visual navigation benchmark, it reached 83.3% success.
These scores are not just high; they are competitive with models of far greater scale. The D2E model's performance matches or surpasses that of π₀ (3.3B parameters) and OpenVLA (7B parameters), demonstrating a massive leap in parameter efficiency. This is strong evidence that the quality and structure of pretraining data can matter more than simply scaling model size.
Why This Matters for the Future of AI and Robotics
The implications of the D2E framework extend far beyond an academic benchmark. It establishes a practical, scalable, and open-source pathway for advancing embodied intelligence.
- Democratizes Embodied AI Research: By using desktop data instead of costly robots, it lowers the barrier to entry for a much wider range of researchers and institutions.
- Unlocks Internet-Scale Training: It opens the door to training on petabytes of existing gameplay and desktop interaction videos, creating a data source as vast as the text web.
- Accelerates Robot Learning: This paradigm can drastically reduce the time and cost required to train robots for complex tasks in homes, warehouses, and other real-world settings.
- Validates a New AI Paradigm: It provides strong evidence that foundational "embodied" skills can be learned in digital worlds and effectively transferred to physical ones, bridging a major gap in AI development.
The team has made all resources, including code, models, and the OWA Toolkit, publicly available, inviting the broader community to build upon this new paradigm. The D2E framework marks a significant step toward embodied AI systems that can learn as flexibly and scalably as their language model counterparts.