Understanding How NVIDIA Cosmos Policy Works | Paper: Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

These are my reading notes on the NVIDIA Cosmos Policy paper, "Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning."

Introduction

Physical AI is really gaining momentum.

Here's the paper I'll be going through:

[2601.16163] Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

The gist: by fine-tuning NVIDIA's video generation model (a World Model), Cosmos Predict, into a Policy Model, the authors achieved state-of-the-art performance on robotics benchmarks.

I'll be summarizing my notes as I read through the paper.

Note: All figures in this article are cited from the paper above.

Note: This article was translated from my original post.

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Background

The core motivation here is to effectively leverage the spatiotemporal understanding of video generation models for robot control.

  • Recent video generation models (Cosmos, Wan2.1, etc.) have learned how the world changes well enough to generate temporally consistent videos that follow physical laws (World Models).
  • Meanwhile, VLA (Vision-Language-Action) models — the current mainstream approach in robot control — are pre-trained on still images and text, giving them only a limited understanding of temporal physical dynamics.
  • Prior work has attempted to apply video models to robotics, but faced several challenges:
    • Multi-stage training pipelines were required, such as video fine-tuning followed by separate action module training.
    • Additional architectural components were needed, like action diffusers or inverse dynamics models.
    • When building unified models from scratch, they couldn't benefit from pre-trained video models.
  • Cosmos Policy addresses all of these by converting a video model into a robot policy through a single fine-tuning step with no architectural changes.

Overview of Cosmos Policy. It takes the current state as input and simultaneously outputs action chunks, future states, and values. There are no architectural changes from the base model (Cosmos Predict).

Method

Latent Frame Injection

Video models natively generate image sequences by denoising them in latent space.

In Cosmos Policy, non-image data (robot state, action chunks, state values) is directly injected into this latent frame sequence as latent frames. Numerical data is normalized and replicated to match the shape of latent frames, then inserted — allowing the model to reuse the video model's diffusion training framework as-is.

How Latent Frame Injection works. After tokenizing the image sequence with a VAE, additional modalities — robot state, action chunks, and state values — are injected as latent frames.
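
To make this concrete, here is a minimal sketch of the replicate-and-reshape step. The latent dimensions and the state/action vector sizes are made-up assumptions for illustration; the actual latent frame shape is defined by Cosmos' video tokenizer (VAE).

```python
import torch

# Hypothetical latent-frame shape for illustration only; the real shape
# is defined by the video tokenizer (VAE), not these numbers.
C, H, W = 16, 32, 32  # channels, height, width of one latent frame

def inject_as_latent_frame(values: torch.Tensor) -> torch.Tensor:
    """Pack a flat vector of non-image data (robot state, action chunk,
    or a scalar value), assumed already normalized, into the shape of
    one latent frame by replicating it until the frame is full."""
    n = values.numel()
    reps = -(-(C * H * W) // n)  # ceiling division
    return values.repeat(reps)[: C * H * W].reshape(C, H, W)

# Example: append a state frame and an action-chunk frame to the image
# latents, so the diffusion model denoises them with the same machinery
# it uses for video frames. The 14-D state and 8x7 action chunk are
# made-up sizes for illustration.
image_latents = torch.randn(4, C, H, W)                    # 4 video latent frames
state_frame = inject_as_latent_frame(torch.randn(14))      # robot state
action_frame = inject_as_latent_frame(torch.randn(8 * 7))  # action chunk
sequence = torch.cat([image_latents, state_frame[None], action_frame[None]], dim=0)
```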

Joint Training of Policy, World Model, and Value Function

A single model learns three functions simultaneously.

50% of each batch is used for Policy learning p(a, s', V(s')|s), 25% for World Model learning p(s', V(s')|s, a), and 25% for Value Function learning p(V(s')|s, a, s').

By switching the conditioning scheme — i.e., which latent frames serve as conditions and which as generation targets — three different learning objectives are achieved within the same architecture.

Batch training in Cosmos Policy
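
A rough sketch of how this objective mixing could look in code. Only the 50/25/25 split and the conditioning sets come from the paper; the objective names and the mask dictionary are my own hypothetical framing.

```python
import random

POLICY, WORLD_MODEL, VALUE = "policy", "world_model", "value"

def sample_objective() -> str:
    """Pick the training objective for one batch element (50/25/25)."""
    r = random.random()
    if r < 0.50:
        return POLICY       # p(a, s', V(s') | s)
    elif r < 0.75:
        return WORLD_MODEL  # p(s', V(s') | s, a)
    else:
        return VALUE        # p(V(s') | s, a, s')

def conditioning_mask(objective: str) -> dict:
    """True = latent frame kept clean as a condition;
    False = latent frame noised and used as a generation target."""
    if objective == POLICY:
        return {"s": True, "a": False, "s_next": False, "value": False}
    if objective == WORLD_MODEL:
        return {"s": True, "a": True, "s_next": False, "value": False}
    return {"s": True, "a": True, "s_next": True, "value": False}
```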

Model-Based Planning

Cosmos Policy can be used both as a direct policy and as a policy with planning.

During planning, Best-of-N sampling is used (see the sketch after this list):

  1. Generate N action candidates from the Policy
  2. Predict the resulting state for each candidate using the World Model
  3. Score the future states with the Value Function
  4. Execute the highest-scoring action
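
Here is a minimal sketch of that loop. The method names (`sample_action`, `predict_next_state`, `predict_value`) are hypothetical placeholders, not the paper's API, and the two-checkpoint split reflects the dual-model setup explained just below.

```python
# Best-of-N planning sketch. `policy_model` is the demonstration-trained
# checkpoint; `planning_model` is the rollout-fine-tuned one (see the
# dual-model setup below).
def best_of_n_plan(policy_model, planning_model, state, n: int = 8):
    best_action, best_score = None, float("-inf")
    for _ in range(n):
        # 1. Generate an action candidate from the Policy.
        action = policy_model.sample_action(state)
        # 2. Predict the resulting state with the World Model.
        next_state = planning_model.predict_next_state(state, action)
        # 3. Score the predicted future state with the Value Function.
        score = planning_model.predict_value(state, action, next_state)
        # 4. Keep the highest-scoring candidate for execution.
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```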

However, since demonstration data contains only successful examples, the model struggles to accurately predict the outcomes of suboptimal actions. To improve prediction quality, the World Model and Value Function are additionally trained on rollout data (data collected by actually running the policy, which includes both successes and failures).

The original checkpoint is used as the policy model, while the additionally trained checkpoint serves as the planning model — a dual-model setup.

  • Policy model: Trained extensively on demonstration data, it generates high-quality actions.
  • Planning model: Having seen both successes and failures, it predicts future states and values.

Comparison of World Model predictions. The base Cosmos Policy, trained only on demonstration data, cannot correctly predict failure states (e.g., losing grip on a Ziploc bag) (top row). After fine-tuning on rollout data, it predicts actual future states much more accurately, enabling effective planning (bottom row).

Results

Evaluation was conducted on three benchmarks:

  • LIBERO (simulation): Single-arm robot, four task suites, trained with 50 demonstrations each
  • RoboCasa (simulation): 24 kitchen tasks, trained with 50 demonstrations each
  • ALOHA (real robot): Bimanual robot, four tasks, trained with a total of 185 demonstrations

Let's look at each result in turn.

↑ LIBERO results.

Cosmos Policy achieved a 98.5% average success rate across the four task suites, outperforming all methods — including VLA models such as π0.5 (96.9%), CogVLA (97.4%), and OpenVLA-OFT (97.1%) — to set a new SOTA. On LIBERO-Long in particular, it reached 97.6%, a significant improvement over the previous best of 95.4%.

↑ RoboCasa results.

Cosmos Policy achieved SOTA with an average success rate of 67.1%. Notably, while other top methods (Video Policy, FLARE, GR00T-N1.5, etc.) used 300 demonstrations per task, Cosmos Policy surpassed them with only 50 — demonstrating remarkable data efficiency.

↑ ALOHA results.

Cosmos Policy achieved the highest average score of 93.8 across all methods. It recorded the top score on 3 out of 4 tasks, with particularly strong performance on high-precision, high-diversity tasks like "putting candy into a bowl" and "putting candy into a Ziploc bag."

↑ Ablation results on LIBERO.

Removing the auxiliary losses (joint prediction of future states and values) reduced the success rate by 1.5 points, and using random initialization instead of a pre-trained model reduced it by 3.9 points — confirming the importance of both components.

↑ Measuring the effect of planning on two difficult ALOHA tasks.

Model-Based Planning (V(s')) achieved the best performance with an average score improvement of 12.5 points. Model-Free Planning (Q(s,a): no state prediction) struggled to learn the Q-function with limited rollout data, falling short of the Model-Based approach.

Conclusion

That wraps up my brief summary notes on the Cosmos Policy paper.

The idea of taking the output of a World Model — one that generates video grounded in physical laws — and directly turning it into robot actions really feels like the dawn of an era where AI truly understands the world and acts on it. Exciting stuff.

I'm also curious about how other high-performance models work, like Physical Intelligence's π-series and OpenVLA.
