CVPR 2026

Chain of World: World Model Thinking in Latent Motion

CoWVLA teaches a VLA model to reason through a compact chain of latent motion, preserving world-model temporal structure without spending capacity on redundant background reconstruction.

Fuxiang Yang1,2, Donglin Di2, Lulu Tang3,6, Xuancheng Zhang2, Lei Fan4, Hao Li2, Wei Chen2, Tonghua Su1,5, Baorui Ma2

1Harbin Institute of Technology 2Li Auto 3BAAI 4UNSW 5Chongqing Research Institute of HIT 6Peking University

Overview

Abstract

Vision-language-action models benefit from temporal prediction, but explicit future-frame modeling often wastes capacity on static background reconstruction. Pure latent-action approaches are compact, yet they typically model only short transitions and miss richer world dynamics.

CoWVLA bridges these two lines. It first disentangles structure and motion with a pretrained video VAE, then pretrains a VLA decoder to infer a continuous latent motion chain from language plus the first frame, and finally co-fine-tunes latent dynamics with sparse keyframes and FAST action tokens in one autoregressive decoder.

The result is a dynamics-aware VLA that keeps the world-model benefits of temporal reasoning while remaining compact, interpretable, and effective on robotic manipulation.

Key Idea

Reason over latent motion chains instead of reconstructing every future background pixel.

Pretraining Signal

Instruction + first frame supervise both terminal-frame prediction and continuous motion latent prediction.

Action Alignment

Co-fine-tuning ties sparse visual checkpoints and discretized action chunks to one shared dynamics token.

Method

Three-stage training pipeline

Overview of the CoWVLA framework.

CoWVLA combines a latent motion extractor with a unified VLA decoder. The model reasons with a dedicated motion query token and aligns latent dynamics with action generation.

01

Disentangle structure and motion

A pretrained VidTwin-based video VAE decomposes each clip into a structure latent and a motion latent, isolating dynamic information from static scene content.
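As a rough intuition for this decomposition (a toy numpy sketch, not VidTwin's actual encoder): treat the temporal mean of a clip as its static "structure" and the per-frame residuals as its "motion".

```python
import numpy as np

def disentangle(clip):
    """Toy structure/motion factorization (illustration only, not VidTwin).

    clip: array of shape (T, H, W) -- grayscale frames.
    Returns a static 'structure' latent (the temporal mean) and a
    per-frame 'motion' latent (residuals around that mean).
    """
    structure = clip.mean(axis=0)      # static scene content
    motion = clip - structure[None]    # dynamic information per frame
    return structure, motion

def reconstruct(structure, motion):
    """Exact inverse of the toy factorization above."""
    return structure[None] + motion

# A synthetic clip: a static feature plus a moving bright pixel.
T, H, W = 4, 8, 8
clip = np.zeros((T, H, W))
clip[:, 2, 2] = 1.0          # static feature -> ends up in structure
for t in range(T):
    clip[t, 0, t] = 5.0      # moving feature -> ends up in motion

structure, motion = disentangle(clip)
recon = reconstruct(structure, motion)
assert np.allclose(recon, clip)   # lossless round trip
```

The real VAE learns this split rather than computing it in closed form, but the same invariant holds: structure carries what is shared across frames, motion carries what changes.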

02

Pretrain to think in latent motion

From an instruction and the initial frame, the decoder predicts the terminal frame while inferring a continuous latent motion chain through one learnable query token.
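The stage-2 input layout can be sketched as follows; token names such as `<MOTION_QUERY>` are illustrative assumptions, not the paper's actual vocabulary:

```python
def build_pretraining_sequence(instr_tokens, frame_tokens, chain_len):
    """Hypothetical stage-2 layout (names are assumptions, not the paper's).

    The decoder sees the instruction, the first-frame tokens, and one
    learnable <MOTION_QUERY> token; its output at the query position is
    read out as a chain of `chain_len` continuous motion latents, with the
    terminal frame as the final prediction target.
    """
    seq = list(instr_tokens) + list(frame_tokens) + ["<MOTION_QUERY>"]
    targets = [f"<motion_{k}>" for k in range(chain_len)] + ["<TERMINAL_FRAME>"]
    return seq, targets

seq, targets = build_pretraining_sequence(
    ["pick", "up", "the", "mug"],       # language instruction
    ["<f0_0>", "<f0_1>"],               # first-frame visual tokens
    chain_len=3)
# seq ends with the single query token; targets hold the motion chain.
```

The point of the single query token is capacity: the decoder compresses the whole trajectory's dynamics into one position instead of regenerating every future frame.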

03

Co-fine-tune actions with sparse keyframes

Alternating keyframes and FAST action chunks keep long-horizon dynamics explicit, aligning latent motion reasoning with stable action prediction.
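A minimal sketch of the interleaved stage-3 sequence, under the assumption that keyframe tokens and FAST action tokens simply alternate in one autoregressive stream (token names are hypothetical):

```python
def build_cofinetune_sequence(keyframes, action_chunks):
    """Illustrative stage-3 interleaving (token names are assumptions).

    Sparse visual keyframes alternate with discretized (FAST-style) action
    chunks, so long-horizon dynamics stay explicit in a single autoregressive
    stream while action prediction is aligned to the same dynamics tokens.
    """
    assert len(keyframes) == len(action_chunks)
    seq = []
    for kf, chunk in zip(keyframes, action_chunks):
        seq.extend(kf)       # visual checkpoint
        seq.extend(chunk)    # action tokens leading to the next checkpoint
    return seq

seq = build_cofinetune_sequence(
    keyframes=[["<kf0>"], ["<kf1>"]],
    action_chunks=[["<a0>", "<a1>"], ["<a2>", "<a3>"]])
# seq == ["<kf0>", "<a0>", "<a1>", "<kf1>", "<a2>", "<a3>"]
```

Keeping the keyframes sparse is what separates this from full future-frame modeling: only a few visual anchors are reconstructed, and the action tokens carry the fine-grained dynamics between them.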

Results

State-of-the-art across simulation benchmarks

LIBERO 95.6%

Best average success rate, improving over UniVLA by 0.6 points.

SimplerEnv-WidowX 76.0%

Best average success rate, outperforming FlowVLA by 2.0 points.

Cross-domain Stability Balanced

Strong performance on both world-model-heavy and real-data-transfer settings.

Model     Paradigm                       LIBERO Avg. (%)   SimplerEnv Avg. (%)
OpenVLA   Direct action                  76.5              1.0
villa-X   Latent action                  90.1              62.5
TLA       Latent action                  95.2              48.0
UniVLA    World model                    95.0              68.7
FlowVLA   World model                    88.1              74.0
CoWVLA    World model + latent motion    95.6              76.0
Visualization of disentangled structure and motion latents.

Motion-only and structure-only reconstructions show that CoWVLA isolates dynamic cues while preserving global layout.

Comparison of future-frame prediction strategies.

Compared with world-model or single-goal prediction baselines, CoWVLA produces more instruction-aligned future dynamics.

Cross reconstruction visualization on LIBERO.

Cross reconstruction highlights motion-affected regions, showing clean separation between static structure and robot motion.

Real robot and camera setup.

Real-world deployment setup used for robot data collection and live VLA evaluation.

Supplementary Demos

Latent motion, real-world data, and deployment videos

1. Latent_Motion_Extractor_cross_recon

This group contains video VAE cross-reconstruction results. Each video shows three horizontally concatenated clips: a structure-latent source on the left, a motion-latent source in the middle, and the reconstructed result on the right.
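In the same toy spirit (illustrative numpy only, not the paper's VAE), cross reconstruction can be imitated by pairing one clip's temporal mean with another clip's residual motion:

```python
import numpy as np

def toy_cross_reconstruct(clip_a, clip_b):
    """Toy cross-reconstruction (illustration, not the paper's VAE):
    take the 'structure' (temporal mean) from clip A and the 'motion'
    (zero-mean per-frame residual) from clip B, then recombine them."""
    structure_a = clip_a.mean(axis=0)
    motion_b = clip_b - clip_b.mean(axis=0)[None]
    return structure_a[None] + motion_b

rng = np.random.default_rng(0)
clip_a = rng.normal(size=(4, 8, 8))   # structure-latent source
clip_b = rng.normal(size=(4, 8, 8))   # motion-latent source
cross = toy_cross_reconstruct(clip_a, clip_b)

# Static content comes from A, dynamics from B: the temporal mean of the
# cross-reconstruction equals A's structure exactly.
assert np.allclose(cross.mean(axis=0), clip_a.mean(axis=0))
```

This is what the demo videos visualize: the background layout follows the structure source while only the motion-affected regions follow the motion source.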

2. real_robot / collected_data

This group contains two episodes from real-world data collection, recording the robot's observations during actual task execution.

3. real_robot / realtime_deploy

This group contains recordings from the real-time deployment of the VLA model controlling the robot arm.

3.1 cameraview_phone

Side phone recordings of the robot arm.

3.2 cameraview_realscene

Real-scene camera view used as input to the cloud-hosted VLA model. Each frame is sent to the model, which predicts a chunk of 10 actions; the robot executes the full chunk before requesting the next frame, which gives these videos a choppy appearance.

3.3 merged

Side-by-side merged videos of the phone view and the real-scene camera view.
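The request/predict/execute cycle described in 3.2 can be sketched as a simple control loop; the `camera` and `policy` interfaces below are hypothetical stubs, not the actual deployment code:

```python
def run_episode(camera, policy, max_steps=100, chunk_len=10):
    """Minimal sketch of the deployment loop (hypothetical interfaces).

    One camera frame is sent to the cloud model, which returns a chunk of
    `chunk_len` actions; the whole chunk is executed before the next frame
    is requested -- hence the choppy look of the recordings.
    """
    executed = []
    steps = 0
    while steps < max_steps:
        frame = camera.capture()                      # real-scene camera view
        actions = policy.predict(frame)[:chunk_len]   # one action chunk
        for a in actions:                             # execute whole chunk
            executed.append(a)
            steps += 1
            if steps >= max_steps:
                break
    return executed

# Tiny stubs to exercise the loop.
class Camera:
    def capture(self):
        return "frame"

class Policy:
    def predict(self, frame):
        return list(range(10))

log = run_episode(Camera(), Policy(), max_steps=25)
assert len(log) == 25   # 10 + 10 + 5 actions across three chunks
```

An asynchronous variant that overlaps inference with execution would smooth the motion, at the cost of acting on slightly stale observations.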

Reference

Citation

@inproceedings{yang2026cowvla,
  title     = {Chain of World: World Model Thinking in Latent Motion},
  author    = {Yang, Fuxiang and Di, Donglin and Tang, Lulu and Zhang, Xuancheng and Fan, Lei and Li, Hao and Chen, Wei and Su, Tonghua and Ma, Baorui},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}