CrashTwin: A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models

Nuo Chen¹*, Lulin Liu¹,²*, Zihao Li¹, Ziyao Zeng³, Zihao Zhu¹, Wenyan Cong⁴, Junyuan Hong⁴, Yunhao Yang⁴, Zhengzhong Tu¹, Yan Wang⁵, Boris Ivanovic⁵, Marco Pavone⁵,⁶, Zhangyang Wang⁴, Yang Zhou¹, Zhiwen Fan¹

¹Texas A&M University ²University of Minnesota ³Yale University ⁴University of Texas at Austin ⁵NVIDIA ⁶Stanford University

*Equal contribution


Benchmark Evaluation Leaderboard

Lower is better unless marked with an upward arrow.

Models	Spatio-temporal Consistency		Momentum & Energy Conservation			World-dynamics Integrity
Models	E_warp ↓	E_div ↓	J_p ↓	J_H ↓	J_E ↓	S_ID ↑	D_ad ↓
Open-Source Models
SkyReel-1.3B	0.0227	0.6103	0.9566	0.9628	0.9457	0.6660	0.3592
Wan 2.1-14B	0.0179	0.6320	0.8235	0.8494	0.7864	0.6760	0.3117
Wan 2.2-5B	0.0145	0.5959	0.8899	0.8975	0.8649	0.7254	0.3109
Cosmos-Predict2-2B	0.0240	0.6748	0.8890	0.8954	0.8590	0.6129	0.3462
Cosmos-Predict2-14B	0.0117	0.7180	0.6828	0.7629	0.6047	0.6737	0.3327
Proprietary Models
Google Veo 3.1	0.0097	0.6202	0.7743	0.8011	0.7460	0.7232	0.3075
Hailuo 2.3	0.0151	0.6143	0.7664	0.7812	0.7285	0.6203	0.2968
Seedance V1 Pro	0.0166	0.6130	0.7725	0.7837	0.7209	0.6007	0.3002

Abstract

Generative world models hold immense promise as scalable simulators for autonomous systems, particularly for synthesizing rare but safety-critical multi-agent interactions, such as vehicle collisions. However, current evaluation paradigms focus heavily on visual fidelity and semantic alignment, leaving a critical blind spot: they cannot reliably quantify whether generated dynamics obey the fundamental physical laws required for reliable simulation. Assessing this physical plausibility is inherently difficult due to a lack of grounded metrics and the challenge of extracting metric-scale kinematics from uncalibrated video rollouts. To bridge this gap, we introduce CrashTwin, a physics-grounded evaluation framework designed to stress-test the physical trustworthiness of world models. CrashTwin couples a diverse dataset of 25.6K controllable synthetic and 12.6K in-the-wild collision sequences with a novel calibration-free reconstruction pipeline, enabling the recovery of 3D physical attributes directly from monocular generations. We propose a diagnostic suite that systematically evaluates three dimensions: spatio-temporal consistency, momentum and kinetic energy conservation, and world-dynamics integrity. Extensive benchmarking of state-of-the-art models reveals a crucial insight: high perceptual quality frequently masks severe physical violations during complex interactions. By quantitatively exposing these failure modes, CrashTwin provides a vital diagnostic tool for developing physically grounded world models capable of reliable real-world simulation.

Physical Failure Modes

Representative spatio-temporal, momentum and energy, and world-dynamics violations observed across recent world models.


CrashTwin Pipeline

CrashTwin connects scenario design, monocular reconstruction, global dynamic recovery, and physics-grounded evaluation.

CrashTwin Dataset

25.6K controllable synthetic crashes

12.6K in-the-wild accidents

3 diagnostic metric families

7 representative collision topologies

Physics-Grounded Metrics

Spatio-Temporal Consistency

E_warp measures flow warping error for temporal coherence.

E_div measures normalized flow divergence for near-rigid motion.
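The exact formulas for E_warp and E_div are not given on this page, so the following is only a minimal sketch of the two ideas under stated assumptions. Frames are grayscale grids, the flow is a per-pixel (dx, dy) field from frame t to frame t+1, warping uses nearest-neighbour lookup, and divergence is normalized by the total flow magnitude. The function names `warp_error` and `normalized_divergence` are hypothetical and do not come from the CrashTwin code.

```python
import math

def warp_error(prev, curr, flow):
    """Mean absolute difference between prev and curr sampled along the flow.

    Assumption: flow[y][x] = (dx, dy) maps a pixel of prev to its location
    in curr. Pixels whose target falls outside the frame are skipped.
    """
    h, w = len(prev), len(prev[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            xs, ys = x + round(dx), y + round(dy)
            if 0 <= xs < w and 0 <= ys < h:
                total += abs(prev[y][x] - curr[ys][xs])
                count += 1
    return total / count if count else 0.0

def normalized_divergence(flow, eps=1e-8):
    """Sum of |div(flow)| over interior pixels divided by total flow magnitude.

    Central differences approximate du/dx + dv/dy. A rigid translation has
    zero divergence; the normalization is an illustrative choice.
    """
    h, w = len(flow), len(flow[0])
    div_sum, mag_sum = 0.0, 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            du_dx = (flow[y][x + 1][0] - flow[y][x - 1][0]) / 2.0
            dv_dy = (flow[y + 1][x][1] - flow[y - 1][x][1]) / 2.0
            div_sum += abs(du_dx + dv_dy)
            mag_sum += math.hypot(flow[y][x][0], flow[y][x][1])
    return div_sum / (mag_sum + eps)

# Toy check: an image shifted right by one pixel, with matching flow.
prev = [[x + 10 * y for x in range(5)] for y in range(5)]
curr = [[prev[y][x - 1] if x >= 1 else 0 for x in range(5)] for y in range(5)]
translation = [[(1, 0) for _ in range(5)] for _ in range(5)]
expansion = [[(x - 2, y - 2) for x in range(5)] for y in range(5)]
```

In this toy setup the translation flow explains the frame change exactly and is divergence-free, while the expanding flow (objects appearing to swell) yields a clearly positive divergence score.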

Momentum & Energy Conservation

J_p captures linear momentum residual.

J_H captures angular momentum residual.

J_E penalizes kinetic energy gain.
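As a rough illustration of what these three residuals compare, the sketch below evaluates a planar two-vehicle collision before and after impact. It is not the benchmark's definition. The body fields (`m`, `I`, `r`, `v`, `w`), the symmetric normalization that keeps each score in [0, 1], and the function name `conservation_residuals` are all assumptions made for this example.

```python
import math

def linear_momentum(bodies):
    # Total linear momentum (px, py) of all bodies.
    px = sum(b["m"] * b["v"][0] for b in bodies)
    py = sum(b["m"] * b["v"][1] for b in bodies)
    return px, py

def angular_momentum(bodies):
    # Planar angular momentum about the origin: orbital term plus spin term.
    return sum(
        b["m"] * (b["r"][0] * b["v"][1] - b["r"][1] * b["v"][0]) + b["I"] * b["w"]
        for b in bodies
    )

def kinetic_energy(bodies):
    # Translational plus rotational kinetic energy.
    return sum(
        0.5 * b["m"] * (b["v"][0] ** 2 + b["v"][1] ** 2) + 0.5 * b["I"] * b["w"] ** 2
        for b in bodies
    )

def conservation_residuals(before, after, eps=1e-8):
    """Return (J_p, J_H, J_E) under an assumed symmetric normalization."""
    p0, p1 = linear_momentum(before), linear_momentum(after)
    dp = math.hypot(p1[0] - p0[0], p1[1] - p0[1])
    j_p = dp / (math.hypot(*p0) + math.hypot(*p1) + eps)

    h0, h1 = angular_momentum(before), angular_momentum(after)
    j_h = abs(h1 - h0) / (abs(h0) + abs(h1) + eps)

    e0, e1 = kinetic_energy(before), kinetic_energy(after)
    # Only energy gain is penalized; losses are physically allowed in a crash.
    j_e = max(0.0, e1 - e0) / (e0 + e1 + eps)
    return j_p, j_h, j_e

def car(x, vx, m=1000.0, inertia=1500.0):
    # Hypothetical vehicle moving along the x-axis with no yaw rate.
    return {"m": m, "I": inertia, "r": (x, 0.0), "v": (vx, 0.0), "w": 0.0}

before = [car(-5.0, 10.0), car(0.0, 0.0)]
plausible = [car(-2.0, 5.0), car(2.0, 5.0)]      # perfectly inelastic impact
implausible = [car(-2.0, 10.0), car(2.0, 10.0)]  # both cars leave at 10 m/s
```

The plausible outcome conserves momentum and loses energy, so all three residuals are zero. The implausible outcome doubles both momentum and kinetic energy, so J_p and J_E come out near 1/3 under this normalization.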

World-Dynamics Integrity

S_ID measures instance identity stability.

D_ad measures appearance-drift distance.
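The page does not spell out how S_ID and D_ad are computed. One plausible reading, sketched here purely as an assumption, treats S_ID as the fraction of consecutive frames in which a tracked vehicle keeps the same instance ID, and D_ad as the mean cosine distance between a vehicle's first-frame appearance embedding and its later embeddings. The function names and the embedding source are hypothetical.

```python
import math

def identity_stability(track_ids):
    """Fraction of consecutive frame pairs in which the assigned ID is unchanged."""
    if len(track_ids) < 2:
        return 1.0
    stable = sum(1 for a, b in zip(track_ids, track_ids[1:]) if a == b)
    return stable / (len(track_ids) - 1)

def cosine_distance(u, v, eps=1e-12):
    # 1 - cosine similarity; 0 for identical directions, 1 for orthogonal ones.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)

def appearance_drift(embeddings):
    """Mean cosine distance from the first-frame embedding to each later one."""
    reference, later = embeddings[0], embeddings[1:]
    if not later:
        return 0.0
    return sum(cosine_distance(reference, e) for e in later) / len(later)

# One ID switch across five frames: three of four transitions are stable.
ids = [3, 3, 3, 7, 7]
# Toy 2-D embeddings: unchanged in frame 2, orthogonal in frame 3.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```

With these toy inputs, `identity_stability(ids)` is 0.75 and `appearance_drift(feats)` is approximately 0.5.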

Global Dynamic Reconstruction

A calibration-free pipeline converts fragmented monocular tracks into globally consistent, metric-scale collision dynamics.

Track refinement with relinking, depth correction, ego-motion compensation, and Kalman filtering
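To make the Kalman filtering step concrete, here is a minimal one-dimensional constant-velocity filter applied to a sequence of position measurements. It is a textbook sketch, not the pipeline's implementation: the state layout, the noise parameters `q` and `r`, the simplified diagonal process noise, and the name `kalman_filter_1d` are all assumptions.

```python
def kalman_filter_1d(measurements, dt=1.0, q=1e-3, r=0.25):
    """Constant-velocity Kalman filter over scalar position measurements.

    State is (pos, vel); covariance is stored as four scalars p00..p11.
    q is added to both diagonal covariance terms at each prediction, and
    r is the measurement noise variance. Returns filtered (pos, vel) per step.
    """
    pos, vel = measurements[0], 0.0
    p00, p01, p10, p11 = 1.0, 0.0, 0.0, 1.0
    estimates = [(pos, vel)]
    for z in measurements[1:]:
        # Predict: x <- F x, P <- F P F^T + Q with F = [[1, dt], [0, 1]].
        pos = pos + vel * dt
        n00 = p00 + dt * (p01 + p10) + dt * dt * p11 + q
        n01 = p01 + dt * p11
        n10 = p10 + dt * p11
        n11 = p11 + q
        p00, p01, p10, p11 = n00, n01, n10, n11

        # Update with a position-only measurement (H = [1, 0]).
        s = p00 + r                      # innovation variance
        k0, k1 = p00 / s, p10 / s        # Kalman gain
        innovation = z - pos
        pos += k0 * innovation
        vel += k1 * innovation
        p00, p01, p10, p11 = (
            (1.0 - k0) * p00,
            (1.0 - k0) * p01,
            p10 - k1 * p00,
            p11 - k1 * p01,
        )
        estimates.append((pos, vel))
    return estimates

# A vehicle moving at a steady 2 units per frame.
track = [2.0 * t for t in range(50)]
final_pos, final_vel = kalman_filter_1d(track)[-1]
```

On this noiseless ramp the filter starts with zero velocity and converges to the true speed of 2 units per frame. In practice, `q` and `r` would be tuned to the tracker's noise characteristics.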

Metric-Human Alignment & Model Comparison

Physical metrics align with human judgments and expose substantial physical inconsistency across current world models.


Effect of Post-Training

Physics-oriented post-training improves all diagnostic metric families on Cosmos-Predict2-2B.

Model Variant	E_warp ↓	E_div ↓	J_p ↓	J_H ↓	J_E ↓	S_ID ↑	D_ad ↓
Base (w/o post-training)	0.0240	0.6748	0.8890	0.8954	0.8590	0.6129	0.3462
Base (w/ post-training)	0.0085	0.6279	0.6479	0.7296	0.5534	0.7746	0.2953
Ground Truth	0.0075	0.6549	0.3089	0.4620	0.2502	0.8010	0.2819
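The size of the post-training effect can be read off the table with a short calculation. The snippet below copies the two Cosmos-Predict2-2B rows and reports each metric's relative change in the "better" direction. This definition of relative improvement is an illustrative choice, not a quantity reported by the authors.

```python
# Values copied from the post-training table above (Cosmos-Predict2-2B).
METRICS = ["E_warp", "E_div", "J_p", "J_H", "J_E", "S_ID", "D_ad"]
HIGHER_IS_BETTER = {"S_ID"}  # every other metric is lower-is-better

base = [0.0240, 0.6748, 0.8890, 0.8954, 0.8590, 0.6129, 0.3462]
post = [0.0085, 0.6279, 0.6479, 0.7296, 0.5534, 0.7746, 0.2953]

def relative_improvement(name, before, after):
    """Relative change in the 'better' direction, as a fraction of the base value."""
    change = (after - before) if name in HIGHER_IS_BETTER else (before - after)
    return change / before

improvements = {
    name: relative_improvement(name, b, a)
    for name, b, a in zip(METRICS, base, post)
}

for name, value in improvements.items():
    print(f"{name}: {value:+.1%}")
```

Every value comes out positive, consistent with the statement that post-training improves all metric families.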