Quantitative Video World Model Evaluation for Geometric-Consistency

Jiaxin Wu1, Yihao Pi1, Yinling Zhang2, Yuheng Li3, Xueyan Zou1
1Tsinghua University - IEI Lab, 2UW-Madison, 3Adobe Research

Abstract

Generative video models show promise as implicit world models, but assessing their 3D physical realism remains difficult. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic of geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking, lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, which covers diverse scenarios. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes missed by perceptual metrics and provides a diagnostic signal for progress toward physically grounded video generation and world models.
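To make the scale-depth dimension concrete, here is a minimal sketch of one possible residual. It is not the PDI definition from the paper. It assumes a pinhole camera, under which a rigid object of fixed physical size has an apparent pixel size inversely proportional to its depth, so the product of size and depth should stay constant across frames. The function name `scale_depth_residual` and its inputs (per-frame pixel heights and depths) are hypothetical.

```python
from statistics import mean, pstdev


def scale_depth_residual(pixel_heights, depths):
    """Coefficient of variation of (apparent size * depth) across frames.

    Assumption (not from the paper): under a pinhole model h_t = f * H / Z_t,
    so h_t * Z_t is constant for a rigid object of fixed height H.
    A value of 0 means perfect agreement; larger values mean more drift.
    """
    if not pixel_heights or len(pixel_heights) != len(depths):
        raise ValueError("need one positive size and one depth per frame")
    products = [h * z for h, z in zip(pixel_heights, depths)]
    mu = mean(products)
    if mu <= 0:
        raise ValueError("sizes and depths must be positive")
    return pstdev(products) / mu


# Object recedes from 2 m to 8 m and shrinks exactly as perspective predicts.
consistent = scale_depth_residual([200.0, 100.0, 50.0], [2.0, 4.0, 8.0])

# Same depths, but the object barely shrinks: a scale-depth violation.
inconsistent = scale_depth_residual([200.0, 150.0, 140.0], [2.0, 4.0, 8.0])
```

A normalized residual of this kind is convenient because it is invariant to the unknown focal length and to the object's true size, both of which cancel in the ratio.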

PDI-Bench Leaderboard

Quantitative comparison of physical consistency on PDI-Bench. Lower PDI is better.

Rank  Model              Organization     PDI Score ↓
1     Ground Truth (GT)  Real World       0.1206
2     Seedance 2.0       ByteDance        0.2422
3     CogVideoX-3        Zhipu AI         0.2480
4     Veo 3.1            Google           0.4521
5     Wan 2.2            Alibaba          0.5595
6     Sora               OpenAI           0.8255
7     HunyuanVideo       Tencent Hunyuan  0.8825


Visualization of the PDI-Bench Target-Uplift-Anchor pipeline.
Step 1: Semantic Targeting (SAM 2) isolates the auditing subject to establish precise 2D spatial boundaries.
Step 2: 3D Geometric Uplifting (MegaSaM) reconstructs the physical environment and projects pixels into a unified 3D world-space pointmap.
Step 3: 3D Structural Anchoring (CoTracker3) tracks dense pixel-space anchors and lifts them into structurally meaningful 3D trajectories for subsequent rigidity auditing.
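The lifting and rigidity-auditing idea in the steps above can be sketched as follows. This is an illustrative toy, not the PDI-Bench implementation: plain nested lists stand in for the outputs of SAM 2, MegaSaM, and CoTracker3, anchors are lifted by a simple integer-pixel lookup into the pointmap, and rigidity is scored as the variation of pairwise 3D distances between anchors over time. The names `lift_tracks` and `rigidity_residual` are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean, pstdev


def lift_tracks(tracks_2d, pointmaps):
    """Lift tracked pixels to 3D by looking them up in per-frame pointmaps.

    tracks_2d[t][k] is the integer pixel (u, v) of anchor k in frame t.
    pointmaps[t][v][u] is the world-space point (X, Y, Z) at that pixel.
    """
    return [
        [pointmaps[t][v][u] for (u, v) in frame]
        for t, frame in enumerate(tracks_2d)
    ]


def rigidity_residual(tracks_3d):
    """Mean coefficient of variation of pairwise anchor distances over time.

    Assumption: on a rigid body every anchor-to-anchor distance is constant,
    so the score is 0 for rigid motion and grows as the object deforms.
    """
    n_anchors = len(tracks_3d[0])
    scores = []
    for i, j in combinations(range(n_anchors), 2):
        lengths = [dist(frame[i], frame[j]) for frame in tracks_3d]
        scores.append(pstdev(lengths) / mean(lengths))
    return mean(scores)


# Toy 2x2 pointmaps: frame 1 is frame 0 translated by 0.5 along X (rigid).
pm0 = [[(0.0, 0.0, 5.0), (1.0, 0.0, 5.0)], [(0.0, 1.0, 5.0), (1.0, 1.0, 5.0)]]
pm1 = [[(0.5, 0.0, 5.0), (1.5, 0.0, 5.0)], [(0.5, 1.0, 5.0), (1.5, 1.0, 5.0)]]
# Frame 1 alternative: the object is stretched along X (non-rigid).
pm_stretched = [[(0.0, 0.0, 5.0), (2.0, 0.0, 5.0)], [(0.0, 1.0, 5.0), (2.0, 1.0, 5.0)]]

tracks = [[(0, 0), (1, 0), (0, 1)], [(0, 0), (1, 0), (0, 1)]]

rigid_score = rigidity_residual(lift_tracks(tracks, [pm0, pm1]))
stretched_score = rigidity_residual(lift_tracks(tracks, [pm0, pm_stretched]))
```

Pairwise distances are used here because they are invariant to camera and object pose, so no explicit rigid alignment between frames is needed. A real pipeline would also need sub-pixel interpolation and handling of occluded anchors, which this sketch omits.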

Examples

Bear

Black Swan

Kite Surf

Stroller

Soccer Ball

Car Shadow
