Quantitative Video World Model Evaluation for Geometric-Consistency

Jiaxin Wu¹, Yihao Pi¹, Yinling Zhang², Yuheng Li³, Xueyan Zou¹

¹Tsinghua University - IEI Lab, ²UW-Madison, ³Adobe Research

Abstract

Generative video models show promise as implicit world models, but assessing their 3D physical realism remains difficult. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and only weakly diagnostic of geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking, lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals that capture three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, which covers diverse scenarios. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that perceptual metrics miss, and it provides a diagnostic signal for progress toward physically grounded video generation and world models.

Overview of the PDI-Bench Evaluation.
(Top) Qualitative samples from our dataset, featuring Ground Truth (GT) videos and generated sequences from state-of-the-art models.
(Bottom) The corresponding PDI-Scores for GT and each model. Lower scores indicate better adherence to 3D physical laws (scale alignment, motion consistency, and structural rigidity).

PDI-Bench Leaderboard

Quantitative comparison of physical consistency on PDI-Bench. Lower PDI is better.

Rank	Model	Organization	PDI Score ↓
1	Ground Truth (GT)	Real World	0.1206
2	Seedance 2.0	ByteDance	0.2422
3	CogVideoX-3	Zhipu AI	0.2480
4	Veo 3.1	Google	0.4521
5	Wan 2.2	Alibaba	0.5595
6	Sora	OpenAI	0.8255
7	HunyuanVideo	Tencent Hunyuan	0.8825

Visualization of the PDI-Bench Target-Uplift-Anchor pipeline.
Step 1: Semantic Targeting (SAM 2) isolates the auditing subject to establish precise 2D spatial boundaries.
Step 2: 3D Geometric Uplifting (MegaSaM) reconstructs the physical environment and projects pixels into a unified 3D world-space pointmap.
Step 3: 3D Structural Anchoring (CoTracker3) tracks dense pixel-space anchors and lifts them into structurally meaningful 3D trajectories for subsequent rigidity auditing.
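
To make the rigidity audit more concrete, the following is a minimal sketch of one way a rigidity residual could be computed from the lifted 3D anchor trajectories. It is not the PDI-Bench implementation: the function name `rigidity_residual`, the `(T, N, 3)` array layout, and the choice of the first frame as the reference are all assumptions made for illustration. The idea is that a rigid object preserves the distances between its anchors under any rotation and translation, so drift in those distances signals a structural failure.

```python
import numpy as np


def rigidity_residual(tracks_3d: np.ndarray) -> float:
    """Mean relative drift of pairwise anchor distances (hypothetical metric).

    tracks_3d: array of shape (T, N, 3) holding the world-space positions
    of N tracked anchors over T frames (assumed layout).
    """
    # Pairwise difference vectors per frame: shape (T, N, N, 3).
    diffs = tracks_3d[:, :, None, :] - tracks_3d[:, None, :, :]
    # Pairwise Euclidean distances per frame: shape (T, N, N).
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep each unordered pair once (upper triangle, no diagonal).
    i, j = np.triu_indices(tracks_3d.shape[1], k=1)
    d = dists[:, i, j]  # shape (T, P)
    # Use the first frame as the reference shape (an assumption).
    ref = d[0]
    rel = np.abs(d - ref) / np.maximum(ref, 1e-8)
    return float(rel.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(5, 3))
    # A rigid move (here a pure translation) leaves the residual near zero.
    print(rigidity_residual(np.stack([pts, pts + 1.0])))
    # Non-rigid stretching produces a clearly positive residual.
    print(rigidity_residual(np.stack([pts, pts * 1.5])))
```

A relative (rather than absolute) deviation is used here so that the value does not depend on the arbitrary scale of a monocular reconstruction.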

PDI-Bench pipeline.

PDI-Bench geometric metrics visualization

The three key perspectives for geometric consistency.
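
As an illustration of the scale-depth perspective, here is a small sketch based on the standard pinhole camera relation, in which an object of fixed physical size H at depth Z appears with image size h = f·H/Z, so the product h·Z should stay constant across frames when the focal length f is fixed. The function `scale_depth_residual` and its use of the coefficient of variation are assumptions for illustration only; the metric actually used by PDI-Bench may be defined differently.

```python
import numpy as np


def scale_depth_residual(sizes_px, depths) -> float:
    """Coefficient of variation of (apparent size * depth) across frames.

    sizes_px: per-frame apparent object size in pixels (e.g. mask height).
    depths:   per-frame object depth from the reconstruction.
    Both inputs and the metric itself are hypothetical.
    """
    s = np.asarray(sizes_px, dtype=float)
    z = np.asarray(depths, dtype=float)
    # Under a pinhole camera with fixed focal length, s * z is constant
    # for an object whose physical size does not change.
    prod = s * z
    return float(prod.std() / prod.mean())


if __name__ == "__main__":
    depths = [2.0, 4.0, 8.0]
    # Size shrinks as 1/depth: consistent with perspective projection.
    print(scale_depth_residual([50.0, 25.0, 12.5], depths))
    # Size stays fixed while depth grows: violates perspective projection.
    print(scale_depth_residual([50.0, 50.0, 50.0], depths))
```

Normalizing by the mean keeps the value independent of the unknown focal length and object size, which only scale the product by a constant.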

Examples

Bear

Black Swan

Kite Surf

Stroller

Soccer Ball

Car Shadow
