Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
TL;DR Highlight
Extracting the implicit 3D spatial knowledge learned by video generation models (Wan2.1) to boost MLLM spatial reasoning ability.
Who Should Read
Researchers working on multimodal LLM spatial reasoning and engineers building embodied AI or robotics applications requiring 3D scene understanding.
Core Mechanics
- Video generation models such as Wan2.1 implicitly learn 3D spatial structure in order to synthesize temporally consistent video — this knowledge is latent in their weights
- Multimodal LLMs (MLLMs) trained primarily on static images suffer from "spatial blindness," struggling with fine-grained geometric reasoning
- The paper introduces VEGA-3D, a plug-and-play framework that repurposes a frozen, pre-trained video diffusion model as a latent world simulator for the MLLM
- A clean video latent is noised to an intermediate flow-matching timestep, and spatiotemporal features are extracted from an intermediate DiT layer with an empty text prompt to minimize semantic leakage
- These generative features are fused with standard semantic encoder features (e.g., from SigLIP) via a token-level adaptive gated fusion, enriching the MLLM with dense geometric cues without explicit 3D supervision
- The approach sidesteps the data scarcity and generalization limits of explicit 3D modalities, making generative priors a scalable foundation for physical-world understanding
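One intuition behind tapping the video model at an intermediate noise level: in a flow-matching model, partially noising a clean latent preserves coarse scene structure while washing out fine detail. A minimal, illustrative sketch of this interpolation (toy tensors and timestep, not the paper's code):

```python
import torch

def noise_to_timestep(z0: torch.Tensor, t: float) -> torch.Tensor:
    """Linearly interpolate a clean latent toward Gaussian noise (flow matching).

    t = 0 returns the clean latent; t = 1 returns pure noise.
    Intermediate t keeps coarse structure while destroying fine detail.
    """
    eps = torch.randn_like(z0)
    return (1.0 - t) * z0 + t * eps

torch.manual_seed(0)
z0 = torch.randn(4, 16)              # stand-in for a clean video latent
z_mid = noise_to_timestep(z0, 0.3)   # intermediate timestep (e.g. k = 300 / 1000)
z_full = noise_to_timestep(z0, 1.0)  # pure noise

# correlation with the clean latent decays as t grows
corr_mid = torch.corrcoef(torch.stack([z0.flatten(), z_mid.flatten()]))[0, 1]
corr_full = torch.corrcoef(torch.stack([z0.flatten(), z_full.flatten()]))[0, 1]
```

At t = 0.3 the noised latent still correlates strongly with the clean one, while at t = 1.0 the correlation collapses toward zero; features extracted in this intermediate regime are what the method feeds to the MLLM.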
Evidence
- On ScanQA (3D spatial QA benchmark): the MLLM augmented with generative features improved by 18% over the base MLLM
- On EmbodiedScan: +11% on spatial relationship understanding tasks
- The extracted generative priors were more effective than comparable explicit 3D supervision for improving MLLM spatial reasoning
How to Apply
- To improve MLLM spatial reasoning without costly 3D data collection: keep a pre-trained video diffusion model frozen, extract its DiT features at an intermediate noise level for the input frames, project them into the LLM embedding space, and fuse them with the semantic visual tokens before the LLM.
- The frozen video model acts as a free latent world simulator — its features encode 3D structure learned from large-scale video, so no explicit 3D annotations or reconstructions are needed.
- This approach works best for embodied AI and robotics applications where spatial relationships between objects matter — less useful for tasks that don't require 3D understanding.
Code Example
# VEGA-3D core logic pseudocode
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    def __init__(self, d_llm):
        super().__init__()
        self.gate = nn.Linear(d_llm * 2, 1)
        self.ln_gen = nn.LayerNorm(d_llm)
        self.ln_sem = nn.LayerNorm(d_llm)

    def forward(self, F_gen, F_sem):
        # F_gen, F_sem: [T, N, D_llm]
        concat = torch.cat([self.ln_gen(F_gen), self.ln_sem(F_sem)], dim=-1)
        g = torch.sigmoid(self.gate(concat))  # [T, N, 1]
        return (1 - g) * F_gen + g * F_sem
def extract_generative_prior(video_frames, wan_model, timestep_k=300, layer_idx=20):
    """
    Extract intermediate features from frozen Wan2.1-T2V.
    video_frames: [T, H, W, 3]
    """
    with torch.no_grad():
        # 1. VAE encoding
        z0 = wan_model.vae.encode(video_frames)  # clean latent
        # 2. Noise injection at an intermediate timestep (flow matching)
        tk = timestep_k / 1000.0
        eps = torch.randn_like(z0)
        zk = (1 - tk) * z0 + tk * eps  # noised latent
        # 3. Extract DiT features with an empty text prompt (minimize semantic information)
        features = wan_model.dit.get_intermediate_feature(
            zk, timestep=timestep_k, text_cond="", layer=layer_idx
        )  # [T, N, D_gen]
    return features
# Training loop overview
for frames, query, answer in dataloader:
    # Semantic branch (SigLIP, etc.)
    sem_feats = semantic_encoder(frames)  # [T, N, D]
    F_sem = sem_projector(sem_feats)      # [T, N, D_llm]
    # Generative branch (frozen Wan2.1)
    gen_feats = extract_generative_prior(frames, wan_model)
    F_gen = gen_projector(gen_feats)      # [T, N, D_llm]
    # Adaptive gated fusion
    fused = gated_fusion(F_gen, F_sem)    # [T, N, D_llm]
    # LLM forward + language-modeling loss
    loss = llm(query, visual_tokens=fused, target=answer)
    # Update projectors and fusion gate; the Wan2.1 backbone stays frozen
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
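As a standalone sanity check of the gated fusion, a self-contained toy version (random tensors and illustrative sizes; the gate is returned alongside the output here for inspection, unlike the pseudocode above) confirms the result is a token-wise convex blend of the two branches:

```python
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Token-level gate blending generative (F_gen) and semantic (F_sem) features."""
    def __init__(self, d_llm: int):
        super().__init__()
        self.gate = nn.Linear(d_llm * 2, 1)
        self.ln_gen = nn.LayerNorm(d_llm)
        self.ln_sem = nn.LayerNorm(d_llm)

    def forward(self, F_gen, F_sem):
        concat = torch.cat([self.ln_gen(F_gen), self.ln_sem(F_sem)], dim=-1)
        g = torch.sigmoid(self.gate(concat))   # [T, N, 1], strictly in (0, 1)
        return (1 - g) * F_gen + g * F_sem, g

torch.manual_seed(0)
fusion = AdaptiveGatedFusion(d_llm=64)
F_gen = torch.randn(8, 256, 64)  # [frames, tokens, d_llm] — toy sizes
F_sem = torch.randn(8, 256, 64)
fused, g = fusion(F_gen, F_sem)  # fused: [8, 256, 64], g: [8, 256, 1]
```

Because the gate passes through a sigmoid, each token's output is guaranteed to lie on the line segment between its generative and semantic features — the model can only reweight the two branches, never amplify either beyond its input.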
Original Abstract
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.