Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
TL;DR Highlight
Extracting the implicit 3D spatial knowledge learned by video generation models (Wan2.1) to boost MLLM spatial reasoning ability.
Who Should Read
Researchers working on multimodal LLM spatial reasoning and engineers building embodied AI or robotics applications requiring 3D scene understanding.
Core Mechanics
- Video generation models such as Wan2.1 implicitly learn 3D spatial structure in order to synthesize temporally consistent video — this knowledge is latent in their weights
- Multimodal LLMs (MLLMs) trained primarily on static images suffer from "spatial blindness," struggling with fine-grained geometric reasoning
- The paper introduces VEGA-3D, a plug-and-play framework that repurposes a frozen, pre-trained video diffusion model as a latent world simulator for the MLLM
- A clean video latent is noised to an intermediate flow-matching timestep, and spatiotemporal features are extracted from an intermediate DiT layer with an empty text prompt to minimize semantic leakage
- These generative features are fused with standard semantic encoder features (e.g., from SigLIP) via a token-level adaptive gated fusion, enriching the MLLM with dense geometric cues without explicit 3D supervision
- The approach sidesteps the data scarcity and generalization limits of explicit 3D modalities, making generative priors a scalable foundation for physical-world understanding
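One intuition behind tapping the video model at an intermediate noise level: in a flow-matching model, partially noising a clean latent preserves coarse scene structure while washing out fine detail. A minimal, illustrative sketch of this interpolation (toy tensors and timestep, not the paper's code):

```python
import torch

def noise_to_timestep(z0: torch.Tensor, t: float) -> torch.Tensor:
    """Linearly interpolate a clean latent toward Gaussian noise (flow matching).

    t = 0 returns the clean latent; t = 1 returns pure noise.
    Intermediate t keeps coarse structure while destroying fine detail.
    """
    eps = torch.randn_like(z0)
    return (1.0 - t) * z0 + t * eps

torch.manual_seed(0)
z0 = torch.randn(4, 16)              # stand-in for a clean video latent
z_mid = noise_to_timestep(z0, 0.3)   # intermediate timestep (e.g. k = 300 / 1000)
z_full = noise_to_timestep(z0, 1.0)  # pure noise

# correlation with the clean latent decays as t grows
corr_mid = torch.corrcoef(torch.stack([z0.flatten(), z_mid.flatten()]))[0, 1]
corr_full = torch.corrcoef(torch.stack([z0.flatten(), z_full.flatten()]))[0, 1]
```

At t = 0.3 the noised latent still correlates strongly with the clean one, while at t = 1.0 the correlation collapses toward zero; features extracted in this intermediate regime are what the method feeds to the MLLM.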
Evidence
- On ScanQA (3D spatial QA benchmark): the MLLM augmented with generative features improved by 18% over the base MLLM
- On EmbodiedScan: +11% on spatial relationship understanding tasks
- The extracted generative priors were more effective than comparable explicit 3D supervision for improving MLLM spatial reasoning
How to Apply
- To improve MLLM spatial reasoning without costly 3D data collection: keep a pre-trained video diffusion model frozen, extract its DiT features at an intermediate noise level for the input frames, project them into the LLM embedding space, and fuse them with the semantic visual tokens before the LLM.
- The frozen video model acts as a free latent world simulator — its features encode 3D structure learned from large-scale video, so no explicit 3D annotations or reconstructions are needed.
- This approach works best for embodied AI and robotics applications where spatial relationships between objects matter — less useful for tasks that don't require 3D understanding.
Code Example
# VEGA-3D core logic pseudocode
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    def __init__(self, d_llm):
        super().__init__()
        self.gate = nn.Linear(d_llm * 2, 1)
        self.ln_gen = nn.LayerNorm(d_llm)
        self.ln_sem = nn.LayerNorm(d_llm)

    def forward(self, F_gen, F_sem):
        # F_gen, F_sem: [T, N, D_llm]
        concat = torch.cat([self.ln_gen(F_gen), self.ln_sem(F_sem)], dim=-1)
        g = torch.sigmoid(self.gate(concat))  # [T, N, 1]
        return (1 - g) * F_gen + g * F_sem
def extract_generative_prior(video_frames, wan_model, timestep_k=300, layer_idx=20):
    """
    Extract intermediate features from frozen Wan2.1-T2V.
    video_frames: [T, H, W, 3]
    """
    with torch.no_grad():
        # 1. VAE encoding
        z0 = wan_model.vae.encode(video_frames)  # clean latent
        # 2. Noise injection at an intermediate timestep (flow matching)
        tk = timestep_k / 1000.0
        eps = torch.randn_like(z0)
        zk = (1 - tk) * z0 + tk * eps  # noised latent
        # 3. Extract DiT features with an empty text prompt (minimize semantic information)
        features = wan_model.dit.get_intermediate_feature(
            zk, timestep=timestep_k, text_cond="", layer=layer_idx
        )  # [T, N, D_gen]
    return features
# Training loop overview
for frames, query, answer in dataloader:
    # Semantic branch (SigLIP, etc.)
    sem_feats = semantic_encoder(frames)  # [T, N, D]
    F_sem = sem_projector(sem_feats)      # [T, N, D_llm]
    # Generative branch (frozen Wan2.1)
    gen_feats = extract_generative_prior(frames, wan_model)
    F_gen = gen_projector(gen_feats)      # [T, N, D_llm]
    # Adaptive gated fusion
    fused = gated_fusion(F_gen, F_sem)    # [T, N, D_llm]
    # LLM forward + language-modeling loss
    loss = llm(query, visual_tokens=fused, target=answer)
    # Update projectors and fusion gate; the Wan2.1 backbone stays frozen
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
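As a standalone sanity check of the gated fusion, a self-contained toy version (random tensors and illustrative sizes; the gate is returned alongside the output here for inspection, unlike the pseudocode above) confirms the result is a token-wise convex blend of the two branches:

```python
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Token-level gate blending generative (F_gen) and semantic (F_sem) features."""
    def __init__(self, d_llm: int):
        super().__init__()
        self.gate = nn.Linear(d_llm * 2, 1)
        self.ln_gen = nn.LayerNorm(d_llm)
        self.ln_sem = nn.LayerNorm(d_llm)

    def forward(self, F_gen, F_sem):
        concat = torch.cat([self.ln_gen(F_gen), self.ln_sem(F_sem)], dim=-1)
        g = torch.sigmoid(self.gate(concat))   # [T, N, 1], strictly in (0, 1)
        return (1 - g) * F_gen + g * F_sem, g

torch.manual_seed(0)
fusion = AdaptiveGatedFusion(d_llm=64)
F_gen = torch.randn(8, 256, 64)  # [frames, tokens, d_llm] — toy sizes
F_sem = torch.randn(8, 256, 64)
fused, g = fusion(F_gen, F_sem)  # fused: [8, 256, 64], g: [8, 256, 1]
```

Because the gate passes through a sigmoid, each token's output is guaranteed to lie on the line segment between its generative and semantic features — the model can only reweight the two branches, never amplify either beyond its input.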
Original Abstract
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.