Generative Models Know Space: Using the Implicit 3D Prior of Video Generation Models for 3D Scene Understanding

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Mar 19, 2026 • Xianjin Wu, Dingkang Liang, Tianrui Feng +5

TL;DR Highlight

A plug-and-play framework that extracts the implicit 3D spatial knowledge learned by a video generation model (Wan2.1) and uses it to improve the spatial reasoning of MLLMs.

Who Should Read

AI engineers building MLLM-based 3D scene understanding or spatial reasoning pipelines. Especially useful if you want better spatial awareness without explicit 3D data such as point clouds.

Core Mechanics

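A video generation model (Wan2.1-T2V 1.3B) internally learns 3D geometry and physical laws in order to produce temporally consistent video, so a spatial prior forms without any dedicated 3D supervision
VEGA-3D injects noise into the latents of a frozen (not trained) video generation model, extracts features from an intermediate DiT layer, merges them with semantic encoder features through token-level Adaptive Gated Fusion, and feeds the result to the MLLM
DiT-based generation models (Wan2.1) score far higher on multi-view consistency (96%+) than UNet-based models (SVD, Vmem), and this score correlates strongly and positively with downstream 3D performance
The best extraction point is a mid-range diffusion timestep (k=300) and a middle DiT layer (the 20th); spatial information degrades when the latent is either too clean or too noisy
Adaptive Gated Fusion consistently beats plain Add or Concatenate; the key is dynamically adjusting the semantic vs. geometric contribution at each token position
Caching the generative features once per scene cuts inference cost substantially, which makes deployment practical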
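To make the Add / Concatenate / gated comparison concrete, here is a minimal PyTorch sketch of the three fusion styles on toy tensors. The class names, the toy sizes, and the exact gate layout are illustrative assumptions, not the authors' implementation; the only point is that the gated variant produces one mixing weight per token.

```python
import torch
import torch.nn as nn


class AddFusion(nn.Module):
    """Baseline: element-wise sum of the two token streams."""

    def forward(self, f_gen, f_sem):
        return f_gen + f_sem


class ConcatFusion(nn.Module):
    """Baseline: concatenate along channels, then project back to d_model."""

    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, f_gen, f_sem):
        return self.proj(torch.cat([f_gen, f_sem], dim=-1))


class GatedFusion(nn.Module):
    """Token-level gate: one scalar in (0, 1) per token decides the mix."""

    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, f_gen, f_sem):
        g = torch.sigmoid(self.gate(torch.cat([f_gen, f_sem], dim=-1)))
        return (1 - g) * f_gen + g * f_sem  # convex combination per token


torch.manual_seed(0)
T, N, D = 2, 4, 8  # frames, tokens per frame, channels (toy sizes, assumed)
f_gen, f_sem = torch.randn(T, N, D), torch.randn(T, N, D)
outputs = {
    "add": AddFusion()(f_gen, f_sem),
    "concat": ConcatFusion(D)(f_gen, f_sem),
    "gated": GatedFusion(D)(f_gen, f_sem),
}
for name, out in outputs.items():
    print(name, tuple(out.shape))
```

All three keep the `[T, N, D]` token layout, so they are drop-in alternatives; only the gated version can lean on geometry for some tokens and on semantics for others.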

Evidence

Best average rank (1.8) across the 3D scene understanding benchmarks, ahead of 3DRS (2.2) and LLaVA-4D (2.8): ScanRefer Acc@0.25 63.2, Multi3DRefer F1@0.25 60.8, ScanQA CIDEr 106.3, SQA3D EM 61.3
Over the Video-3D LLM baseline: ScanRefer Acc@0.25 58.1 → 63.2 (+5.1 points), SQA3D EM 58.6 → 61.3 (+2.7 points)
On VSI-Bench spatial reasoning, improves the Qwen2.5VL-7B baseline from 48.9 to 50.5 (+1.6 points), surpassing the proprietary Gemini-1.5-Pro (45.4) and SpaceR-7B (45.6) at an open-source 7B scale
On the LIBERO robot manipulation benchmark, average success rate 97.0% → 97.3%; LIBERO-Object 98.3 → 99.4 (+1.1 points)
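As a quick sanity check on the reported gains, the short script below recomputes each delta from the baseline and VEGA-3D numbers listed above. The tuple layout and the generic "score" label for VSI-Bench are choices made for this sketch; the figures themselves are the ones quoted in this section.

```python
# (benchmark, metric, baseline, with VEGA-3D) as quoted in this section
reported = [
    ("ScanRefer", "Acc@0.25", 58.1, 63.2),
    ("SQA3D", "EM", 58.6, 61.3),
    ("VSI-Bench", "score", 48.9, 50.5),
    ("LIBERO", "avg success %", 97.0, 97.3),
    ("LIBERO-Object", "success %", 98.3, 99.4),
]

# Round to one decimal to hide floating-point noise in the subtraction.
gains = {name: round(new - old, 1) for name, _, old, new in reported}

for name, metric, old, new in reported:
    print(f"{name:14s} {metric:14s} {old:5.1f} -> {new:5.1f}  (+{gains[name]:.1f})")
```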

How to Apply

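To attach it plug-and-play to an existing MLLM (e.g., Qwen2.5VL, Video-3D LLM): keep a frozen Wan2.1-T2V 1.3B as a separate branch, encode the input frames with the VAE → inject noise at timestep k=300 → extract features from the 20th DiT layer → merge with semantic features via Adaptive Gated Fusion, then feed the result to the LLM
If inference cost is a concern: compute the generative features offline once per scene and cache them, so multiple questions about the same scene reuse them without recomputation. The paper reports latency dropping from 173 ms to 70 ms at 832×480 with caching
When choosing a generative backbone: pick a DiT-based model with a high multi-view correspondence score (e.g., Wan2.1). UNet-based models (SVD, Stable Diffusion, etc.) have lower spatial consistency, so the benefit is limited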
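The per-scene caching advice can be sketched with a small in-memory cache keyed by scene ID. Everything here (`ScenePriorCache`, `fake_generative_branch`, the scene IDs and questions) is hypothetical and stands in for the real frozen video model; it only shows the compute-once, reuse-many pattern. With the reported 173 ms → 70 ms figures, a cache hit is roughly 60% cheaper than a full pass.

```python
from typing import Callable, Dict, Hashable, List


class ScenePriorCache:
    """Compute generative features once per scene and reuse them afterwards."""

    def __init__(self, compute_fn: Callable[[Hashable], List[float]]):
        self._compute_fn = compute_fn
        self._store: Dict[Hashable, List[float]] = {}
        self.misses = 0  # number of times the expensive branch actually ran

    def get(self, scene_id: Hashable) -> List[float]:
        if scene_id not in self._store:
            self.misses += 1
            self._store[scene_id] = self._compute_fn(scene_id)
        return self._store[scene_id]


def fake_generative_branch(scene_id: Hashable) -> List[float]:
    # Stand-in for the frozen video model; returns dummy features.
    return [float(len(str(scene_id)))] * 4


cache = ScenePriorCache(fake_generative_branch)
questions = [
    ("scene0000", "Where is the chair?"),
    ("scene0000", "How many windows are there?"),
    ("scene0001", "What is left of the bed?"),
]
for scene_id, _question in questions:
    features = cache.get(scene_id)  # second scene0000 query is a cache hit

print("expensive passes:", cache.misses, "for", len(questions), "questions")
```

In a real deployment the stored value would be the extracted feature tensor (possibly written to disk), and the key would need to identify the exact set of input frames for the scene.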

Code Example


# VEGA-3D core logic (pseudocode)
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    def __init__(self, d_llm):
        super().__init__()
        self.gate = nn.Linear(d_llm * 2, 1)
        self.ln_gen = nn.LayerNorm(d_llm)
        self.ln_sem = nn.LayerNorm(d_llm)

    def forward(self, F_gen, F_sem):
        # F_gen, F_sem: [T, N, D_llm]
        concat = torch.cat([self.ln_gen(F_gen), self.ln_sem(F_sem)], dim=-1)
        g = torch.sigmoid(self.gate(concat))  # [T, N, 1]
        return (1 - g) * F_gen + g * F_sem


def extract_generative_prior(video_frames, wan_model, timestep_k=300, layer_idx=20):
    """
    Extract intermediate features from a frozen Wan2.1-T2V model.
    video_frames: [T, H, W, 3]

    Note: `wan_model.vae.encode` and `wan_model.dit.get_intermediate_feature`
    are placeholder names for illustration, not a confirmed Wan2.1 API.
    """
    with torch.no_grad():
        # 1. VAE encoding
        z0 = wan_model.vae.encode(video_frames)  # clean latent

        # 2. Inject noise at an intermediate timestep (Flow Matching)
        tk = timestep_k / 1000.0
        eps = torch.randn_like(z0)
        zk = (1 - tk) * z0 + tk * eps  # noised latent

        # 3. Extract DiT features with an empty text prompt (minimizes semantic information)
        features = wan_model.dit.get_intermediate_feature(
            zk, timestep=timestep_k, text_cond="", layer=layer_idx
        )  # [T, N, D_gen]

    return features


# Training loop outline
# (dataloader, semantic_encoder, sem_projector, gen_projector, gated_fusion,
#  wan_model, and llm are assumed to be defined elsewhere)
for frames, query, answer in dataloader:
    # Semantic branch (e.g., SigLIP)
    sem_feats = semantic_encoder(frames)       # [T, N, D]
    F_sem = sem_projector(sem_feats)           # [T, N, D_llm]

    # Generative branch (frozen Wan2.1)
    gen_feats = extract_generative_prior(frames, wan_model)
    F_gen = gen_projector(gen_feats)           # [T, N, D_llm]

    # Adaptive Gated Fusion
    fused = gated_fusion(F_gen, F_sem)         # [T, N, D_llm]

    # LLM input
    loss = llm(query, visual_tokens=fused, target=answer)
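The `get_intermediate_feature` call in the pseudocode above is a placeholder. One common way to read out an intermediate layer of a frozen PyTorch model is a forward hook; the sketch below shows that pattern on a toy transformer-like stack. `ToyBlock`, `ToyBackbone`, and the `blocks` attribute name are assumptions made for this example and do not describe the real Wan2.1 module layout.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Residual MLP block standing in for one DiT layer."""

    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x):
        return x + self.mlp(self.norm(x))


class ToyBackbone(nn.Module):
    def __init__(self, d=16, depth=6):
        super().__init__()
        self.blocks = nn.ModuleList([ToyBlock(d) for _ in range(depth)])

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x


def intermediate_features(model, x, layer_idx):
    """Run one forward pass and return the output of model.blocks[layer_idx]."""
    captured = {}

    def hook(module, inputs, output):
        captured["feat"] = output.detach()

    handle = model.blocks[layer_idx].register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(x)
    finally:
        handle.remove()  # always unregister so later passes are unaffected
    return captured["feat"]


torch.manual_seed(0)
backbone = ToyBackbone().eval().requires_grad_(False)  # frozen, like the generative branch
tokens = torch.randn(2, 10, 16)  # [batch, tokens, channels], toy sizes
feat = intermediate_features(backbone, tokens, layer_idx=3)
print(tuple(feat.shape))
```

The hook leaves the model code untouched, which fits the plug-and-play framing. A real pipeline could also stop the forward pass after the target layer to save compute, which this sketch does not do.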

Terminology

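MLLM: A large language model that understands images and video as well as text. Multimodal AI such as GPT-4o or LLaVA.

DiT (Diffusion Transformer): One of the core architectures for image/video generation models. It uses a Transformer in place of the traditional UNet, which captures global context better. Used by Wan2.1, Sora, and others.

Flow Matching: A training technique for generative models that learns a vector field along the path between clean data and noise. It can be thought of as a close relative of diffusion.

implicit 3D prior: 3D spatial structure that ends up encoded inside a model even though it was never explicitly trained on 3D data, similar to picking up a sense of depth just from looking at many photos.

plug-and-play: An approach that lets a module be slotted into an existing model without major architectural changes, much like plugging in a USB device.

VAE (Variational Autoencoder): A neural network that encodes an image into a compressed low-dimensional representation and reconstructs it. Video generation models use it to work in the compressed space instead of on pixels.

Adaptive Gated Fusion: Instead of simply averaging two kinds of information (semantic features + spatial features), it dynamically decides a per-position weight for how much to trust each side before merging them.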

Related Resources

VEGA-3D GitHub: https://github.com/H-EmbodVis/VEGA-3D

Original Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.