Geometry-Guided Camera Motion Understanding in VideoLLMs
TL;DR Highlight
VideoLLMs struggle to recognize camera movements (pan/tilt/dolly); injecting camera-motion cues derived from 3D geometry models into the prompt markedly improves recognition.
Who Should Read
Researchers working on video understanding with LLMs, and developers building video analysis systems that need reliable camera motion awareness.
Core Mechanics
- Identified that VideoLLMs fail to accurately recognize camera motion types (pan, tilt, zoom, dolly, orbit)
- Camera motion understanding requires distinguishing camera movement from object movement — a challenging geometric problem
- Proposed extracting camera motion parameters from depth estimation and structure-from-motion (SfM) models
- These extracted parameters are formatted as natural language prompts and injected alongside video frames
- The camera motion-augmented prompts significantly improve VideoLLM performance on motion understanding tasks
- No fine-tuning required — works by augmenting the input context with geometric information
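The geometric step above can be sketched as a rule-based mapping from a relative camera pose (as output by a 3D foundation model between consecutive frames) to a motion-primitive label. The paper uses a learned temporal classifier; the thresholds, axis convention (x-right, y-down, z-forward), and function name below are illustrative assumptions, not the authors' implementation.

```python
import math

def classify_motion(R, t, rot_thresh_deg=1.0, trans_thresh=0.05):
    """Map a relative camera pose (rotation R, translation t, expressed in
    the first camera's frame) to a motion-primitive label.
    Rule-based sketch with hypothetical thresholds, for intuition only."""
    # Direction the camera's forward axis points after the motion: R @ [0, 0, 1].
    fwd = [R[0][2], R[1][2], R[2][2]]
    yaw = math.degrees(math.atan2(fwd[0], fwd[2]))                  # left/right rotation
    pitch = math.degrees(math.asin(max(-1.0, min(1.0, -fwd[1]))))   # up/down rotation

    labels = []
    if yaw > rot_thresh_deg:
        labels.append("pan-right")
    elif yaw < -rot_thresh_deg:
        labels.append("pan-left")
    if pitch > rot_thresh_deg:
        labels.append("tilt-up")
    elif pitch < -rot_thresh_deg:
        labels.append("tilt-down")
    if t[2] > trans_thresh:
        labels.append("dolly-in")
    elif t[2] < -trans_thresh:
        labels.append("dolly-out")
    if t[0] > trans_thresh:
        labels.append("truck-right")
    elif t[0] < -trans_thresh:
        labels.append("truck-left")
    # Compound motions come out as e.g. "pan-left and tilt-up".
    return " and ".join(labels) if labels else "static"
```

A 5-degree yaw with no translation yields `"pan-right"`, a forward translation yields `"dolly-in"`, and a near-identity pose yields `"static"`, matching the label vocabulary used in the prompt example below.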
Evidence
- VideoLLMs with camera motion prompts outperform baselines on camera motion recognition benchmarks
- Performance improvement is most pronounced for complex multi-axis camera movements
- The approach works zero-shot on VideoLLMs without any fine-tuning
- Ablations indicate that the injected camera motion cues, rather than other geometric features, drive the gains
How to Apply
- For any video analysis pipeline requiring camera motion understanding, add a camera parameter extraction step using depth + SfM models
- Format the extracted camera motion parameters as a structured text prefix before your main video analysis prompt
- This approach is particularly valuable for cinematography analysis, action recognition, and autonomous driving video understanding
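The structured prefix described above needs one label per second, so per-frame predictions have to be pooled over time. A minimal sketch using majority vote within each one-second window (the label names, window size, and pooling rule are illustrative assumptions):

```python
from collections import Counter

def per_second_labels(frame_labels, fps):
    """Collapse per-frame motion labels into one label per second of video
    via majority vote within each one-second window."""
    return [
        Counter(frame_labels[i:i + fps]).most_common(1)[0][0]
        for i in range(0, len(frame_labels), fps)
    ]

# Hypothetical per-frame predictions for a 2-second clip sampled at 4 FPS.
frames = ["pan-left", "pan-left", "pan-left", "static",
          "static", "static", "static", "pan-right"]
print(per_second_labels(frames, fps=4))  # → ['pan-left', 'static']
```

The resulting list plugs directly into the per-second prompt template shown in the Code Example below.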
Code Example
# Structured prompt template for injecting camera motion information into VideoLLM
# Inject per-second camera motion predicted by VGGT into the prompt
per_second_motions = [
    "pan-left",               # second 1
    "static",                 # second 2
    "pan-right",              # second 3
    "pan-left and tilt-up",   # second 4 (compound motion)
    "static",                 # second 5
]
motion_header = "Per-second camera motion: [" + ", ".join(per_second_motions) + "]"

N, fps = 25, 5  # example values: 25 frames sampled at 5 FPS (5 seconds of video)
prompt = f"""Here are {N} consecutive video frames.
They are evenly sampled at a frame rate of {fps} FPS.
{motion_header}
Describe this video using the filmmaker's language, highlighting the lighting,
framing, video composition, and especially camera usage that connects
different frames. For example: "At the beginning, <video content>; then
<camera motion>, <video content>; ...; finally, <camera motion>, <video
content>". Make your description in a paragraph."""
# Prompt for CameraMotionVQA benchmark evaluation
vqa_prompt = """<video>
Identify the camera motion depicted in the video using standard cinematographic terminology.
Options:
(A) pan-left
(B) dolly-in and pan-right
(C) static
(D) tilt-up and truck-left
"""Terminology
Related Resources
Original Abstract
Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark--$\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark is publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.