Geometry-Guided Camera Motion Understanding in VideoLLMs
TL;DR Highlight
VideoLLMs struggle to recognize camera movements (pan/tilt/dolly); injecting camera-motion cues derived from 3D geometry models into the prompt markedly improves recognition.
Who Should Read
Researchers working on video understanding with LLMs, and developers building video analysis systems that need reliable camera motion awareness.
Core Mechanics
- Identified that VideoLLMs fail to accurately recognize camera motion types (pan, tilt, zoom, dolly, orbit)
- Camera motion understanding requires distinguishing camera movement from object movement — a challenging geometric problem
- Proposed extracting camera motion parameters from depth estimation and structure-from-motion (SfM) models
- These extracted parameters are formatted as natural language prompts and injected alongside video frames
- The camera motion-augmented prompts significantly improve VideoLLM performance on motion understanding tasks
- No fine-tuning required — works by augmenting the input context with geometric information
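The geometric step above can be sketched as a rule-based mapping from a relative camera pose (as output by a 3D foundation model between consecutive frames) to a motion-primitive label. The paper uses a learned temporal classifier; the thresholds, axis convention (x-right, y-down, z-forward), and function name below are illustrative assumptions, not the authors' implementation.

```python
import math

def classify_motion(R, t, rot_thresh_deg=1.0, trans_thresh=0.05):
    """Map a relative camera pose (rotation R, translation t, expressed in
    the first camera's frame) to a motion-primitive label.
    Rule-based sketch with hypothetical thresholds, for intuition only."""
    # Direction the camera's forward axis points after the motion: R @ [0, 0, 1].
    fwd = [R[0][2], R[1][2], R[2][2]]
    yaw = math.degrees(math.atan2(fwd[0], fwd[2]))                  # left/right rotation
    pitch = math.degrees(math.asin(max(-1.0, min(1.0, -fwd[1]))))   # up/down rotation

    labels = []
    if yaw > rot_thresh_deg:
        labels.append("pan-right")
    elif yaw < -rot_thresh_deg:
        labels.append("pan-left")
    if pitch > rot_thresh_deg:
        labels.append("tilt-up")
    elif pitch < -rot_thresh_deg:
        labels.append("tilt-down")
    if t[2] > trans_thresh:
        labels.append("dolly-in")
    elif t[2] < -trans_thresh:
        labels.append("dolly-out")
    if t[0] > trans_thresh:
        labels.append("truck-right")
    elif t[0] < -trans_thresh:
        labels.append("truck-left")
    # Compound motions come out as e.g. "pan-left and tilt-up".
    return " and ".join(labels) if labels else "static"
```

A 5-degree yaw with no translation yields `"pan-right"`, a forward translation yields `"dolly-in"`, and a near-identity pose yields `"static"`, matching the label vocabulary used in the prompt example below.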
Evidence
- VideoLLMs with camera motion prompts outperform baselines on camera motion recognition benchmarks
- Performance improvement is most pronounced for complex multi-axis camera movements
- The approach works zero-shot on VideoLLMs without any fine-tuning
- Ablations indicate that the injected camera motion cues, rather than other geometric features, drive the gains
How to Apply
- For any video analysis pipeline requiring camera motion understanding, add a camera parameter extraction step using depth + SfM models
- Format the extracted camera motion parameters as a structured text prefix before your main video analysis prompt
- This approach is particularly valuable for cinematography analysis, action recognition, and autonomous driving video understanding
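The structured prefix described above needs one label per second, so per-frame predictions have to be pooled over time. A minimal sketch using majority vote within each one-second window (the label names, window size, and pooling rule are illustrative assumptions):

```python
from collections import Counter

def per_second_labels(frame_labels, fps):
    """Collapse per-frame motion labels into one label per second of video
    via majority vote within each one-second window."""
    return [
        Counter(frame_labels[i:i + fps]).most_common(1)[0][0]
        for i in range(0, len(frame_labels), fps)
    ]

# Hypothetical per-frame predictions for a 2-second clip sampled at 4 FPS.
frames = ["pan-left", "pan-left", "pan-left", "static",
          "static", "static", "static", "pan-right"]
print(per_second_labels(frames, fps=4))  # → ['pan-left', 'static']
```

The resulting list plugs directly into the per-second prompt template shown in the Code Example below.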
Code Example
# Structured prompt template for injecting camera motion information into VideoLLM
# Inject per-second camera motion predicted by VGGT into the prompt
per_second_motions = [
    "pan-left",               # second 1
    "static",                 # second 2
    "pan-right",              # second 3
    "pan-left and tilt-up",   # second 4 (compound motion)
    "static",                 # second 5
]
motion_header = "Per-second camera motion: [" + ", ".join(per_second_motions) + "]"

N, fps = 25, 5  # example values: 25 frames sampled at 5 FPS (5 seconds of video)
prompt = f"""Here are {N} consecutive video frames.
They are evenly sampled at a frame rate of {fps} FPS.
{motion_header}
Describe this video using the filmmaker's language, highlighting the lighting,
framing, video composition, and especially camera usage that connects
different frames. For example: "At the beginning, <video content>; then
<camera motion>, <video content>; ...; finally, <camera motion>, <video
content>". Make your description in a paragraph."""
# Prompt for CameraMotionVQA benchmark evaluation
vqa_prompt = """<video>
Identify the camera motion depicted in the video using standard cinematographic terminology.
Options:
(A) pan-left
(B) dolly-in and pan-right
(C) static
(D) tilt-up and truck-left
"""Terminology
Related Resources
Original Abstract
Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark--$\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark is publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.