VideoAgent: Long-form Video Understanding with Large Language Model as Agent
TL;DR Highlight
An iterative frame-selection system that uses GPT-4 as an agent and achieves SOTA on long-video benchmarks while looking at only ~8 frames on average.
Who Should Read
ML engineers building video analysis pipelines or developing multimodal AI agents. Developers wanting to combine LLM + vision models to solve complex tasks.
Core Mechanics
- Agent system with GPT-4 as the central controller and CLIP + a vision-language model (VLM) as tools — rather than processing the full video at once, it iteratively retrieves only the frames it needs
- 3-step iterative loop: ① predict an answer from the information gathered so far → ② self-reflect to rate confidence (1-3) → ③ if insufficient, the LLM specifies which segment and what kind of frame it needs, and CLIP retrieves it
- The video is divided into segments for search to prevent temporal confusion — this greatly reduces mis-retrieval on time-conditioned queries like 'the sofa after leaving the room'
- CLIP pre-caches image features once and reuses them for every text query — only 1.9% of total computation, so retrieval is extremely cheap
- LLM comparison: GPT-4 (60.2%) > GPT-3.5 (48.8%) > LLaMA-2-70B (45.4%) > Mixtral-8x7B (37.8%) — the ability to emit structured JSON output is the key performance differentiator
- Frame budget auto-adjusts to question type: descriptive (5.9 frames) < causal reasoning (7.1) < temporal reasoning (7.8) — harder questions get more frames
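The 3-step loop above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `ask_llm`, `clip_retrieve`, and `caption_frames` are hypothetical stand-ins for GPT-4, CLIP retrieval, and the VLM captioner, replaced here with deterministic stubs so the control flow is runnable.

```python
def ask_llm(captions):
    """Stub for two GPT-4 calls: predict an answer, then self-reflect on
    confidence (1-3). Here confidence simply grows with evidence gathered."""
    conf = min(3, 1 + len(captions) // 3)
    return "answer", conf

def clip_retrieve(query, segment, k=2):
    """Stub CLIP retrieval: return k evenly spaced frame indices
    from the requested (lo, hi) segment."""
    lo, hi = segment
    step = max(1, (hi - lo) // (k + 1))
    return [lo + step * (i + 1) for i in range(k)]

def caption_frames(frames):
    """Stub VLM: one caption per frame index."""
    return {f: f"caption of frame {f}" for f in frames}

def video_agent(n_frames=180, init_k=5, max_rounds=3):
    # Start from a sparse uniform sample, then iterate.
    captions = caption_frames(clip_retrieve("overview", (0, n_frames), k=init_k))
    segments = [(0, 60), (60, 120), (120, 180)]  # LLM-chosen in the paper; fixed here
    answer = None
    for round_ in range(max_rounds):
        answer, conf = ask_llm(captions)          # ① predict  ② self-reflect
        if conf >= 3:                             # sufficient info: stop early
            break
        new = clip_retrieve("missing detail", segments[round_])  # ③ targeted search
        captions.update(caption_frames(new))
    return answer, len(captions)
```

With these stubs the agent terminates after one retrieval round, having captioned 7 of 180 frames — the same "look at few frames" behavior the paper reports.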
Evidence
- EgoSchema full set 54.1% — +3.8 points over the previous SOTA LLoVi (50.3%), using 8.4 frames vs LLoVi's 180 (a ~20x reduction)
- NExT-QA validation set 71.3% — +3.6 points over LLoVi (67.7%); zero-shot, it even surpasses the supervised SOTA HiTeA (63.1%)
- Removing self-reflection: frames used rise from 8.4 to 11.8 while accuracy drops 60.2% → 59.6% — looking at more, less-targeted frames actually hurts
- Removing segment selection: accuracy drops 60.2% → 56.6% (−3.6 points) — specifying the temporal segment is critical
How to Apply
- In RAG pipelines, replace 'retrieve everything at once' with a loop where the LLM inspects the current context, identifies what is missing, and retrieves again — especially effective for long documents and long videos
- When building multimodal agents, separate the modules: a VLM converts images to text → the LLM reasons over text only → CLIP retrieves relevant images — this pattern enables visual understanding without GPT-4V
- Apply the self-reflection pattern: after the LLM answers, add a second prompt asking 'is this information sufficient?' — this enables early termination and cuts unnecessary retrieval
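The self-reflection step in the last bullet boils down to a second prompt that asks for a machine-readable confidence rating. A minimal sketch follows; `call_llm` is a hypothetical stand-in for any chat-completion API, stubbed here with a fixed JSON reply, and the prompt wording is illustrative, not the paper's exact prompt.

```python
import json

REFLECT_PROMPT = (
    "You previously answered a question about a video from frame captions. "
    "Rate whether the information was sufficient on a 1-3 scale and reply "
    'as JSON: {"confidence": <1|2|3>, "missing": "<what to retrieve next>"}'
)

def call_llm(prompt):
    # Stub: a real call would hit a chat-completion API with JSON output.
    return '{"confidence": 2, "missing": "frames after the person leaves"}'

def self_reflect(captions, tentative_answer):
    prompt = f"{REFLECT_PROMPT}\nCaptions: {captions}\nAnswer: {tentative_answer}"
    reply = json.loads(call_llm(prompt))
    done = reply["confidence"] >= 3
    return done, reply.get("missing", "")
```

The "missing" field is what makes the pattern actionable: it feeds directly into the next retrieval query instead of blindly fetching more context.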
Code Example
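A sketch of the cached, segment-scoped CLIP retrieval described in Core Mechanics: frame features are embedded once up front and reused for every text query, and each search is restricted to an LLM-specified segment. `toy_embed` is a hypothetical stand-in for a real CLIP encoder — a deterministic seeded unit vector, so identical texts match exactly and the example runs without model weights.

```python
import zlib
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a CLIP encoder: deterministic random unit vector
    # seeded by the text (crc32 is stable across runs).
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Pre-cache: embed every frame once, up front; reuse for every query.
frame_feats = np.stack([toy_embed(f"frame {i}") for i in range(180)])

def retrieve(query, segment, k=2):
    # Restrict similarity search to the LLM-specified (lo, hi) segment
    # to avoid temporal confusion across the video.
    lo, hi = segment
    sims = frame_feats[lo:hi] @ toy_embed(query)  # cosine sim (unit vectors)
    top = np.argsort(-sims)[:k]
    return sorted(int(lo + i) for i in top)
```

Because the frame features are computed once, each new query costs only a matrix-vector product over the segment — this is what keeps CLIP at ~1.9% of total computation in the paper's accounting.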
Original Abstract
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.