VideoAgent: Long-form Video Understanding with Large Language Model as Agent
TL;DR Highlight
An iterative frame-selection system that uses GPT-4 as an agent and achieves SOTA on long-video benchmarks while looking at only ~8 frames on average.
Who Should Read
ML engineers building video analysis pipelines or developing multimodal AI agents. Developers wanting to combine LLM + vision models to solve complex tasks.
Core Mechanics
- Agent system with GPT-4 as the central controller and CLIP + a vision-language model (VLM) as tools — rather than processing the full video at once, it iteratively retrieves only the frames it needs
- 3-step iterative loop: ① predict an answer from the information gathered so far → ② self-reflect to rate confidence (1-3) → ③ if insufficient, the LLM specifies which segment and what kind of frame it needs, and CLIP retrieves it
- The video is divided into segments for search to prevent temporal confusion — this greatly reduces mis-retrieval on time-conditioned queries like 'the sofa after leaving the room'
- CLIP pre-caches image features once and reuses them for every text query — only 1.9% of total computation, so retrieval is extremely cheap
- LLM comparison: GPT-4 (60.2%) > GPT-3.5 (48.8%) > LLaMA-2-70B (45.4%) > Mixtral-8x7B (37.8%) — the ability to emit structured JSON output is the key performance differentiator
- Frame budget auto-adjusts to question type: descriptive (5.9 frames) < causal reasoning (7.1) < temporal reasoning (7.8) — harder questions get more frames
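The 3-step loop above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `ask_llm`, `clip_retrieve`, and `caption_frames` are hypothetical stand-ins for GPT-4, CLIP retrieval, and the VLM captioner, replaced here with deterministic stubs so the control flow is runnable.

```python
def ask_llm(captions):
    """Stub for two GPT-4 calls: predict an answer, then self-reflect on
    confidence (1-3). Here confidence simply grows with evidence gathered."""
    conf = min(3, 1 + len(captions) // 3)
    return "answer", conf

def clip_retrieve(query, segment, k=2):
    """Stub CLIP retrieval: return k evenly spaced frame indices
    from the requested (lo, hi) segment."""
    lo, hi = segment
    step = max(1, (hi - lo) // (k + 1))
    return [lo + step * (i + 1) for i in range(k)]

def caption_frames(frames):
    """Stub VLM: one caption per frame index."""
    return {f: f"caption of frame {f}" for f in frames}

def video_agent(n_frames=180, init_k=5, max_rounds=3):
    # Start from a sparse uniform sample, then iterate.
    captions = caption_frames(clip_retrieve("overview", (0, n_frames), k=init_k))
    segments = [(0, 60), (60, 120), (120, 180)]  # LLM-chosen in the paper; fixed here
    answer = None
    for round_ in range(max_rounds):
        answer, conf = ask_llm(captions)          # ① predict  ② self-reflect
        if conf >= 3:                             # sufficient info: stop early
            break
        new = clip_retrieve("missing detail", segments[round_])  # ③ targeted search
        captions.update(caption_frames(new))
    return answer, len(captions)
```

With these stubs the agent terminates after one retrieval round, having captioned 7 of 180 frames — the same "look at few frames" behavior the paper reports.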
Evidence
- EgoSchema full set 54.1% — +3.8 points over the previous SOTA LLoVi (50.3%), using 8.4 frames vs LLoVi's 180 (a ~20x reduction)
- NExT-QA validation set 71.3% — +3.6 points over LLoVi (67.7%); zero-shot, it even surpasses the supervised SOTA HiTeA (63.1%)
- Removing self-reflection: frames used rise from 8.4 to 11.8 while accuracy drops 60.2% → 59.6% — looking at more, less-targeted frames actually hurts
- Removing segment selection: accuracy drops 60.2% → 56.6% (−3.6 points) — specifying the temporal segment is critical
How to Apply
- In RAG pipelines, replace 'retrieve everything at once' with a loop where the LLM inspects the current context, identifies what is missing, and retrieves again — especially effective for long documents and long videos
- When building multimodal agents, separate the modules: a VLM converts images to text → the LLM reasons over text only → CLIP retrieves relevant images — this pattern enables visual understanding without GPT-4V
- Apply the self-reflection pattern: after the LLM answers, add a second prompt asking 'is this information sufficient?' — this enables early termination and cuts unnecessary retrieval
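The self-reflection step in the last bullet boils down to a second prompt that asks for a machine-readable confidence rating. A minimal sketch follows; `call_llm` is a hypothetical stand-in for any chat-completion API, stubbed here with a fixed JSON reply, and the prompt wording is illustrative, not the paper's exact prompt.

```python
import json

REFLECT_PROMPT = (
    "You previously answered a question about a video from frame captions. "
    "Rate whether the information was sufficient on a 1-3 scale and reply "
    'as JSON: {"confidence": <1|2|3>, "missing": "<what to retrieve next>"}'
)

def call_llm(prompt):
    # Stub: a real call would hit a chat-completion API with JSON output.
    return '{"confidence": 2, "missing": "frames after the person leaves"}'

def self_reflect(captions, tentative_answer):
    prompt = f"{REFLECT_PROMPT}\nCaptions: {captions}\nAnswer: {tentative_answer}"
    reply = json.loads(call_llm(prompt))
    done = reply["confidence"] >= 3
    return done, reply.get("missing", "")
```

The "missing" field is what makes the pattern actionable: it feeds directly into the next retrieval query instead of blindly fetching more context.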
Code Example
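A sketch of the cached, segment-scoped CLIP retrieval described in Core Mechanics: frame features are embedded once up front and reused for every text query, and each search is restricted to an LLM-specified segment. `toy_embed` is a hypothetical stand-in for a real CLIP encoder — a deterministic seeded unit vector, so identical texts match exactly and the example runs without model weights.

```python
import zlib
import numpy as np

def toy_embed(text, dim=64):
    # Stand-in for a CLIP encoder: deterministic random unit vector
    # seeded by the text (crc32 is stable across runs).
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Pre-cache: embed every frame once, up front; reuse for every query.
frame_feats = np.stack([toy_embed(f"frame {i}") for i in range(180)])

def retrieve(query, segment, k=2):
    # Restrict similarity search to the LLM-specified (lo, hi) segment
    # to avoid temporal confusion across the video.
    lo, hi = segment
    sims = frame_feats[lo:hi] @ toy_embed(query)  # cosine sim (unit vectors)
    top = np.argsort(-sims)[:k]
    return sorted(int(lo + i) for i in top)
```

Because the frame features are computed once, each new query costs only a matrix-vector product over the segment — this is what keeps CLIP at ~1.9% of total computation in the paper's accounting.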
Original Abstract
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.