Video Summarization with Large Language Models
TL;DR Highlight
Convert video frames to text captions, then have an LLM score frame importance for video summarization, achieving state-of-the-art results over traditional visual-feature-based approaches.
Who Should Read
ML engineers building video content search or highlight extraction pipelines. Developers looking to apply multimodal LLMs to real-world video processing.
Core Mechanics
- Each frame is captioned in one sentence by LLaVA-1.5-7B → Llama-2-13B then scores the center frame's importance (0-10) over a sliding window of 7 frames
- Extracting internal embeddings (RMS Norm layer outputs) instead of final text answers, then integrating global context via Self-Attention, outperforms text-only scoring
- In-context learning with just a few examples in the prompt significantly boosts scoring quality without any fine-tuning
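The second bullet above can be sketched numerically. Below is a toy, single-head self-attention pass in NumPy standing in for the global refinement over per-frame embeddings; the dimensions, the random weights (Wq, Wk, Wv), and the linear scoring head are illustrative assumptions, not the paper's trained modules.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over per-frame embeddings X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # (n, n) pairwise attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                             # globally contextualized embeddings

rng = np.random.default_rng(0)
n, d = 12, 16                                # toy sizes: 12 frames, 16-dim embeddings
X = rng.normal(size=(n, d))                  # stand-in for RMSNorm-layer outputs
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
refined = self_attention(X, Wq, Wk, Wv)      # (n, d)
importance = refined @ rng.normal(size=d)    # toy linear head -> one score per frame
```

In the paper's setup the inputs would be the LLM's internal embeddings per window rather than random vectors, and the attention weights would be learned.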
Evidence
- SumMe benchmark: Kendall's τ 0.253 / Spearman's ρ 0.282, surpassing previous best (CSTA τ 0.246)
- TVSum benchmark: τ 0.211 / ρ 0.275, surpassing previous best (DMASum τ 0.203)
- Zero-shot MR.HiSum evaluation: τ/ρ both 0.440, beating VASNet (0.364) and PGL-SUM
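For context on the metrics above: the benchmarks compare predicted frame-importance scores against human annotations by rank correlation. A minimal pure-Python Kendall's tau-a (no tie correction; the score lists are made up for illustration):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a rank correlation between two equal-length score lists."""
    assert len(a) == len(b)
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # both lists rank the pair the same way
        elif s < 0:
            discordant += 1   # the lists disagree on this pair
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs

predicted = [7, 3, 9, 2, 5, 8]   # toy LLM importance scores
human     = [6, 4, 9, 1, 7, 5]   # toy annotator scores
print(kendall_tau(predicted, human))  # → 0.6
```

The benchmark numbers use the tie-corrected tau-b variant (e.g. `scipy.stats.kendalltau`), but the pair-counting idea is the same.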
How to Apply
- Building a video highlight extraction system: pipe frame captions from a multimodal LLM (LLaVA) into sliding-window importance scoring with an LLM (Llama); a 7-frame window is recommended.
- To reduce LLM API costs: sample keyframes at 1-2 second intervals instead of every frame, apply caption+scoring only to those, then interpolate scores for remaining frames.
- If fine-tuning resources are available: use the internal embedding extraction approach (RMS Norm layer outputs + Self-Attention) for better results than text-based scoring.
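The cost-reduction tip above (caption and score sparse keyframes, then interpolate) can be sketched as follows; the frame rate, keyframe stride, and scores are made-up values:

```python
def interpolate_scores(key_indices, key_scores, n_frames):
    """Linearly interpolate sparse keyframe scores to every frame index."""
    scores = [0.0] * n_frames
    for (i0, s0), (i1, s1) in zip(
        zip(key_indices, key_scores), zip(key_indices[1:], key_scores[1:])
    ):
        for t in range(i0, i1):                  # fill frames between keyframes
            frac = (t - i0) / (i1 - i0)
            scores[t] = s0 + frac * (s1 - s0)
    # hold the boundary scores before the first and after the last keyframe
    for t in range(key_indices[0]):
        scores[t] = key_scores[0]
    scores[key_indices[-1]:] = [key_scores[-1]] * (n_frames - key_indices[-1])
    return scores

# 30 fps video: run caption+scoring on one keyframe per second, interpolate the rest
fps, n_frames = 30, 150
key_indices = list(range(0, n_frames, fps))      # [0, 30, 60, 90, 120]
key_scores  = [2.0, 8.0, 5.0, 5.0, 9.0]          # toy LLM scores for the keyframes
dense = interpolate_scores(key_indices, key_scores, n_frames)
print(dense[15])  # midway between frames 0 and 30 → 5.0
```

This cuts LLM calls by roughly the stride factor (30× here) at the cost of smoothing over fast events shorter than the sampling interval.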
Code Example
```python
# Core prompt structure (the LLM scoring call)
instruction = """
You are an intelligent chatbot designed to critically assess the importance
of a central frame within a specific context.
Evaluate the frame using:
1. Narrative Significance
2. Uniqueness and Novelty
3. Action and Dynamics
"""

def build_prompt(captions: list[str], center_idx: int) -> str:
    """Build a prompt from the captions in a sliding window and the center frame index."""
    frames_text = "\n".join(
        f"#{i+1}: {cap}" for i, cap in enumerate(captions)
    )
    return f"""{instruction}
Please evaluate the importance score of the central frame #{center_idx+1}
in the following {len(captions)} frames. Be stingy with scores.
---
{frames_text}
---
Provide your score as an integer 0-10.
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION.
Answer score:"""

# Usage example (window_size=7)
# all_captions: one caption per frame, e.g. produced by LLaVA
window_size = 7
for t in range(len(all_captions)):
    start = max(0, t - window_size // 2)
    end = min(len(all_captions), t + window_size // 2 + 1)
    window_captions = all_captions[start:end]
    center = t - start
    prompt = build_prompt(window_captions, center)
    # raw = llm.generate(prompt)                    # call Llama-2-13B or similar
    # score = int(re.search(r"\d+", raw).group())   # parse the integer reply
```

Terminology
Original Abstract
The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.