Video Summarization with Large Language Models
TL;DR Highlight
Convert video frames to text captions, then have an LLM score frame importance for video summarization, achieving state-of-the-art results over traditional visual-feature-based approaches.
Who Should Read
ML engineers building video content search or highlight extraction pipelines. Developers looking to apply multimodal LLMs to real-world video processing.
Core Mechanics
- Each frame is captioned in one sentence by LLaVA-1.5-7B → Llama-2-13B then scores the center frame's importance (0-10) over a sliding window of 7 frames
- Extracting internal embeddings (RMS Norm layer outputs) instead of final text answers, then integrating global context via Self-Attention, outperforms text-only scoring
- In-context learning with just a few examples in the prompt significantly boosts scoring quality without any fine-tuning
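The second bullet above can be sketched numerically. Below is a toy, single-head self-attention pass in NumPy standing in for the global refinement over per-frame embeddings; the dimensions, the random weights (Wq, Wk, Wv), and the linear scoring head are illustrative assumptions, not the paper's trained modules.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over per-frame embeddings X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # (n, n) pairwise attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                             # globally contextualized embeddings

rng = np.random.default_rng(0)
n, d = 12, 16                                # toy sizes: 12 frames, 16-dim embeddings
X = rng.normal(size=(n, d))                  # stand-in for RMSNorm-layer outputs
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
refined = self_attention(X, Wq, Wk, Wv)      # (n, d)
importance = refined @ rng.normal(size=d)    # toy linear head -> one score per frame
```

In the paper's setup the inputs would be the LLM's internal embeddings per window rather than random vectors, and the attention weights would be learned.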
Evidence
- SumMe benchmark: Kendall's τ 0.253 / Spearman's ρ 0.282, surpassing previous best (CSTA τ 0.246)
- TVSum benchmark: τ 0.211 / ρ 0.275, surpassing previous best (DMASum τ 0.203)
- Zero-shot MR.HiSum evaluation: τ/ρ both 0.440, beating VASNet (0.364) and PGL-SUM
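For context on the metrics above: the benchmarks compare predicted frame-importance scores against human annotations by rank correlation. A minimal pure-Python Kendall's tau-a (no tie correction; the score lists are made up for illustration):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a rank correlation between two equal-length score lists."""
    assert len(a) == len(b)
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # both lists rank the pair the same way
        elif s < 0:
            discordant += 1   # the lists disagree on this pair
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs

predicted = [7, 3, 9, 2, 5, 8]   # toy LLM importance scores
human     = [6, 4, 9, 1, 7, 5]   # toy annotator scores
print(kendall_tau(predicted, human))  # → 0.6
```

The benchmark numbers use the tie-corrected tau-b variant (e.g. `scipy.stats.kendalltau`), but the pair-counting idea is the same.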
How to Apply
- Building a video highlight extraction system: pipe frame captions from a multimodal LLM (LLaVA) into sliding-window importance scoring with an LLM (Llama); a 7-frame window is recommended.
- To reduce LLM API costs: sample keyframes at 1-2 second intervals instead of every frame, apply caption+scoring only to those, then interpolate scores for remaining frames.
- If fine-tuning resources are available: use the internal embedding extraction approach (RMS Norm layer outputs + Self-Attention) for better results than text-based scoring.
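The cost-reduction tip above (caption and score sparse keyframes, then interpolate) can be sketched as follows; the frame rate, keyframe stride, and scores are made-up values:

```python
def interpolate_scores(key_indices, key_scores, n_frames):
    """Linearly interpolate sparse keyframe scores to every frame index."""
    scores = [0.0] * n_frames
    for (i0, s0), (i1, s1) in zip(
        zip(key_indices, key_scores), zip(key_indices[1:], key_scores[1:])
    ):
        for t in range(i0, i1):                  # fill frames between keyframes
            frac = (t - i0) / (i1 - i0)
            scores[t] = s0 + frac * (s1 - s0)
    # hold the boundary scores before the first and after the last keyframe
    for t in range(key_indices[0]):
        scores[t] = key_scores[0]
    scores[key_indices[-1]:] = [key_scores[-1]] * (n_frames - key_indices[-1])
    return scores

# 30 fps video: run caption+scoring on one keyframe per second, interpolate the rest
fps, n_frames = 30, 150
key_indices = list(range(0, n_frames, fps))      # [0, 30, 60, 90, 120]
key_scores  = [2.0, 8.0, 5.0, 5.0, 9.0]          # toy LLM scores for the keyframes
dense = interpolate_scores(key_indices, key_scores, n_frames)
print(dense[15])  # midway between frames 0 and 30 → 5.0
```

This cuts LLM calls by roughly the stride factor (30× here) at the cost of smoothing over fast events shorter than the sampling interval.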
Code Example
```python
# Core prompt structure (the LLM scoring call)
instruction = """
You are an intelligent chatbot designed to critically assess the importance
of a central frame within a specific context.
Evaluate the frame using:
1. Narrative Significance
2. Uniqueness and Novelty
3. Action and Dynamics
"""

def build_prompt(captions: list[str], center_idx: int) -> str:
    """Build a prompt from the captions in a sliding window and the center frame index."""
    frames_text = "\n".join(
        f"#{i+1}: {cap}" for i, cap in enumerate(captions)
    )
    return f"""{instruction}
Please evaluate the importance score of the central frame #{center_idx+1}
in the following {len(captions)} frames. Be stingy with scores.
---
{frames_text}
---
Provide your score as an integer 0-10.
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION.
Answer score:"""

# Usage example (window_size=7)
# all_captions: one caption per frame, e.g. produced by LLaVA
window_size = 7
for t in range(len(all_captions)):
    start = max(0, t - window_size // 2)
    end = min(len(all_captions), t + window_size // 2 + 1)
    window_captions = all_captions[start:end]
    center = t - start
    prompt = build_prompt(window_captions, center)
    # raw = llm.generate(prompt)                    # call Llama-2-13B or similar
    # score = int(re.search(r"\d+", raw).group())   # parse the integer reply
```

Terminology
Original Abstract
The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.