LLM-Based Video Summarization: Extracting Key Frames with Captions and Local-Global Context

Video Summarization with Large Language Models

Apr 15, 2025 • Min Jung Lee, Dayoung Gong, Minsu Cho

TL;DR Highlight

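Converts video frames into text captions and has an LLM score each frame's importance to summarize the video, achieving state-of-the-art results compared with prior methods based on visual features.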

Who Should Read

ML engineers building video content search or highlight extraction pipelines, especially developers who want to apply multimodal LLMs to real video processing.

Core Mechanics

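Each frame is converted into a one-sentence caption with LLaVA-1.5-7B; Llama-2-13B then assigns an importance score (0 to 10) to the center frame of each 7-frame sliding window.
Instead of the LLM's final answer text, internal embeddings (the output of an RMS Norm layer) are extracted and passed through self-attention to integrate global context; this performs better than using the text answer alone.
In-context learning gives competitive performance without fine-tuning; only the global self-attention block is trained, while the M-LLM and the LLM stay frozen.
Prompt strategy experiments: one sentence describing the whole frame works better than separate descriptions of the central subject and the background, and asking directly for a numeric score works better than asking for a text summary.
Zero-shot transfer to the MR.HiSum dataset shows better generalization than existing methods.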
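
The global aggregation step in the list above can be illustrated with a minimal sketch. It assumes each frame already has an embedding vector (small hand-written lists stand in for the LLM's internal embeddings) and applies single-head scaled dot-product self-attention with identity projections, so nothing is learned. The trained block in the paper would also need learned projections and a scoring head, which this sketch omits; the function names are hypothetical.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames: list[list[float]]) -> list[list[float]]:
    """Single-head scaled dot-product self-attention with identity Q/K/V.

    Each output vector is a weighted average of ALL frame vectors, so every
    frame sees the whole video rather than only its local window.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in frames:
        # Similarity of this frame to every frame in the video.
        logits = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(logits)
        # Mix all frame vectors using the attention weights.
        mixed = [
            sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Toy stand-ins for per-frame embeddings (assumed values, not real LLM outputs).
frame_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
contextualized = self_attention(frame_embeddings)
```

Because the attention weights for each frame sum to 1, every output is a convex combination of the inputs: similar frames pull toward each other, which is how distant but related frames can influence one another's final score.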

Evidence

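SumMe benchmark: Kendall's τ 0.253 / Spearman's ρ 0.282, surpassing the previous best model (CSTA, τ 0.246).
TVSum benchmark: τ 0.211 / ρ 0.275, surpassing the previous best (DMASum, τ 0.203).
Zero-shot MR.HiSum evaluation: both τ and ρ are 0.440, clearly ahead of VASNet (0.364), PGL-SUM (0.375), and DSNet (0.362).
LLM alone (zero-shot, no global aggregator) reaches τ 0.170, while LLMVS reaches τ 0.253: adding global context integration alone improves τ by about 49%.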
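
For readers less familiar with the two rank correlations reported above, here is a small pure-Python sketch of both, plus the arithmetic behind the "about 49%" figure. It assumes there are no tied scores; real evaluations typically rely on library implementations that handle ties, such as `scipy.stats.kendalltau` and `scipy.stats.spearmanr`. The toy score lists are invented for illustration.

```python
def kendall_tau(pred: list[float], truth: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs. Assumes no ties."""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (pred[i] - pred[j]) * (truth[i] - truth[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def ranks(values: list[float]) -> list[int]:
    """Rank of each value (1 = smallest). Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(pred: list[float], truth: list[float]) -> float:
    """Spearman's rho via the rank-difference formula. Assumes no ties."""
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(pred), ranks(truth)))
    return 1 - 6 * d2 / (n * (n**2 - 1))


# Toy importance scores (assumed values for illustration only).
predicted = [0.1, 0.4, 0.3, 0.9]
human = [0.2, 0.3, 0.5, 0.8]
tau = kendall_tau(predicted, human)
rho = spearman_rho(predicted, human)

# Relative gain quoted above: LLM alone (tau 0.170) vs. LLMVS (tau 0.253).
relative_gain = (0.253 - 0.170) / 0.170
```

Both metrics compare the ordering of predicted scores with the ordering of human annotations, so they reward getting the ranking of frames right rather than matching the absolute score values.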

How to Apply

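Building a video highlight extraction system: generate per-frame captions with a multimodal LLM such as LLaVA, then ask Llama for importance scores over a sliding window (7 frames recommended), chaining the two steps into a sequential pipeline.
If LLM API call costs are a concern: request captions and scores only for keyframes sampled every 1 to 2 seconds instead of every frame, and cover the remaining frames by interpolating the scores.
To test immediately without fine-tuning: use the prompt structure from Figure 3 / Table C1 (Instruction + 3 examples + Query) as is, replacing the examples with 3 samples you labeled yourself.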
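
The cost-saving tip above leaves the interpolation method open. A minimal sketch using linear interpolation, which is one reasonable choice and not something the paper prescribes, could look like this. The function name `interpolate_scores` and the example indices and scores are hypothetical.

```python
def interpolate_scores(
    key_indices: list[int], key_scores: list[float], num_frames: int
) -> list[float]:
    """Linearly interpolate scores for all frames from sparsely scored keyframes.

    key_indices must be strictly increasing and within [0, num_frames).
    Frames before the first keyframe or after the last one reuse the
    nearest keyframe's score.
    """
    if not key_indices or len(key_indices) != len(key_scores):
        raise ValueError("need matching, non-empty keyframe indices and scores")
    scores = []
    seg = 0  # current segment is [key_indices[seg], key_indices[seg + 1]]
    for t in range(num_frames):
        if t <= key_indices[0]:
            scores.append(float(key_scores[0]))
            continue
        if t >= key_indices[-1]:
            scores.append(float(key_scores[-1]))
            continue
        # Advance to the segment that contains frame t.
        while key_indices[seg + 1] < t:
            seg += 1
        left, right = key_indices[seg], key_indices[seg + 1]
        frac = (t - left) / (right - left)
        scores.append(key_scores[seg] + frac * (key_scores[seg + 1] - key_scores[seg]))
    return scores


# Example: a 9-frame clip where only frames 0, 4, and 8 were scored by the LLM.
dense = interpolate_scores([0, 4, 8], [2, 6, 4], num_frames=9)
```

With this setup the LLM is called three times instead of nine; the trade-off is that short events falling entirely between two keyframes cannot raise the score.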

Code Example

# Core prompt structure (example call to the LLM π)
instruction = """
You are an intelligent chatbot designed to critically assess the importance 
of a central frame within a specific context.
Evaluate the frame using:
1. Narrative Significance
2. Uniqueness and Novelty  
3. Action and Dynamics
"""

def build_prompt(captions: list[str], center_idx: int) -> str:
    """Build the prompt from the captions in a sliding window and the center frame's index."""
    frames_text = "\n".join(
        f"#{i+1}: {cap}" for i, cap in enumerate(captions)
    )
    return f"""{instruction}

Please evaluate the importance score of the central frame #{center_idx+1} 
in following {len(captions)} frames. Be stingy with scores.
----
{frames_text}
----
Provide your score as an integer 0-10.
DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION.
Answer score:"""

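# Usage example (window_size=7)
# all_captions holds one caption per frame; these are placeholder values.
all_captions = [f"placeholder caption {i}" for i in range(10)]
window_size = 7
for t in range(len(all_captions)):
    # The window is truncated at the start and end of the video.
    start = max(0, t - window_size // 2)
    end = min(len(all_captions), t + window_size // 2 + 1)
    window_captions = all_captions[start:end]
    center = t - start  # position of frame t inside the (possibly truncated) window
    prompt = build_prompt(window_captions, center)
    # score = llm.generate(prompt)  # call Llama-2-13B or a similar model here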
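
The commented-out LLM call above needs two pieces around it in practice: labeled examples placed before the query for in-context learning, and a parser that turns the model's text reply into a number. The sketch below shows one way to do both. The helper names `with_examples` and `parse_score`, the blank-line layout between examples, and the sample strings are all assumptions; the exact layout in the paper's Figure 3 / Table C1 is not reproduced here.

```python
import re
from typing import Optional


def with_examples(
    instruction: str, examples: list[tuple[str, int]], query: str
) -> str:
    """Assemble Instruction + labeled examples + Query into one prompt string.

    Each example is a (query_text, score) pair whose query_text ends with
    "Answer score:", so appending the score completes the demonstration.
    """
    parts = [instruction.strip()]
    for example_query, score in examples:
        parts.append(f"{example_query.strip()} {score}")
    parts.append(query.strip())
    return "\n\n".join(parts)


def parse_score(reply: str, low: int = 0, high: int = 10) -> Optional[int]:
    """Extract the first integer from an LLM reply and clamp it to [low, high].

    Returns None when the reply contains no integer, so the caller can
    retry or fall back to a default. A decimal such as "7.5" yields 7.
    """
    match = re.search(r"-?\d+", reply)
    if match is None:
        return None
    return max(low, min(high, int(match.group())))


# Hypothetical labeled example and query, shortened for illustration.
prompt = with_examples(
    "Rate the central frame.",
    [("#1: a dog runs\nAnswer score:", 6)],
    "#1: a blank wall\nAnswer score:",
)
score = parse_score("Answer score: 3")
```

Clamping guards against replies outside the requested 0 to 10 range, and returning `None` keeps unparseable replies from silently becoming a score of zero.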

Terminology

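M-LLM: A multimodal LLM that looks at images or video and understands and generates text. Think of it as a language model with "eyes", like LLaVA.

In-context learning: A technique that steers an LLM toward the desired behavior by placing a few examples in the prompt, without fine-tuning. The model follows the pattern right after seeing the examples.

Sliding window: A way of processing a long sequence in order by cutting it into fixed-size windows. In this paper, frames are grouped 7 at a time to evaluate the center frame's importance.

Self-Attention: A mechanism in which every element of a sequence attends to every other element to capture context. It is the core of the Transformer and can capture relationships between frames that are far apart.

RMS Norm: A normalization operation placed between LLM layers. Extracting embeddings right after this layer is said to yield representations with less language-specific bias that are more useful for downstream tasks.

KTS: Short for Kernel Temporal Segmentation, an algorithm that automatically splits a video into meaningful shots (scenes).

LoRA: A fine-tuning technique that adds small low-rank matrices instead of retraining the whole model. It greatly reduces the number of trainable parameters, making LLM fine-tuning possible on fewer GPUs.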

Original Abstract

The exponential increase in video content poses significant challenges in terms of efficient navigation, search, and retrieval, thus requiring advanced video summarization techniques. Existing video summarization methods, which heavily rely on visual features and temporal dynamics, often fail to capture the semantics of video content, resulting in incomplete or incoherent summaries. To tackle the challenge, we propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs), expecting that the knowledge learned from massive data enables LLMs to evaluate video frames in a manner that better aligns with diverse semantics and human judgments, effectively addressing the inherent subjectivity in defining keyframes. Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multi-modal Large Language Model (M-LLM) and then assesses the importance of each frame using an LLM, based on the captions in its local context. These local importance scores are refined through a global attention mechanism in the entire context of video captions, ensuring that our summaries effectively reflect both the details and the overarching narrative. Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks, highlighting the potential of LLMs in the processing of multimedia content.