Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents

Jan 7, 2026 • Yuanchen Bei, Tianxin Wei, Xuying Ning +7

TL;DR Highlight

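A benchmark that systematically measures how well an AI retains, reasons over, and updates its memory across dozens of conversation sessions that mix images and text.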

Who Should Read

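ML engineers and researchers adding long-term memory to multimodal AI agents or chatbots, especially developers who want to understand the limits of RAG-based memory systems and where to improve them.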

Core Mechanics

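Converting images to text captions before storing them is lossy. Visual patterns need to be preserved directly in memory for Test-Time Learning (adapting to new examples at inference time) to work properly.
Simple multimodal RAG (MuRAG) scores higher overall than more elaborately structured memory systems (NGM, AUGUSTUS). How information is preserved matters more than architectural complexity.
Stacking every image into memory (Full Memory MM) performs worse than using text alone. The large number of visual tokens adds noise and crowds out the text that actually matters.
Increasing the retrieval K raises Recall but sharply lowers Precision, so actual QA performance saturates or declines around K=10. Retrieving accurately matters more than retrieving a lot.
Every memory system performs poorly on Conflict Detection and Knowledge Resolution (updating stored knowledge). A stronger backbone model yields only small gains, which points to a need for fundamental design changes.
A strong backbone (such as Gemini-2.5-Flash-Lite) improves extraction performance, but the limits in reasoning and knowledge management remain almost unchanged. The problem lies in the memory structure itself, not in model size.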
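
The first point above (captions alone lose visual information) can be illustrated with a minimal sketch. It assumes a toy in-memory store in which each entry keeps an image vector next to its caption vector. The names `MemoryEntry`, `MultimodalMemory`, and `retrieve`, as well as the hand-written 3-dimensional vectors, are hypothetical stand-ins for a real multimodal embedding model. This is not the MuRAG implementation evaluated in the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class MemoryEntry:
    text: str
    text_vec: List[float]                    # embedding of the caption/text
    image_vec: Optional[List[float]] = None  # kept alongside the caption, not replaced by it


class MultimodalMemory:
    """Hypothetical toy store; a real system would use a vector index."""

    def __init__(self) -> None:
        self.entries: List[MemoryEntry] = []

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def retrieve(self, query_vec: Sequence[float], k: int = 2,
                 use_images: bool = True) -> List[MemoryEntry]:
        """Score each entry by its best-matching modality and return the top k."""
        def score(entry: MemoryEntry) -> float:
            s = cosine(query_vec, entry.text_vec)
            if use_images and entry.image_vec is not None:
                s = max(s, cosine(query_vec, entry.image_vec))
            return s
        return sorted(self.entries, key=score, reverse=True)[:k]


memory = MultimodalMemory()
memory.add(MemoryEntry("User shared a photo of their dog", [1.0, 0.0, 0.0],
                       image_vec=[0.0, 1.0, 0.0]))
memory.add(MemoryEntry("User likes hiking", [0.2, 0.1, 0.9]))

query = [0.1, 0.9, 0.0]  # stands in for an embedded visual query
print(memory.retrieve(query, k=1, use_images=False)[0].text)
print(memory.retrieve(query, k=1, use_images=True)[0].text)
```

With these toy vectors, caption-only scoring returns the unrelated hiking entry, while scoring that also consults the stored image vector returns the dog photo. Taking the maximum over modalities is one simple fusion choice among several, picked here only to keep the sketch short.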

Evidence

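MuRAG improves on the best-performing text-only memory by +11.85% overall F1, +12.29% on visual-centric search (VS), and +29.06% on Test-Time Learning (TTL).
Full Memory (Multimodal) scores -8.08% overall F1 relative to Full Memory (Text) and -51.85% relative to MuRAG. Piling up images indiscriminately backfires.
Going from K=10 to K=20 raises MuRAG's Recall@K from 86% to 92%, but overall QA F1 dips slightly or stays flat.
Across all 13 memory systems, the best Conflict Detection F1 is around 0.37, barely above a random baseline.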
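
The Recall@K versus Precision trade-off described above can be made concrete with a small worked example. The function name `precision_recall_at_k`, the memory IDs, and the ranking are invented for illustration and are not data from the paper.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(ranked_ids: List[str], relevant_ids: Set[str],
                          k: int) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for a single query."""
    top_k = ranked_ids[:k]
    hits = sum(1 for memory_id in top_k if memory_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Toy ranking: three relevant memories scattered among eight retrieved ones.
ranked = ["m3", "m7", "m1", "m9", "m4", "m2", "m8", "m5"]
relevant = {"m3", "m1", "m5"}

for k in (1, 4, 8):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"K={k}: precision={p:.3f}, recall={r:.3f}")
```

In this toy ranking, recall climbs from 1/3 to 1.0 as K grows from 1 to 8, while precision falls from 1.0 to 0.375. A larger share of what reaches the model is then irrelevant, which is one plausible reading of why QA F1 stops improving at larger K.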

How to Apply

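When building RAG-based memory, do not store images as captions only; also index the original images as embedding vectors. The difference is largest when a user refers back to 'that photo I showed you last time'.
Rather than raising the retrieval K indiscriminately, keep it fixed at around K=10 and first try adding a precision-oriented reranking layer.
When adding a correction feature to a chatbot (e.g., 'What I said earlier was wrong, actually...'), implement separate logic that explicitly deletes or updates the existing memory entry instead of simply appending. Every current system fails at this.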
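
The last recommendation (explicit update instead of append) might look like the following minimal sketch. It assumes facts have already been extracted from the dialogue into (subject, attribute, value) form, which in practice is the hard part. `Fact` and `KeyedMemory` are hypothetical names and do not correspond to any system evaluated in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Fact:
    value: str
    session: int
    superseded: List[str] = field(default_factory=list)  # older values, kept for auditing


class KeyedMemory:
    """Toy memory keyed by (subject, attribute); illustrative only."""

    def __init__(self) -> None:
        self.facts: Dict[Tuple[str, str], Fact] = {}

    def write(self, subject: str, attribute: str, value: str, session: int) -> bool:
        """Store a fact. Returns True if it conflicted with an existing value."""
        key = (subject, attribute)
        old = self.facts.get(key)
        if old is None:
            self.facts[key] = Fact(value, session)
            return False
        if old.value == value:
            return False  # same information repeated, nothing to resolve
        # Conflict detected: replace the entry in place rather than appending a
        # second, contradictory entry that retrieval could surface later.
        self.facts[key] = Fact(value, session, old.superseded + [old.value])
        return True

    def read(self, subject: str, attribute: str) -> Optional[str]:
        fact = self.facts.get((subject, attribute))
        return fact.value if fact else None


mem = KeyedMemory()
first_write_conflicted = mem.write("user", "dog_breed", "corgi", session=1)
correction_conflicted = mem.write("user", "dog_breed", "shiba inu", session=5)
print(first_write_conflicted, correction_conflicted, mem.read("user", "dog_breed"))
```

The boolean returned by `write` acts as a simple conflict-detection signal, and the in-place replacement is the resolution step. Keeping superseded values in a side list is a design choice that preserves history without exposing stale facts to retrieval.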

Code Example

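# Sketch of a Mem-Gallery-style evaluation setup (based on MemEngine)
# https://github.com/nuster1128/MemEngine
# NOTE: MemoryAgent, MuRAGConfig, and their parameters are illustrative names;
# check the MemEngine repository for the actual API before running.

from memengine import MemoryAgent, MuRAGConfig

# Initialize the multimodal memory agent
config = MuRAGConfig(
    embedding_model="GME-Qwen2-VL-2B-Instruct",
    retrieval_k=10,  # QA performance saturates around K=10
    store_raw_images=True,  # keep image embeddings too, not just captions
)
agent = MemoryAgent(backbone="Qwen2.5-VL-7B-Instruct", memory_config=config)

# Stream the conversation session by session
# (multi_session_conversation is a placeholder for the loaded benchmark data)
for session in multi_session_conversation:
    for turn in session.turns:
        agent.observe(text=turn.text, image=turn.image)  # multimodal input

# Query at evaluation time
# (current_image is a placeholder for the image attached to the query)
result = agent.query(
    question="What breed was the dog in the photo I showed you last time?",
    query_image=current_image,  # include a visual query as well
)
print(result)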

Terminology

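MLLM: A large language model that understands images as well as text. Think of GPT-4V, Gemini, or Qwen-VL as an AI that also has eyes.

Multi-session Memory: The ability to remember information across multiple conversation sessions, such as recalling in next week's conversation what was discussed today.

Test-Time Learning: The ability to adapt on the fly to new examples at inference time without changing model parameters. It is similar to looking at a worked example during an exam and then solving a new problem.

MuRAG: Multimodal Retrieval-Augmented Generation. A method that encodes images and text together as vectors, then retrieves similar memories to produce an answer.

Knowledge Resolution: The ability to correctly update memory when the user revises earlier information during a conversation, that is, properly applying 'What I said earlier was wrong.'

Conflict Detection: The ability to detect whether newly arriving information conflicts with existing memory, that is, noticing 'This person said A last time but is saying B now.'

Answer Refusal: The ability to appropriately decline to answer when the conversation history does not contain the needed information. It is directly tied to preventing hallucination (making up memories that do not exist).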

Original Abstract

Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.