MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
TL;DR Highlight
A benchmark measuring how memory degrades in 20 multimodal AIs including GPT-4o during long conversations — plus simple solutions.
Who Should Read
Engineers building multimodal AI chatbots or conversational AI services, especially those who want to improve a model's memory and reasoning performance over extended conversations.
Core Mechanics
- Released MMRC benchmark with 5,120 real user conversations and 28,720 manually labeled questions — uses actual user conversations, not GPT-generated ones
- 6 evaluation dimensions: Information Extraction (IE), Multi-turn Reasoning (CR), Information Update (IU), Image Management (IM), Long-term Memory Recall (MR), Answer Refusal (AR)
- GPT-4o scored only 4.08 — far below human level (4.81); open-source model average was 2.55, dropping dramatically in real conversations
- 4 common failure patterns: memory loss mid-conversation, failure to reflect updated information, errors propagating to later turns, generating forced answers instead of refusing
- Paradoxically, larger models are worse than smaller ones at refusing to answer when the required information was never mentioned
- NOTE-TAKING strategy: recording key information in JSON format in real-time during conversations boosted LLaVA-1.5-7B memory recall from 0.22 to 2.36
Evidence
- GPT-4o MMRC overall score 4.08 vs human 4.81. Open-source model average 2.55, about half of human level
- NOTE-TAKING on LLaVA-1.5-7B: Memory Recall (MR) 0.22 to 2.36 (+2.14), Information Extraction (IE): 0.91 to 2.57 (+1.66)
- IE score 4.89 at 4-6 turns, drops to 1.07 at 20-22 turns — memory degrades sharply as conversations lengthen
- Text-based conversations scored 26.3% higher than image-based; memory-related abilities 34.6% higher
How to Apply
- For long-turn chatbots: apply NOTE-TAKING pattern — have a separate LLM extract key info as JSON each time a user message arrives, and include this note in context when responding.
- For image-containing conversations in multimodal chatbots: summarize image content as text and record in notes — improves image management score.
- For models that generate forced answers instead of refusing: add an explicit instruction to the system prompt, e.g. 'If the information is not mentioned in the conversation, you must refuse to answer.'
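The refusal instruction above can be wired into the request as a system message. A minimal sketch: the prompt wording and the `build_messages` helper below are illustrations, not the paper's verbatim prompt.

```python
# Hypothetical refusal-first system prompt (wording is an assumption,
# not copied from the MMRC paper).
REFUSAL_SYSTEM_PROMPT = (
    "You are a helpful multimodal assistant. Answer ONLY from information "
    "that appears in the conversation so far. If the answer is not mentioned "
    "in the conversation, you must refuse to answer rather than guess."
)

def build_messages(history, question):
    """Prepend the refusal instruction, then the history, then the new question."""
    return (
        [{"role": "system", "content": REFUSAL_SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": question}]
    )
```

The resulting message list can be passed directly to a chat-completions style API; putting the instruction first keeps it in effect for every turn.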
Code Example
# NOTE-TAKING strategy implementation example
from openai import OpenAI

SYSTEM_PROMPT_NOTETAKER = """
You are an intelligent chatbot designed to accurately record information from conversations in JSON format.
- Focus on the user's preferences and the facts within the conversation.
- If the user presents a new fact, it should overwrite the outdated fact.
- REMEMBER NOT TO RESPOND TO THE USER'S INPUT!
"""

USER_PROMPT_NOTETAKER = """
Extract key information from the input, including the user's preferences, events, and facts.
If you detect a change in information, overwrite the outdated content.
Here is the user input: {user_input}
Provide the Note in JSON format, for example:
{{
    "user_info": {{
        "health_conditions": ["High blood pressure", "Dairy allergy"]
    }},
    "preferences": {{
        "food_preference": "Italian cuisine"
    }},
    "purpose": "Seeking meal suggestions."
}}
"""

def take_note(client, user_input, current_note):
    """Extract key information from the user message and update the note."""
    prompt = USER_PROMPT_NOTETAKER.format(user_input=user_input)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT_NOTETAKER},
            {"role": "user", "content": f"Current note: {current_note}\n\n{prompt}"},
        ],
    )
    return response.choices[0].message.content

def chat_with_note(client, history, question, note):
    """Generate a response with the note included as context."""
    augmented_history = history + [
        {"role": "system", "content": f"[Conversation Notes]\n{note}"}
    ]
    augmented_history.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=augmented_history,
    )
    return response.choices[0].message.content

# Usage example
client = OpenAI()
note = "{}"   # running note, kept as a JSON string
history = []
for user_message in conversation_turns:  # conversation_turns: list of user message strings
    # 1. Update the note
    note = take_note(client, user_message, note)
    # 2. Generate a response using the note as context
    response = chat_with_note(client, history, user_message, note)
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": response})
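For image-containing conversations, the same note can carry text summaries of images (as suggested under How to Apply). A minimal sketch, assuming the note is a JSON string; the helper name and the "images" schema are assumptions, not from the paper. In practice, the caption would come from asking the MLLM to describe the image when it is uploaded.

```python
import json

def add_image_summary(note_json: str, image_id: str, caption: str) -> str:
    """Store a text summary of an image in the running JSON note.

    note_json: the current note as a JSON string (e.g. the output of take_note).
    image_id, caption: hypothetical fields; the caption is the model's text
    description of the uploaded image.
    """
    note = json.loads(note_json) if note_json.strip() else {}
    # Keep image summaries under a dedicated key so they survive note updates.
    note.setdefault("images", {})[image_id] = caption
    return json.dumps(note, ensure_ascii=False)
```

Recording images as text this way lets the responder model recall image content on later turns without re-sending the image itself.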
Original Abstract
Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.