MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
TL;DR Highlight
A benchmark measuring how memory degrades in 20 multimodal AIs including GPT-4o during long conversations — plus simple solutions.
Who Should Read
Engineers building multimodal AI chatbots or conversational AI services, especially those who want to improve a model's memory and reasoning performance over extended conversations.
Core Mechanics
- Released MMRC benchmark with 5,120 real user conversations and 28,720 manually labeled questions — uses actual user conversations, not GPT-generated ones
- 6 evaluation dimensions: Information Extraction (IE), Multi-turn Reasoning (CR), Information Update (IU), Image Management (IM), Long-term Memory Recall (MR), Answer Refusal (AR)
- GPT-4o scored only 4.08 — far below human level (4.81); open-source model average was 2.55, dropping dramatically in real conversations
- 4 common failure patterns: memory loss mid-conversation, failure to reflect updated information, errors propagating to later turns, generating forced answers instead of refusing
- Paradoxically, larger models are worse than smaller ones at refusing to answer when the required information was never mentioned
- NOTE-TAKING strategy: recording key information in JSON format in real-time during conversations boosted LLaVA-1.5-7B memory recall from 0.22 to 2.36
Evidence
- GPT-4o MMRC overall score 4.08 vs human 4.81. Open-source model average 2.55, about half of human level
- NOTE-TAKING on LLaVA-1.5-7B: Memory Recall (MR) 0.22 to 2.36 (+2.14), Information Extraction (IE): 0.91 to 2.57 (+1.66)
- IE score 4.89 at 4-6 turns, drops to 1.07 at 20-22 turns — memory degrades sharply as conversations lengthen
- Text-based conversations scored 26.3% higher than image-based; memory-related abilities 34.6% higher
How to Apply
- For long-turn chatbots: apply NOTE-TAKING pattern — have a separate LLM extract key info as JSON each time a user message arrives, and include this note in context when responding.
- For image-containing conversations in multimodal chatbots: summarize image content as text and record in notes — improves image management score.
- For models that generate forced answers instead of refusing: add an explicit instruction to the system prompt, e.g. 'If the information is not mentioned in the conversation, you must refuse to answer.'
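The refusal instruction above can be wired into the request as a system message. A minimal sketch: the prompt wording and the `build_messages` helper below are illustrations, not the paper's verbatim prompt.

```python
# Hypothetical refusal-first system prompt (wording is an assumption,
# not copied from the MMRC paper).
REFUSAL_SYSTEM_PROMPT = (
    "You are a helpful multimodal assistant. Answer ONLY from information "
    "that appears in the conversation so far. If the answer is not mentioned "
    "in the conversation, you must refuse to answer rather than guess."
)

def build_messages(history, question):
    """Prepend the refusal instruction, then the history, then the new question."""
    return (
        [{"role": "system", "content": REFUSAL_SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": question}]
    )
```

The resulting message list can be passed directly to a chat-completions style API; putting the instruction first keeps it in effect for every turn.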
Code Example
# NOTE-TAKING strategy implementation example
from openai import OpenAI

SYSTEM_PROMPT_NOTETAKER = """
You are an intelligent chatbot designed to accurately record information from conversations in JSON format.
- Focus on the user's preferences and the facts within the conversation.
- If the user presents a new fact, it should overwrite the outdated fact.
- REMEMBER NOT TO RESPOND TO THE USER'S INPUT!
"""

USER_PROMPT_NOTETAKER = """
Extract key information from the input, including the user's preferences, events, and facts.
If you detect a change in information, overwrite the outdated content.
Here is the user input: {user_input}
Provide the Note in JSON format, for example:
{{
    "user_info": {{
        "health_conditions": ["High blood pressure", "Dairy allergy"]
    }},
    "preferences": {{
        "food_preference": "Italian cuisine"
    }},
    "purpose": "Seeking meal suggestions."
}}
"""

def take_note(client, user_input, current_note):
    """Extract key information from the user message and update the note."""
    prompt = USER_PROMPT_NOTETAKER.format(user_input=user_input)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT_NOTETAKER},
            {"role": "user", "content": f"Current note: {current_note}\n\n{prompt}"},
        ],
    )
    return response.choices[0].message.content

def chat_with_note(client, history, question, note):
    """Generate a response with the note included as context."""
    augmented_history = history + [
        {"role": "system", "content": f"[Conversation Notes]\n{note}"}
    ]
    augmented_history.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=augmented_history,
    )
    return response.choices[0].message.content

# Usage example
client = OpenAI()
note = "{}"   # running note, kept as a JSON string
history = []
for user_message in conversation_turns:  # conversation_turns: list of user message strings
    # 1. Update the note
    note = take_note(client, user_message, note)
    # 2. Generate a response using the note as context
    response = chat_with_note(client, history, user_message, note)
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": response})
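For image-containing conversations, the same note can carry text summaries of images (as suggested under How to Apply). A minimal sketch, assuming the note is a JSON string; the helper name and the "images" schema are assumptions, not from the paper. In practice, the caption would come from asking the MLLM to describe the image when it is uploaded.

```python
import json

def add_image_summary(note_json: str, image_id: str, caption: str) -> str:
    """Store a text summary of an image in the running JSON note.

    note_json: the current note as a JSON string (e.g. the output of take_note).
    image_id, caption: hypothetical fields; the caption is the model's text
    description of the uploaded image.
    """
    note = json.loads(note_json) if note_json.strip() else {}
    # Keep image summaries under a dedicated key so they survive note updates.
    note.setdefault("images", {})[image_id] = caption
    return json.dumps(note, ensure_ascii=False)
```

Recording images as text this way lets the responder model recall image content on later turns without re-sending the image itself.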
Original Abstract
Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.