A Survey on Multi-Turn Interaction Capabilities of Large Language Models
TL;DR Highlight
A comprehensive survey on methods to evaluate and improve LLM multi-turn dialogue capabilities — from chatbots all the way to agents.
Who Should Read
Backend/ML engineers developing AI agents or chatbots who care about multi-turn dialogue quality, memory management, and tool use strategies.
Core Mechanics
- Multi-turn evaluation has evolved from MT-Bench to MT-Bench-101; LLM-as-a-Judge with GPT-4 as evaluator is now the standard
- 5 user interaction patterns: instruction clarification, extension, constraint addition, modification, and global consistency maintenance — fine-tuning around these patterns is effective
- External memory (summary trees, hash-based storage, hierarchical aggregation trees) is a practical workaround for context length limits
- Self-correction via prompting alone is insufficient — needs online RL training like SCoRe to actually work; offline SFT data causes distribution mismatch
- On multi-turn reasoning benchmark WILT, even the best model only achieves 28% accuracy — multi-turn logical reasoning is still a major LLM weakness
- RLHF extensions for multi-turn settings (ArCHer, MTPO, REFUEL, multi-turn DPO) have emerged, using hierarchical RL, mirror descent, and direct Q-value regression
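As a rough illustration of the trajectory-level preference objective that multi-turn DPO variants share, here is a minimal sketch of a sequence-level DPO loss. All names and the `beta` value are illustrative, not from the survey; real implementations compute summed token log-probabilities of whole multi-turn trajectories under a trainable policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen_policy: float, logp_rejected_policy: float,
             logp_chosen_ref: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Sequence-level DPO loss over two multi-turn trajectories.

    Each logp_* is the summed token log-probability of a full
    trajectory under the policy or the frozen reference model.
    """
    margin = beta * ((logp_chosen_policy - logp_chosen_ref)
                     - (logp_rejected_policy - logp_rejected_ref))
    # -log(sigmoid(margin)): minimized when the policy assigns the
    # chosen trajectory a much higher relative likelihood.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy already prefers the chosen trajectory, the loss is small; when it prefers the rejected one, the loss grows, pushing probability mass toward the preferred multi-turn behavior.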
Evidence
- WILT benchmark top model accuracy 28% — quantifying serious limits in multi-turn inductive logical reasoning
- MT-Bench-101: 13 subtasks, 3-tier classification — the most granular multi-turn evaluation framework
- LOCOMO benchmark: in 600-turn, 32-session conversations, both long-context LLMs and RAG models significantly lagged humans on temporal reasoning
- CodeGen experiments: multi-turn program synthesis consistently outperformed single-turn across all model sizes
How to Apply
- Adding memory to chatbots: use the ChatGPT Memory pattern — store conversation highlights via hash after each turn, retrieve relevant memories and inject next turn.
- When agents receive ambiguous requests: design DPO training data to prefer asking clarifying questions over generating answers — reduces unnecessary wrong outputs.
- In multi-turn code generation pipelines: insert CoT prompts (problem attribute prediction, natural language solution generation) before execution feedback for better performance.
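For the clarifying-question point above, DPO training data can be as simple as preference pairs over responses to the same ambiguous prompt. A minimal sketch (the record layout mirrors the common prompt/chosen/rejected convention; the example strings are hypothetical):

```python
def make_clarification_pair(ambiguous_request: str,
                            clarifying_question: str,
                            premature_answer: str) -> dict:
    """Build one DPO preference record: for an ambiguous request,
    prefer asking a clarifying question over guessing an answer."""
    return {
        "prompt": ambiguous_request,
        "chosen": clarifying_question,  # preferred behavior
        "rejected": premature_answer,   # dispreferred behavior
    }


pair = make_clarification_pair(
    "Fix my deployment script.",
    "Which platform are you deploying to, and what error do you see?",
    "Here is a rewritten script: ...",
)
```

Training on many such pairs biases the model toward eliciting missing constraints before committing to an answer.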
Code Example
```python
# Think-in-Memory pattern — save/retrieve key points in multi-turn conversations
import hashlib


class TurnMemory:
    def __init__(self):
        self.memory = {}  # hash -> thought

    def save_thought(self, thought: str) -> str:
        key = hashlib.md5(thought.encode()).hexdigest()[:8]
        self.memory[key] = thought
        return key

    def retrieve_relevant(self, query: str, top_k: int = 3) -> list[str]:
        # In practice, use embedding similarity search instead of
        # returning the first stored thoughts
        return list(self.memory.values())[:top_k]


memory = TurnMemory()


def chat_with_memory(user_input: str, llm_call):
    # Retrieve relevant memories
    relevant = memory.retrieve_relevant(user_input)
    context = "\n".join(f"- {t}" for t in relevant)
    # Prompt with memory included
    prompt = f"Previous context:\n{context}\n\nUser: {user_input}\nAssistant:"
    response = llm_call(prompt)
    # Save the key point from this turn to memory
    summary_prompt = f"One key point from this exchange: {user_input} -> {response}"
    thought = llm_call(summary_prompt)
    memory.save_thought(thought)
    return response
```
Original Abstract
Multi-turn interaction in the dialogue system research refers to a system's ability to maintain context across multiple dialogue turns, enabling it to generate coherent and contextually relevant responses. Recent advancements in large language models (LLMs) have significantly expanded the scope of multi-turn interaction, moving beyond chatbots to enable more dynamic agentic interactions with users or environments. In this paper, we provide a focused review of the multi-turn capabilities of LLMs, which are critical for a wide range of downstream applications, including conversational search and recommendation, consultation services, and interactive tutoring. This survey explores four key aspects: (1) the core model capabilities that contribute to effective multi-turn interaction, (2) how multi-turn interaction is evaluated in current practice, (3) the general algorithms used to enhance multi-turn interaction, and (4) potential future directions for research in this field.