A Survey on Multi-Turn Interaction Capabilities of Large Language Models
TL;DR Highlight
A comprehensive survey on methods to evaluate and improve LLM multi-turn dialogue capabilities — from chatbots all the way to agents.
Who Should Read
Backend/ML engineers developing AI agents or chatbots who care about multi-turn dialogue quality, memory management, and tool use strategies.
Core Mechanics
- Multi-turn evaluation has evolved from MT-Bench to MT-Bench-101; LLM-as-a-Judge with GPT-4 as evaluator is now the standard
- 5 user interaction patterns: instruction clarification, extension, constraint addition, modification, and global consistency maintenance — fine-tuning around these patterns is effective
- External memory (summary trees, hash-based storage, hierarchical aggregation trees) is a practical workaround for context length limits
- Self-correction via prompting alone is insufficient — needs online RL training like SCoRe to actually work; offline SFT data causes distribution mismatch
- On multi-turn reasoning benchmark WILT, even the best model only achieves 28% accuracy — multi-turn logical reasoning is still a major LLM weakness
- RLHF extensions for multi-turn settings (ArCHer, MTPO, REFUEL, multi-turn DPO) have emerged, using hierarchical RL, mirror descent, and direct Q-value regression
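As a rough illustration of the trajectory-level preference objective that multi-turn DPO variants share, here is a minimal sketch of a sequence-level DPO loss. All names and the `beta` value are illustrative, not from the survey; real implementations compute summed token log-probabilities of whole multi-turn trajectories under a trainable policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen_policy: float, logp_rejected_policy: float,
             logp_chosen_ref: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Sequence-level DPO loss over two multi-turn trajectories.

    Each logp_* is the summed token log-probability of a full
    trajectory under the policy or the frozen reference model.
    """
    margin = beta * ((logp_chosen_policy - logp_chosen_ref)
                     - (logp_rejected_policy - logp_rejected_ref))
    # -log(sigmoid(margin)): minimized when the policy assigns the
    # chosen trajectory a much higher relative likelihood.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy already prefers the chosen trajectory, the loss is small; when it prefers the rejected one, the loss grows, pushing probability mass toward the preferred multi-turn behavior.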
Evidence
- WILT benchmark top model accuracy 28% — quantifying serious limits in multi-turn inductive logical reasoning
- MT-Bench-101: 13 subtasks, 3-tier classification — the most granular multi-turn evaluation framework
- LOCOMO benchmark: in 600-turn, 32-session conversations, both long-context LLMs and RAG models significantly lagged humans on temporal reasoning
- CodeGen experiments: multi-turn program synthesis consistently outperformed single-turn across all model sizes
How to Apply
- Adding memory to chatbots: use the ChatGPT Memory pattern — store conversation highlights via hash after each turn, retrieve relevant memories and inject next turn.
- When agents receive ambiguous requests: design DPO training data to prefer asking clarifying questions over generating answers — reduces unnecessary wrong outputs.
- In multi-turn code generation pipelines: insert CoT prompts (problem attribute prediction, natural language solution generation) before execution feedback for better performance.
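For the clarifying-question point above, DPO training data can be as simple as preference pairs over responses to the same ambiguous prompt. A minimal sketch (the record layout mirrors the common prompt/chosen/rejected convention; the example strings are hypothetical):

```python
def make_clarification_pair(ambiguous_request: str,
                            clarifying_question: str,
                            premature_answer: str) -> dict:
    """Build one DPO preference record: for an ambiguous request,
    prefer asking a clarifying question over guessing an answer."""
    return {
        "prompt": ambiguous_request,
        "chosen": clarifying_question,  # preferred behavior
        "rejected": premature_answer,   # dispreferred behavior
    }


pair = make_clarification_pair(
    "Fix my deployment script.",
    "Which platform are you deploying to, and what error do you see?",
    "Here is a rewritten script: ...",
)
```

Training on many such pairs biases the model toward eliciting missing constraints before committing to an answer.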
Code Example
```python
# Think-in-Memory pattern — save/retrieve key points in multi-turn conversations
import hashlib


class TurnMemory:
    def __init__(self):
        self.memory = {}  # hash -> thought

    def save_thought(self, thought: str) -> str:
        key = hashlib.md5(thought.encode()).hexdigest()[:8]
        self.memory[key] = thought
        return key

    def retrieve_relevant(self, query: str, top_k: int = 3) -> list[str]:
        # In practice, use embedding similarity search instead of
        # returning the first stored thoughts
        return list(self.memory.values())[:top_k]


memory = TurnMemory()


def chat_with_memory(user_input: str, llm_call):
    # Retrieve relevant memories
    relevant = memory.retrieve_relevant(user_input)
    context = "\n".join(f"- {t}" for t in relevant)
    # Prompt with memory included
    prompt = f"Previous context:\n{context}\n\nUser: {user_input}\nAssistant:"
    response = llm_call(prompt)
    # Save the key point from this turn to memory
    summary_prompt = f"One key point from this exchange: {user_input} -> {response}"
    thought = llm_call(summary_prompt)
    memory.save_thought(thought)
    return response
```
Original Abstract
Multi-turn interaction in the dialogue system research refers to a system's ability to maintain context across multiple dialogue turns, enabling it to generate coherent and contextually relevant responses. Recent advancements in large language models (LLMs) have significantly expanded the scope of multi-turn interaction, moving beyond chatbots to enable more dynamic agentic interactions with users or environments. In this paper, we provide a focused review of the multi-turn capabilities of LLMs, which are critical for a wide range of downstream applications, including conversational search and recommendation, consultation services, and interactive tutoring. This survey explores four key aspects: (1) the core model capabilities that contribute to effective multi-turn interaction, (2) how multi-turn interaction is evaluated in current practice, (3) the general algorithms used to enhance multi-turn interaction, and (4) potential future directions for research in this field.