Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Apr 7, 2025 • Yubo Li, Xiaobin Shen, X. Yao +4

TL;DR Highlight

A comprehensive survey that brings together evaluation benchmarks, improvement methods, and real-world applications for multi-turn LLM conversations in one place.

Who Should Read

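AI developers and ML engineers who build multi-turn conversational systems such as chatbots, tutoring systems, and medical-consultation AI, or who design evaluation pipelines for them. Especially relevant to teams running production services where maintaining context matters beyond simple Q&A.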

Core Mechanics

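LLM performance drops sharply in multi-turn settings - According to the MT-Eval analysis, most LLMs show a significant accuracy drop in multi-turn settings compared with single-turn ones, and errors accumulate as turns progress
Standard RLHF/SFT has limited effect in multi-turn settings - MT-Bench-101 (which evaluates 21 LLMs) found that common alignment techniques do not guarantee multi-turn consistency
Multi-turn jailbreaks are far more dangerous than single-turn ones - Attacks such as the 'Crescendo' strategy, which stack harmless-looking questions step by step to bypass guardrails, are hard to stop with existing filters, and current defenses face a security-usability trade-off
LLM-as-a-judge agrees with humans more than 80% of the time but is biased - Three main biases have been identified: Preference Leakage (favoring models from the same family), Contextual Sensitivity, and Reference Dependence
Multi-turn-specific RL is needed - Multi-turn RL methods such as ArCHer, SCoRe, and DMPO, which optimize the whole conversation trajectory, are more effective in multi-turn scenarios than single-turn DPO/PPO
Combining external memory, RAG, and knowledge graphs is key to maintaining quality in long conversations - External integration approaches such as MemPrompt, LongMemEval, and GNN-RAG are emerging as practical strategies for working around context limits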
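
The Preference Leakage bias noted above suggests keeping the judge outside the model family being evaluated. The following is a minimal sketch of that selection rule, not a method from the survey: the `MODEL_FAMILY` table, its entries, and the `pick_judge` helper are all illustrative assumptions.

```python
# Hypothetical mapping from model name to model family. The entries are
# illustrative assumptions and are not taken from the survey.
MODEL_FAMILY = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "claude-3-5-sonnet": "anthropic",
    "llama-3-70b-instruct": "meta",
}


def pick_judge(candidate: str, judges: list[str]) -> str:
    """Return the first judge whose family differs from the candidate's."""
    candidate_family = MODEL_FAMILY[candidate]
    for judge in judges:
        if MODEL_FAMILY[judge] != candidate_family:
            return judge
    raise ValueError(f"no judge outside family {candidate_family!r}")


# The same-family judge is skipped in favor of the cross-family one.
print(pick_judge("gpt-4o-mini", ["gpt-4o", "claude-3-5-sonnet"]))
```

Raising an error when no cross-family judge is available, rather than silently falling back to a same-family judge, keeps the bias from creeping back in unnoticed.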

Evidence

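MathChat-Agent framework: multi-turn LLM collaboration improves accuracy on competition math problems by about 6% (versus plain single-turn CoT)
MINT benchmark: adding tool use or feedback turns raises problem-solving success rates by a further 1-17%
GPT-4-based multi-turn jailbreak (Crescendo): major models such as GPT-3.5, GPT-4, and Med-PaLM2 show degraded performance on MedFuzz-modified benchmarks
InstructGPT 1.3B with RLHF scored higher than the 175B GPT-3 in human evaluation - alignment training can reverse the quality ranking even with over 100x fewer parameters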
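
The feedback-turn result can be pictured as a retry loop in which each failed attempt is followed by a feedback message that becomes the next user turn. The sketch below is only an illustration of that interaction pattern, not the MINT protocol itself: `solve_with_feedback`, `toy_propose`, and `toy_check` are hypothetical names, and the toy "model" is a stand-in for a real LLM call.

```python
from typing import Callable, Optional


def solve_with_feedback(
    task: str,
    propose: Callable[[list[dict]], str],
    check: Callable[[str], tuple[bool, str]],
    max_turns: int = 3,
) -> tuple[Optional[str], int]:
    """Retry a task, feeding checker feedback back as the next user turn.

    Returns (answer, turns_used); answer is None if no attempt passed.
    """
    messages = [{"role": "user", "content": task}]
    for turn in range(1, max_turns + 1):
        answer = propose(messages)
        ok, feedback = check(answer)
        if ok:
            return answer, turn
        # Record the failed attempt and the feedback so the next attempt sees both
        messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": f"Feedback: {feedback}"})
    return None, max_turns


# Toy stand-ins (hypothetical): the "model" starts at 40 and adds one per feedback turn.
def toy_propose(messages: list[dict]) -> str:
    n_feedback = sum(
        m["role"] == "user" and m["content"].startswith("Feedback:") for m in messages
    )
    return str(40 + n_feedback)


def toy_check(answer: str) -> tuple[bool, str]:
    return answer == "42", "too low, try a larger number"


print(solve_with_feedback("What is 6 * 7?", toy_propose, toy_check))
```

With a single turn the toy model fails, while three turns are enough for it to reach the correct answer, which mirrors the idea that extra feedback turns can raise the success rate.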

How to Apply

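When building a multi-turn evaluation pipeline: choose a task-specific benchmark such as MT-Bench (general), MathChat-Bench (math), or InterCode (coding), and when using LLM-as-a-judge, assign a judge from a different training lineage than the model under evaluation to guard against Preference Leakage
When a RAG-based chatbot needs to keep context across long conversations: adopt a memory-augmentation approach such as MemTree or Position, or periodically generate a conversation summary (COMEDY-style) and inject it into the next turn's input as a compressed memo to reduce context drift
For services that need multi-turn jailbreak defense: NeMoGuardrails alone produces many false positives, so combine CoT-based refusal decisions with a hardened system prompt, and add logic that judges the entire accumulated conversation as context rather than each turn in isolation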
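
To make the "judge the accumulated conversation" advice concrete, here is a minimal sketch comparing a per-turn check with a cumulative one. The keyword table and integer weights are toy placeholders chosen for illustration. They are assumptions, not how NeMoGuardrails or any defense in the survey works, and a real system would call a classifier or an LLM in place of `risk_score`.

```python
# Toy risk signals with integer weights (hypothetical values for illustration only).
RISK_TERMS = {"bypass": 4, "lock": 3, "without a key": 4}
THRESHOLD = 8


def risk_score(text: str) -> int:
    """Sum the weights of the risk terms that occur in the text, capped at 10."""
    lowered = text.lower()
    return min(10, sum(w for term, w in RISK_TERMS.items() if term in lowered))


def should_refuse(user_turns: list[str], threshold: int = THRESHOLD) -> bool:
    """Judge the whole conversation so far, not just the latest turn."""
    return risk_score(" ".join(user_turns)) >= threshold


turns = [
    "How does a pin tumbler lock work?",
    "What if the owner has to get in without a key?",
    "So how would someone bypass it?",
]

# Each turn looks harmless in isolation, but the accumulated context does not.
per_turn = [risk_score(t) >= THRESHOLD for t in turns]
cumulative = should_refuse(turns)
print(per_turn, cumulative)
```

The per-turn check passes every message, while the cumulative check flags the conversation, which is the gap that gradual escalation attacks rely on.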

Code Example


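# Summary-based memory injection for multi-turn chat (COMEDY-style)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
SYSTEM_PROMPT = "You are a friendly assistant."

def compress_history(history: list[dict], model: str = "gpt-4o-mini") -> str:
    """Condense the conversation history into a short memo."""
    history_text = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize only the key facts, user preferences, and unresolved requests from the following conversation in 3-5 lines."},
            {"role": "user", "content": history_text},
        ],
    )
    return response.choices[0].message.content

def multi_turn_chat(user_input: str, history: list[dict], compress_every: int = 5):
    """Answer one turn, folding older messages into a memo once the history exceeds compress_every exchanges."""
    history.append({"role": "user", "content": user_input})

    # Past the threshold, summarize everything except the two most recent messages
    if len(history) > compress_every * 2:
        memo = compress_history(history[:-2])
        # Store the memo in place of the older messages so later turns
        # do not re-summarize the full transcript on every call
        history = [
            {"role": "system", "content": f"[Summary of earlier conversation]\n{memo}"},
            *history[-2:],  # keep only the most recent messages verbatim
        ]

    # Always send the base system prompt, including after compression
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    assistant_msg = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg, history

# Usage example: the sixth message pushes the history past the threshold and triggers compression
history = []
for user_msg in [
    "Hi, I'm a Python beginner",
    "What is a list comprehension?",
    "Give me more examples",
    "What about dictionaries, then?",
    "Please show a real-world example",
    "Compare it with the list approach you mentioned earlier",
]:
    reply, history = multi_turn_chat(user_msg, history)
    print(f"User: {user_msg}")
    print(f"AI: {reply}\n")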

Terminology

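multi-turn interaction: A conversation in which a person and an AI exchange messages over several rounds instead of finishing after a single question and answer. Context carries over from message to message, as in a KakaoTalk chat.

RLHF: A training approach in which people score the AI's answers and the AI learns from those scores. Similar to a student revising an essay based on a teacher's feedback.

DPO: A technique that shows the AI pairs of better and worse answers so it learns directly which one is preferable. It is simpler to implement than PPO and has become widely used.

SFT: A training method that shows the model example answers and has it imitate them. Like cooking by following a recipe.

LLM-as-a-judge: An approach in which a strong LLM such as GPT-4, rather than a person, rates the quality of another AI's answers. Like handing the grading to a top student instead of the teacher.

RAG: A technique in which the AI searches external documents or databases at answer time and uses what it finds. Like an open-book exam, it looks up the information it needs on the spot.

jailbreak: An attack that bypasses an AI's safety mechanisms to extract harmful answers it should refuse. Similar to picking a lock without the key.

LoRA: A technique that tunes a model for a specific purpose by adding small adapter layers instead of retraining the whole model. As lightweight as swapping the case on a smartphone.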

Related Resources

Awesome-Multi-Turn-LLMs GitHub Repository

Original Abstract

Recent advances in large language models (LLMs) have substantially improved single-turn task performance, yet real-world applications increasingly demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent progress in evaluating and enhancing multi-turn LLM interactions. Centered on a task-oriented taxonomy (spanning instruction following in domains such as mathematics and coding, and conversational engagement in role-playing, healthcare, education, and adversarial jailbreak settings), we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness across prolonged dialogues. We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-context learning, supervised fine-tuning, reinforcement learning, and architectural innovations), external integration approaches (memory augmentation, retrieval-based methods, and knowledge graphs), and agent-based techniques for collaborative interaction. Finally, we identify open challenges and promising directions for future research to further improve the robustness and effectiveness of multi-turn LLM interactions.