Estimating LLM Confidence in Multi-Turn Conversations

Confidence Estimation for LLMs in Multi-turn Interactions

Jan 5, 2026 • Caiqi Zhang, Ruihan Yang, Xiaochen Zhu +5

TL;DR Highlight

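This is the first systematic measurement of whether a chatbot's stated confidence keeps tracking its actual accuracy as a conversation gets longer. The existing methods all performed poorly, and the newly proposed P(SUFFICIENT) does comparatively better.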

Who Should Read

Developers working out the criteria for when an LLM in an agent pipeline should act with confidence. ML engineers designing hallucination-detection logic for multi-turn chatbots or human-in-the-loop systems.

Core Mechanics

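Existing confidence estimation methods (verbalized, self-consistency, P(TRUE)) are all poorly calibrated in multi-turn conversations, with InfoECE around 40-80%.
The newly proposed P(SUFFICIENT) is a logit-based probe that asks "Is the information provided so far sufficient to uniquely determine this answer?" It performs best among the compared methods on both monotonicity and calibration.
With Llama3.1-70B, P(SUFFICIENT) reaches an InfoECE of 5.27% on the GUESS dataset, far below the other methods (up to 79.97%).
In the placebo (fake hint) experiment, only P(SUFFICIENT) lowered its confidence on uninformative turns, which suggests it tracks actual information rather than growing more confident simply because the turn count increased.
When multi-turn dialogues are compared against single-turn summaries, the accuracy difference is under 1%, but the confidence signal varies widely depending on the method.
Larger models (70B/72B) show a much higher monotonicity τ for P(SUFFICIENT): τ = 93.91% for Qwen2.5-72B on GUESS.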

Evidence

InfoECE of P(SUFFICIENT) with Llama3.1-70B on GUESS: 5.27% (versus 65.52% for VANILLA-VERB and 56.88% for SC)
Kendall's τ (against ground truth) for P(SUFFICIENT) with Llama3.1-70B: 91.62% on 20Q, 86.55% on GUESS, 85.90% on GRACE
P(TRUE) gains confidence even from placebo hints: +11.75 for Llama3.1-8B on GUESS and +14.61 for Qwen2.5-72B (p<10⁻⁶)
P(SUFFICIENT) instead drops under placebo hints: 14.27→2.97 for Llama3.1-70B on GUESS (p<0.05)
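
A placebo check of this kind can be approximated by comparing each dialogue's confidence before and after an uninformative turn is appended. The sketch below is illustrative only: `mean_confidence_shift` is a hypothetical helper, the inputs are assumed to be paired per-dialogue confidences, and the significance testing reported above is not reproduced.

```python
from statistics import mean
from typing import Sequence


def mean_confidence_shift(before: Sequence[float], after: Sequence[float]) -> float:
    """Average change in confidence after appending a placebo (uninformative) turn.

    before[i] and after[i] must refer to the same dialogue.
    """
    if len(before) != len(after):
        raise ValueError("before and after must be paired, same length")
    if not before:
        raise ValueError("need at least one dialogue")
    return mean(a - b for b, a in zip(before, after))


if __name__ == "__main__":
    # Made-up numbers: a positive shift means confidence rose without new information.
    shift = mean_confidence_shift([0.40, 0.60], [0.50, 0.80])
    print(f"mean shift: {shift:+.2f}")
```

A shift near zero or below is the desirable outcome here, since a placebo turn adds no evidence.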

How to Apply

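When an agent decides whether to call a tool, replace the plain "Is this answer correct?" question (P(TRUE)) with "Is the information gathered so far sufficient to determine this answer?" to get a more reliable confidence signal.
In a RAG or clarification loop where the LLM must decide whether to request more information, add a P(SUFFICIENT)-style probe at every turn and prompt a follow-up question whenever confidence stays below a threshold.
When a multi-turn history is compressed into a single summary before being passed on, small models (8B class) become less well calibrated, so keeping the original turn structure is the safer choice.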
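
The clarification-loop idea above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `run_clarification_loop`, `probe`, `propose_answer`, and `ask_user` are hypothetical names, and the 0.5 threshold and the turn limit are arbitrary illustrative choices. `probe` stands in for any per-turn confidence function, such as a P(SUFFICIENT)-style probe.

```python
from typing import Callable, List, Tuple


def run_clarification_loop(
    probe: Callable[[List[str], str], float],
    propose_answer: Callable[[List[str]], str],
    ask_user: Callable[[List[str]], str],
    history: List[str],
    threshold: float = 0.5,
    max_turns: int = 5,
) -> Tuple[str, float, List[str]]:
    """Keep gathering information until the probe clears the threshold.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    history = list(history)  # do not mutate the caller's list
    answer = propose_answer(history)
    confidence = probe(history, answer)
    for _ in range(max_turns):
        if confidence >= threshold:
            break  # enough evidence: commit to the answer
        history.append(ask_user(history))  # gather one more turn of evidence
        answer = propose_answer(history)
        confidence = probe(history, answer)
    return answer, confidence, history


if __name__ == "__main__":
    # Toy stand-ins: confidence grows with the number of hints seen.
    hints = iter(["It is a fruit.", "It is yellow.", "Monkeys like it."])
    answer, conf, hist = run_clarification_loop(
        probe=lambda h, a: min(1.0, 0.25 * len(h)),
        propose_answer=lambda h: "banana",
        ask_user=lambda h: next(hints),
        history=[],
    )
    print(answer, conf, len(hist))
```

Passing the probe in as a callable keeps the loop independent of the model backend, so the threshold can be tuned without touching the probing code.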

Code Example

# Example: estimating confidence with a P(SUFFICIENT)-style probe
# Binary probe: do the dialogue and hints so far uniquely determine the answer?

PSUFFICIENT_PROMPT = """
{dialogue_history}

Based only on the information and hints provided above,
does that information sufficiently entail that the correct answer is exactly {answer}?

A. Yes — the information is sufficient to conclude {answer}.
B. No — the information is insufficient, allows alternatives, or contradicts {answer}.

Output format: **A** or **B** only (single uppercase letter; no spaces, punctuation, or explanation):
"""

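import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def get_p_sufficient(model, tokenizer, dialogue_history: str, answer: str) -> float:
    """Return P(A), renormalized over the two option tokens A and B."""
    prompt = PSUFFICIENT_PROMPT.format(
        dialogue_history=dialogue_history,
        answer=answer
    )
    # Move the inputs to the model's device so this also works on GPU.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Next-token logits at the last prompt position, cast to float32 for a stable softmax.
    logits = outputs.logits[0, -1, :].float()

    # Assumes "A" and "B" each encode to a single token; some tokenizers
    # prefer a leading-space variant (" A"), so verify this for your model.
    token_A = tokenizer.encode("A", add_special_tokens=False)[0]
    token_B = tokenizer.encode("B", add_special_tokens=False)[0]

    # Softmax over only the two option logits.
    probs = torch.softmax(logits[[token_A, token_B]], dim=0)
    return probs[0].item()  # P(A) = P(SUFFICIENT)

# Example usage
# confidence = get_p_sufficient(model, tokenizer, history, current_answer)
# if confidence < 0.5: have the agent ask a clarification question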

Terminology

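Confidence Estimation: A number between 0 and 1 expressing how likely the model thinks its own answer is to be correct. A good estimate is low when the answer is likely wrong and high when the model is sure.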

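Calibration: A model is well calibrated if, when it says it is 70% confident, it is right about 70% of the time. A model that always claims 99% confidence but is right only half the time is poorly calibrated.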
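
To make the calibration idea concrete, here is a minimal sketch of the standard binned Expected Calibration Error (ECE). It is the plain single-turn ECE, not the paper's InfoECE, whose exact length normalization is not spelled out in this summary. The function name and the choice of 10 equal-width bins are assumptions for illustration.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # Clamp so that 1.0 lands in the last bin and stray negatives in the first.
        idx = max(0, min(int(c * n_bins), n_bins - 1))
        bins[idx].append(i)
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # The "always 99% confident but right half the time" case from the definition.
    print(expected_calibration_error([0.99] * 4, [True, False, True, False]))
```

In the overconfident example, the single occupied bin has mean confidence 0.99 and accuracy 0.5, so the error is the gap between the two.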

InfoECE: A calibration error metric designed to compare conversations of different lengths fairly, as in multi-turn dialogue. Lower means better calibrated.

Monotonicity: The property that confidence keeps rising as hints accumulate over the conversation. An ideal agent should grow more confident as it gains information.

Kendall's τ: A rank correlation coefficient. Here it measures how consistently confidence rises as turns progress. The closer to 1, the closer to monotonically increasing.
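
As an illustration of the monotonicity measurement, the sketch below computes Kendall's tau-a between turn indices and per-turn confidences. Which tau variant and which reference ordering the paper uses is not stated in this summary, so the function and the sample values are assumptions. The figures reported above are written as percentages, which presumably correspond to τ × 100.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs. Ties count as neither."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1  # pair ordered the same way in x and y
        elif sign < 0:
            discordant += 1  # pair ordered in opposite ways
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    turns = [1, 2, 3, 4]
    confidences = [0.2, 0.4, 0.3, 0.9]  # one dip at turn 3
    print(round(kendall_tau(turns, confidences), 4))
```

In the sample, five of the six turn pairs are concordant and one is discordant, so the single dip at turn 3 pulls τ below 1.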

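Verbalized Confidence: Asking the model directly "How confident are you?" and having it state a number. Easy to use with a single prompt, but the number can differ from the model's actual internal confidence.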

Self-Consistency: Estimating confidence by sampling the same question several times and measuring how often the answers agree. If an answer appears in 16 of 20 samples, confidence = 0.8.
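
The 16-out-of-20 example can be reproduced with a few lines. This is a minimal sketch: `self_consistency_confidence` is a hypothetical helper, and the strip-and-lowercase normalization used to decide when two sampled answers "agree" is an assumption, since real systems often need a looser equivalence check.

```python
from collections import Counter
from typing import Sequence, Tuple


def self_consistency_confidence(samples: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    # Assumed normalization: ignore case and surrounding whitespace.
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


if __name__ == "__main__":
    sampled = ["Paris"] * 16 + ["Lyon"] * 4
    print(self_consistency_confidence(sampled))
```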

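Logit: The raw score the model computes for each token just before output. Softmax converts these scores into probabilities, so they offer a direct read on the model's internal confidence.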
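
The logit-to-probability step can be shown without any model. The sketch below uses made-up logit values and hypothetical function names. It also shows that renormalizing over just two options, as a binary A/B probe does, is equivalent to applying a sigmoid to the difference of the two logits.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def two_option_probability(logit_a: float, logit_b: float) -> float:
    """P(A) renormalized over two options; equals sigmoid(logit_a - logit_b)."""
    return softmax([logit_a, logit_b])[0]


if __name__ == "__main__":
    p_a = two_option_probability(3.0, 1.0)
    sigmoid = 1.0 / (1.0 + math.exp(-(3.0 - 1.0)))
    print(p_a, sigmoid)  # the two values agree up to floating-point error
```

Because only the difference between the two logits matters, probability mass the model assigns to other tokens does not affect the two-option score.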

Related Resources

https://arxiv.org/abs/2601.02179

Original Abstract

While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research dominantly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. Reliable confidence estimation in multi-turn settings is critical for many downstream applications, such as autonomous agents and human-in-the-loop systems. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. We propose P(Sufficient), a logit-based probe that achieves comparatively better performance, although the task remains far from solved. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.