How do LLMs Compute Verbal Confidence

Mar 18, 2026 • Dharshan Kumaran, Arthur Conmy, Federico Barbero +3

TL;DR Highlight

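A mechanistic interpretability study showing that when an LLM says it is "confident" or "not sure", the underlying confidence signal is computed automatically during answer generation and cached for later retrieval.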

Who Should Read

ML engineers who attach confidence scores to LLM outputs or rely on uncertainty estimation. Researchers and product developers asking whether an LLM knows when it is wrong.

Core Mechanics

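When asked for a confidence score, the model does not compute it at the moment of the request; it has already computed and cached confidence while generating the answer (the cached retrieval hypothesis).
In Gemma 3 27B, the newline token right after the answer (post-answer newline, PANL) is the key cache site. Confidence information is written there first at layers 21-25, then passed to the confidence-colon token (CC) at layers 30-35.
Verbal confidence is not just a summary of token log-probabilities (how strongly the model favored each token it generated). Log-probabilities explain only 8.4% of the variance in verbal confidence.
Injecting high- or low-confidence vectors at the PANL position via activation steering (adding an activation vector to shift model behavior) changes the confidence the model reports, which is evidence that this position carries real confidence information.
Attention blocking (cutting attention between specific tokens) traces the information flow: confidence passes from the answer tokens to PANL, then to CC.
The same pattern replicates across two models (Gemma 3 27B and Qwen 2.5 7B) and across both categorical and numeric prompt formats.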
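
To make the token positions and interventions above concrete, here is a minimal toy sketch. It is not the paper's code: plain Python lists stand in for a real residual stream, the token list is hand-written rather than produced by a tokenizer, and `find_panl_index`, `steer`, `blocked_causal_mask`, and the direction vector are hypothetical names and values chosen for illustration.

```python
# Toy illustration only: small lists stand in for a real model's residual stream.

def find_panl_index(tokens, answer_start):
    """Return the index of the first newline token at or after answer_start."""
    for i in range(answer_start, len(tokens)):
        if tokens[i] == "\n":
            return i
    raise ValueError("no newline token after the answer")


def steer(hidden, position, direction, alpha):
    """Return a copy of hidden with alpha * direction added at one position."""
    steered = [row[:] for row in hidden]
    steered[position] = [h + alpha * d for h, d in zip(hidden[position], direction)]
    return steered


def blocked_causal_mask(n_tokens, query_pos, key_pos):
    """Causal attention mask (True = may attend) with one query->key edge cut."""
    mask = [[k <= q for k in range(n_tokens)] for q in range(n_tokens)]
    mask[query_pos][key_pos] = False
    return mask


# Hand-written token sequence (an assumption, not real tokenizer output).
tokens = ["Q", ":", "capital", "?", "\n", "A", ":", "Paris", "\n", "Confidence", ":"]
answer_start = 7                                # index of "Paris"
panl = find_panl_index(tokens, answer_start)    # post-answer newline
cc = len(tokens) - 1                            # confidence-colon token

hidden = [[0.0, 0.0] for _ in tokens]           # fake 2-D hidden states
high_conf_direction = [1.0, -0.5]               # hypothetical steering vector
steered = steer(hidden, panl, high_conf_direction, alpha=2.0)

# Attention blocking: stop CC from reading PANL while leaving other edges intact.
mask = blocked_causal_mask(len(tokens), query_pos=cc, key_pos=panl)
```

In a real experiment the same two operations would be applied inside a model's forward pass at chosen layers; how that is wired up depends on the framework and is not specified here.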

Evidence

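Token log-probability explains only 8.4% of the variance in verbal confidence (r=0.29, R²CV=0.084). Internal representations capture a richer answer-quality evaluation that accounts for substantial variance beyond log-probabilities.
Activation swap experiment: transplanting PANL activations from a low-confidence donor trial into a high-confidence recipient trial systematically lowers the reported confidence (cross-confidence swap, peak at layer 26).
Attention blocking experiment: cutting CC's direct attention to the question and answer tokens changes the output only slightly (~10% change rate), which argues against the just-in-time hypothesis. Blocking CC→PANL attention produces a change rate of up to ~21% (layers 30-36).
Calibration: Gemma 3 27B reaches ECE=0.12 and AUROC=0.71; Qwen 2.5 7B reaches ECE=0.06 and AUROC=0.65. Verbal confidence therefore separates correct from incorrect answers to a moderate degree.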
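
For readers who want to compute the same kinds of metrics on their own data, the sketch below shows one common way to calculate ECE and AUROC from confidence scores and correctness labels. It assumes confidences are already scaled to [0, 1] and uses equal-width bins; the paper's exact binning is not stated here, and the six data points are made up for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def auroc(confidences, correct):
    """Probability that a random correct answer gets higher confidence
    than a random incorrect one (ties count as half)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example data (not from the paper).
confidences = [0.95, 0.85, 0.75, 0.65, 0.35, 0.25]
correct = [True, True, False, True, False, False]

ece_value = expected_calibration_error(confidences, correct)
auroc_value = auroc(confidences, correct)

# Sanity check on the reported numbers: the squared correlation r=0.29
# is close to the reported cross-validated R^2 of 0.084.
r_squared = 0.29 ** 2
```

A numeric 0-9 rating or a categorical label would first need to be mapped onto [0, 1] before calling these functions; that mapping is a design choice left to the reader.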

How to Apply

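When eliciting verbal confidence by prompt, ask for it right after the answer is generated. The model has already cached confidence during answer generation, so it can return a meaningful score without a separate chain-of-thought.
In black-box API settings where log-probabilities are unavailable, verbal confidence can serve as the uncertainty signal. According to this study it reflects answer quality more richly than log-probabilities, so it is more trustworthy than a simple fluency check.
With a white-box model, you can train a linear probe on the residual stream activation at the PANL position (the newline token right after the answer) and build a pipeline that separately decodes answer correctness (AUROC ~0.75) or confidence magnitude.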
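
The probing idea in the last item can be sketched as follows. This is a simplified stand-in, not the paper's method: it uses a difference-of-means linear classifier instead of a trained logistic-regression probe, and the "activations" are synthetic 2-D vectors rather than real PANL residual stream activations. All function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(activations, labels):
    """Fit a linear probe whose weight vector is mean(correct) - mean(incorrect).

    The bias places the decision boundary at the midpoint between class means.
    """
    pos = [a for a, y in zip(activations, labels) if y]
    neg = [a for a, y in zip(activations, labels) if not y]
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    weights = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def probe_score(weights, bias, activation):
    """Linear readout: positive means 'predicted correct'."""
    return sum(w * a for w, a in zip(weights, activation)) + bias


# Synthetic stand-ins for PANL activations (not real model data).
train_acts = [[1.0, 0.2], [0.8, -0.1], [-0.9, 0.1], [-1.1, 0.0]]
train_labels = [True, True, False, False]   # was the answer correct?

weights, bias = fit_mean_difference_probe(train_acts, train_labels)

held_out = [[0.7, 0.0], [-0.6, 0.3]]
predictions = [probe_score(weights, bias, a) > 0 for a in held_out]
```

With real activations, the raw `probe_score` values (rather than the thresholded predictions) are what one would feed into an AUROC calculation.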

Code Example

# Prompt patterns for eliciting verbal confidence (based on Figure 8/13 of the paper)
# Generate the answer first, then request confidence in the same context

system_prompt = """Answer the question, then rate your confidence."""

# Step 1: generate the answer
question = "What is the capital of France?"
answer_prompt = f"Q: {question}\nA:"
answer = "Paris"  # stand-in for the model's generated answer

# Step 2: request confidence in the same context (relies on cached retrieval)
confidence_prompt = f"""Q: {question}
A: {answer}
Confidence (0-9): """
# The model has already cached confidence while generating the answer,
# so it can return a meaningful score without chain-of-thought

# Categorical version (in the style of Yoon et al. 2025)
categorical_prompt = f"""Q: {question}
A: {answer}
How confident are you?
Options: No chance / Really unlikely / Chances are slight /
Unlikely / Somewhat likely / Likely / Very good chance /
Highly likely / Almost certain
Confidence: """

print("Key point: the newline right after the answer (PANL) is where confidence is cached")
print("Ask for confidence right after the answer; no extra reasoning prompt is needed")
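
Once the model replies to either prompt, the reply still has to be turned into a number. The helper below is one possible way to do that, offered as a sketch: the function names are hypothetical, the 0-9 rating is divided by 9 to land in [0, 1], and the nine categorical labels are assumed to be equally spaced, which the source does not claim.

```python
import re

# Labels in ascending order of confidence, matching the categorical prompt.
CATEGORIES = [
    "no chance", "really unlikely", "chances are slight", "unlikely",
    "somewhat likely", "likely", "very good chance", "highly likely",
    "almost certain",
]


def parse_numeric_confidence(text):
    """Return the first standalone digit 0-9 scaled to [0, 1], or None."""
    match = re.search(r"(?<!\d)([0-9])(?!\d)", text)
    return int(match.group(1)) / 9 if match else None


def parse_categorical_confidence(text):
    """Map a label in the reply to its rank scaled to [0, 1], or None.

    Longer labels are checked first so "really unlikely" is not read
    as "unlikely" or "likely".
    """
    lowered = text.lower()
    for label in sorted(CATEGORIES, key=len, reverse=True):
        if label in lowered:
            return CATEGORIES.index(label) / (len(CATEGORIES) - 1)
    return None


numeric_score = parse_numeric_confidence(" 9")
categorical_score = parse_categorical_confidence("Really unlikely.")
```

Returning `None` for unparseable replies, rather than a default score, keeps malformed outputs from silently skewing downstream calibration metrics.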

Terminology

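Verbal Confidence: Having the model answer the question "how confident are you in your answer?" with a number or a category (e.g. "almost certain"). Confidence expressed in words rather than as an internal probability.

Token Log-Probability: The log of the probability the model assigned to each token it generated. Higher values mean the model favored that token more strongly. Often not exposed by hosted APIs.

Activation Steering: Injecting a vector with a specific direction into the model's internal computation to shift its output behavior. Like changing TV channels with a remote, it can turn the model's "confidence channel" up or down.

Activation Patching: Replacing the activation at a specific internal location with a clean value to test whether that location is sufficient for a given computation. Similar to swapping out one wire in a circuit to locate a fault.

Mechanistic Interpretability: The research field that analyzes, at the circuit level, how information flows and is computed inside an LLM. It amounts to opening the lid of a black-box model and tracing the wiring.

PANL: Short for Post-Answer-NewLine. The newline token immediately after the model finishes its answer. This study identifies it as the key position where confidence information is cached.

Linear Probe: Training a simple linear classifier on internal activations to check whether specific information (e.g. answer correctness) is encoded at that location. It detects internal information much as an X-ray reveals bone structure.

Calibration (ECE): Whether a model that says "90% confident" is actually right 90% of the time. The lower the ECE (Expected Calibration Error), the more closely stated confidence matches actual accuracy.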

Original Abstract

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.