How Do LLMs Compute Verbal Confidence?
TL;DR Highlight
A mechanistic interpretability study revealing that when LLMs say 'I'm confident/unsure,' that information is automatically computed and cached during answer generation.
Who Should Read
ML engineers adding confidence scores to LLM outputs or leveraging uncertainty estimation. Researchers and product developers wondering 'does an LLM know when it's wrong?'
Core Mechanics
- When asked for a confidence score, the model doesn't compute it just-in-time at the moment of the request; it already calculated and cached confidence while generating the answer (the cached-retrieval hypothesis)
- In Gemma 3 27B, the post-answer newline token (PANL) is the key cache point. Confidence info is stored here first at layers 21-25, then transferred to the confidence-colon token (CC) at layers 30-35
- Verbal confidence is NOT simply a summary of token log-probability (how confidently the model generated each word). Log-prob explains only 8.4% of verbal confidence variance
- Activation steering (injecting activation vectors to manipulate model behavior) at the PANL position with high/low confidence vectors actually changes output confidence — evidence that real confidence info lives at that position
- Attention blocking experiments confirmed the information flow path: answer tokens → PANL → CC in sequence
- Same pattern reproduced across both Gemma 3 27B and Qwen 2.5 7B models, and both categorical/numeric prompt formats
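The difference-of-means steering idea behind the PANL experiments can be sketched in a few lines of numpy. This is a toy illustration with a made-up `d_model` and random activations; the paper's actual steering vectors come from real residual-stream activations at the PANL position:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size; Gemma 3 27B's is far larger

# Stand-ins for residual-stream activations at the PANL position,
# collected from high- and low-confidence trials
high_conf_acts = rng.normal(1.0, 1.0, size=(100, d_model))
low_conf_acts = rng.normal(-1.0, 1.0, size=(100, d_model))

# Difference-of-means steering vector: points from low- toward high-confidence
steer = high_conf_acts.mean(axis=0) - low_conf_acts.mean(axis=0)

def steer_activation(resid, direction, alpha=1.0):
    """Add a scaled steering vector to the residual stream at one position."""
    return resid + alpha * direction

# Push a low-confidence activation toward the high-confidence region
steered = steer_activation(low_conf_acts[0], steer, alpha=1.0)
```

In the actual experiments this addition happens inside a forward-pass hook at the PANL token, at the layers where the confidence representation lives (21-25 in Gemma 3 27B).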
Evidence
- Token log-probability explains only 8.4% of verbal confidence variance (r=0.29, R²CV=0.084). The rest is a richer answer-quality assessment captured separately by internal representations
- Activation swap experiment: transplanting PANL activations from a low-confidence donor trial to a high-confidence recipient trial systematically lowered confidence (cross-confidence swap, peak layer 26)
- Attention blocking: blocking CC's direct attention to question/answer tokens changed outputs in only ~10% of trials (rejecting the just-in-time hypothesis), while blocking CC's attention to PANL changed up to ~21% (layers 30-36)
- Gemma 3 27B calibration: ECE=0.12, AUROC=0.71. Qwen 2.5 7B: ECE=0.06, AUROC=0.65 — verbal confidence does meaningfully distinguish correct from incorrect answers
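ECE, the calibration metric quoted above, is simple to compute yourself. A minimal numpy implementation using equal-width bins (one common convention among several):

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    """ECE: bin-weight-averaged |accuracy - mean confidence| per bin."""
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # first bin is closed on the left, the rest are half-open
        mask = (confs >= lo if i == 0 else confs > lo) & (confs <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confs[mask].mean())
    return ece

# Perfectly calibrated toy set: stated confidence matches empirical accuracy
confs = np.array([0.95] * 20 + [0.55] * 20)
correct = np.array([1] * 19 + [0] + [1] * 11 + [0] * 9)  # 95% and 55% accurate
ece = expected_calibration_error(confs, correct)  # -> 0.0 for this toy set
```

For AUROC, `sklearn.metrics.roc_auc_score(correct, confs)` gives the companion discrimination metric reported above.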
How to Apply
- When extracting verbal confidence via prompts, ask for confidence immediately after the answer is generated. Since the model already cached confidence during answer generation, you can get meaningful confidence scores without additional chain-of-thought
- In black-box API environments where log-probabilities aren't available, verbal confidence can serve as an uncertainty indicator — per this research, verbal confidence reflects answer-quality more richly than log-prob, making it more reliable than simple fluency checks
- With white-box models, you can train a linear probe on residual stream activations at the PANL position (post-answer newline token) to decode correctness (AUROC ~0.75) or confidence magnitude as a separate pipeline
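The linear-probe recipe from the last bullet can be sketched with sklearn. The activations here are synthetic stand-ins with a planted "correctness" direction (sizes and the 1.5 shift are illustrative assumptions); with a white-box model you would instead capture residual-stream activations at the PANL token during a forward pass:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trials, d_model = 400, 64  # hypothetical; real residual streams are wider

# Synthetic PANL activations: correct-answer trials are shifted along
# a latent correctness direction, mimicking a linearly decodable signal
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
correct = rng.integers(0, 2, size=n_trials)
acts = rng.normal(size=(n_trials, d_model)) + 1.5 * correct[:, None] * direction

# Train a linear probe to decode correctness from the activations
X_tr, X_te, y_tr, y_te = train_test_split(acts, correct, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
```

The probe's held-out AUROC is the number to compare against the ~0.75 the paper reports for real PANL activations.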
Code Example
# Verbal confidence extraction prompt pattern (based on paper Figure 8/13)
# Generate the answer first, then request confidence in the same context
system_prompt = """Answer the question, then rate your confidence."""
# Step 1: Generate answer
question = "What is the capital of France?"
answer_prompt = f"Q: {question}\nA:"
# → model generates: "Paris"
# Step 2: Request confidence in the same context (leveraging cached retrieval)
confidence_prompt = f"""Q: {question}
A: Paris
Confidence (0-9): """
# Since the model already cached confidence during answer generation,
# it returns a meaningful score even without chain-of-thought
# Categorical version (Yoon et al. 2025 style)
categorical_prompt = f"""Q: {question}
A: Paris
How confident are you?
Options: No chance / Really unlikely / Chances are slight /
Unlikely / Somewhat likely / Likely / Very good chance /
Highly likely / Almost certain
Confidence: """
print("Key insight: the newline position immediately after the answer (PANL) is the confidence cache point")
print("Simply ask for confidence right after the answer, without prompting for separate reasoning")
Original Abstract
Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.