How Do LLMs Compute Verbal Confidence?
TL;DR Highlight
A mechanistic interpretability study revealing that when LLMs say 'I'm confident/unsure,' that information is automatically computed and cached during answer generation.
Who Should Read
ML engineers adding confidence scores to LLM outputs or leveraging uncertainty estimation. Researchers and product developers wondering 'does an LLM know when it's wrong?'
Core Mechanics
- When asked for a confidence score, the model doesn't compute it just-in-time at the moment of the request; it already calculated and cached confidence while generating the answer (the cached-retrieval hypothesis)
- In Gemma 3 27B, the post-answer newline token (PANL) is the key cache point. Confidence info is stored here first at layers 21-25, then transferred to the confidence-colon token (CC) at layers 30-35
- Verbal confidence is NOT simply a summary of token log-probability (how confidently the model generated each word). Log-prob explains only 8.4% of verbal confidence variance
- Activation steering (injecting activation vectors to manipulate model behavior) at the PANL position with high/low confidence vectors actually changes output confidence — evidence that real confidence info lives at that position
- Attention blocking experiments confirmed the information flow path: answer tokens → PANL → CC in sequence
- Same pattern reproduced across both Gemma 3 27B and Qwen 2.5 7B models, and both categorical/numeric prompt formats
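The difference-of-means steering idea behind the PANL experiments can be sketched in a few lines of numpy. This is a toy illustration with a made-up `d_model` and random activations; the paper's actual steering vectors come from real residual-stream activations at the PANL position:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size; Gemma 3 27B's is far larger

# Stand-ins for residual-stream activations at the PANL position,
# collected from high- and low-confidence trials
high_conf_acts = rng.normal(1.0, 1.0, size=(100, d_model))
low_conf_acts = rng.normal(-1.0, 1.0, size=(100, d_model))

# Difference-of-means steering vector: points from low- toward high-confidence
steer = high_conf_acts.mean(axis=0) - low_conf_acts.mean(axis=0)

def steer_activation(resid, direction, alpha=1.0):
    """Add a scaled steering vector to the residual stream at one position."""
    return resid + alpha * direction

# Push a low-confidence activation toward the high-confidence region
steered = steer_activation(low_conf_acts[0], steer, alpha=1.0)
```

In the actual experiments this addition happens inside a forward-pass hook at the PANL token, at the layers where the confidence representation lives (21-25 in Gemma 3 27B).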
Evidence
- Token log-probability explains only 8.4% of verbal confidence variance (r=0.29, R²CV=0.084). The rest is a richer answer-quality assessment captured separately by internal representations
- Activation swap experiment: transplanting PANL activations from a low-confidence donor trial to a high-confidence recipient trial systematically lowered confidence (cross-confidence swap, peak layer 26)
- Attention blocking: blocking CC's direct attention to question/answer tokens changed outputs in only ~10% of trials (rejecting the just-in-time hypothesis), while blocking CC's attention to PANL changed up to ~21% (layers 30-36)
- Gemma 3 27B calibration: ECE=0.12, AUROC=0.71. Qwen 2.5 7B: ECE=0.06, AUROC=0.65 — verbal confidence does meaningfully distinguish correct from incorrect answers
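ECE, the calibration metric quoted above, is simple to compute yourself. A minimal numpy implementation using equal-width bins (one common convention among several):

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    """ECE: bin-weight-averaged |accuracy - mean confidence| per bin."""
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # first bin is closed on the left, the rest are half-open
        mask = (confs >= lo if i == 0 else confs > lo) & (confs <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confs[mask].mean())
    return ece

# Perfectly calibrated toy set: stated confidence matches empirical accuracy
confs = np.array([0.95] * 20 + [0.55] * 20)
correct = np.array([1] * 19 + [0] + [1] * 11 + [0] * 9)  # 95% and 55% accurate
ece = expected_calibration_error(confs, correct)  # -> 0.0 for this toy set
```

For AUROC, `sklearn.metrics.roc_auc_score(correct, confs)` gives the companion discrimination metric reported above.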
How to Apply
- When extracting verbal confidence via prompts, ask for confidence immediately after the answer is generated. Since the model already cached confidence during answer generation, you can get meaningful confidence scores without additional chain-of-thought
- In black-box API environments where log-probabilities aren't available, verbal confidence can serve as an uncertainty indicator — per this research, verbal confidence reflects answer-quality more richly than log-prob, making it more reliable than simple fluency checks
- With white-box models, you can train a linear probe on residual stream activations at the PANL position (post-answer newline token) to decode correctness (AUROC ~0.75) or confidence magnitude as a separate pipeline
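The linear-probe recipe from the last bullet can be sketched with sklearn. The activations here are synthetic stand-ins with a planted "correctness" direction (sizes and the 1.5 shift are illustrative assumptions); with a white-box model you would instead capture residual-stream activations at the PANL token during a forward pass:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trials, d_model = 400, 64  # hypothetical; real residual streams are wider

# Synthetic PANL activations: correct-answer trials are shifted along
# a latent correctness direction, mimicking a linearly decodable signal
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
correct = rng.integers(0, 2, size=n_trials)
acts = rng.normal(size=(n_trials, d_model)) + 1.5 * correct[:, None] * direction

# Train a linear probe to decode correctness from the activations
X_tr, X_te, y_tr, y_te = train_test_split(acts, correct, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
```

The probe's held-out AUROC is the number to compare against the ~0.75 the paper reports for real PANL activations.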
Code Example
# Verbal confidence extraction prompt pattern (based on paper Figure 8/13)
# Generate the answer first, then request confidence in the same context
system_prompt = """Answer the question, then rate your confidence."""
# Step 1: Generate answer
question = "What is the capital of France?"
answer_prompt = f"Q: {question}\nA:"
# → model generates: "Paris"
# Step 2: Request confidence in the same context (leveraging cached retrieval)
confidence_prompt = f"""Q: {question}
A: Paris
Confidence (0-9): """
# Since the model already cached confidence during answer generation,
# it returns a meaningful score even without chain-of-thought
# Categorical version (Yoon et al. 2025 style)
categorical_prompt = f"""Q: {question}
A: Paris
How confident are you?
Options: No chance / Really unlikely / Chances are slight /
Unlikely / Somewhat likely / Likely / Very good chance /
Highly likely / Almost certain
Confidence: """
print("Key insight: the newline position immediately after the answer (PANL) is the confidence cache point")
print("Simply ask for confidence right after the answer, without prompting for separate reasoning")
Original Abstract
Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.