How Uncertainty Estimation Scales with Sampling in Reasoning Models
TL;DR Highlight
For measuring uncertainty in reasoning models, combining verbalized confidence (VC) and self-consistency (SC) with just 2 samples beats 8 samples of either method alone.
Who Should Read
Engineers and researchers doing uncertainty quantification for LLMs — especially those working with reasoning/chain-of-thought models where uncertainty measurement is harder.
Core Mechanics
- Uncertainty estimation for reasoning models is harder than for standard LLMs because multiple valid reasoning paths can lead to the same answer
- Verbal Confidence (VC) — asking the model to state its confidence — and Self-Consistency (SC) — sampling multiple times and checking agreement — capture complementary aspects of uncertainty
- Combining VC and SC with just 2 samples outperforms either method alone with 8 samples in terms of calibration and discriminative ability
- The combination works because the two signals are complementary: VC reflects epistemic uncertainty (gaps in the model's knowledge), while SC reflects aleatoric-style uncertainty (variability across sampled reasoning paths)
- This 2-sample VC+SC combination is 4x cheaper than 8-sample SC while being more accurate
- The approach is particularly valuable for routing decisions: when to use expensive verification vs. trust the model's answer
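The routing idea above reduces to a simple threshold rule on the combined score. A minimal sketch (the 0.7 threshold is illustrative, matching the usage note in the code example below, not a recommendation from the paper):

```python
def route(scvc_score: float, threshold: float = 0.7) -> str:
    """Route a query based on the combined confidence score in [0, 1].

    High score -> trust the fast greedy answer; low score -> escalate
    to verification (more samples, external tools, or human review).
    """
    return "trust" if scvc_score >= threshold else "verify"
```

In practice the threshold should come from a held-out validation set, as described under "How to Apply".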
Evidence
- ECE (Expected Calibration Error) for 2-sample VC+SC: 0.087 vs. 8-sample SC alone: 0.124 — 30% better calibration with 4x lower cost
- AUROC for detecting wrong answers: VC+SC 0.79 vs. SC-only 0.72 vs. VC-only 0.68
- Tested across 4 reasoning benchmarks (MATH, GSM8K, ARC, HellaSwag) with consistent improvement
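For reference, the ECE metric cited above bins predictions by confidence and averages the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal equal-width-bin sketch (not the paper's exact implementation):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - confidence| per bin.

    confidences: list of floats in [0, 1]; correct: list of 0/1 labels.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece
```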
How to Apply
- For each query, generate 2 samples: have the model state its verbal confidence for each, then check if the two samples agree on the final answer. Combine the VC scores and agreement signal into a single uncertainty estimate.
- Use this uncertainty estimate for routing: high-uncertainty queries get additional verification (more samples, external tool use, human review); low-uncertainty queries get fast greedy decoding.
- Calibrate the routing threshold on a held-out validation set — the optimal threshold varies by task and acceptable error rate.
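The threshold calibration step can be sketched as a sweep over candidate thresholds on the validation set, picking the lowest one (i.e., the one that trusts the most answers) whose trusted subset still meets the error budget. `target_error` is an assumed knob, not from the paper:

```python
def calibrate_threshold(scores, correct, target_error=0.05):
    """Return the lowest threshold t such that answers with score >= t
    have an error rate <= target_error on the validation set.

    scores: confidence scores in [0, 1]; correct: 0/1 labels.
    """
    for t in sorted(set(scores)):  # lowest first => most answers trusted
        trusted = [ok for s, ok in zip(scores, correct) if s >= t]
        if not trusted:
            break
        error = 1 - sum(trusted) / len(trusted)
        if error <= target_error:
            return t
    return 1.01  # no trustworthy region: route everything to verification
```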
Code Example
import re
from collections import Counter

def get_scvc_score(responses: list[dict]) -> float:
    """
    responses: [{"answer": "A", "confidence": 85}, ...]
    SCVC = 0.5 * SC + 0.5 * VCavg
    """
    answers = [r["answer"] for r in responses]
    confidences = [r["confidence"] / 100.0 for r in responses]
    K = len(responses)
    # Majority vote answer
    counter = Counter(answers)
    majority_answer = counter.most_common(1)[0][0]
    # Self-Consistency (SC): fraction agreeing with majority
    sc = sum(1 for a in answers if a == majority_answer) / K
    # Verbalized Confidence avg (VC): avg confidence of majority voters
    majority_confidences = [
        confidences[i] for i, a in enumerate(answers) if a == majority_answer
    ]
    vc_avg = sum(majority_confidences) / len(majority_confidences)
    # SCVC hybrid
    scvc = 0.5 * sc + 0.5 * vc_avg
    return scvc
# Prompt template for elicitation
PROMPT_TEMPLATE = """
You are given a multiple choice question.
Solve the problem, showing your reasoning step by step.
After solving, provide your confidence in your answer.
Give a confidence number from 1 to 100 that represents
your overall confidence that the final answer is correct.
{question}
{choices}
Your response must end with exactly two lines:
ANSWER: $LETTER
CONFIDENCE: $NUMBER
"""
def parse_response(text: str) -> dict | None:
    answer_match = re.search(r'ANSWER:\s*([A-Z])', text)
    conf_match = re.search(r'CONFIDENCE:\s*(\d+)', text)
    if answer_match and conf_match:
        return {
            "answer": answer_match.group(1),
            "confidence": int(conf_match.group(1)),
        }
    return None
# Example usage with K=2 samples
K = 2
responses = []
for _ in range(K):
    # raw_output = call_your_llm(PROMPT_TEMPLATE.format(...))
    # parsed = parse_response(raw_output)
    # if parsed: responses.append(parsed)
    pass
# score = get_scvc_score(responses)
# if score > 0.7: trust the answer, else flag for review
Original Abstract
Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to +12 on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.