How Uncertainty Estimation Scales with Sampling in Reasoning Models
TL;DR Highlight
For measuring uncertainty in reasoning models, combining verbalized confidence (VC) and self-consistency (SC) with just 2 samples beats 8 samples of either method alone.
Who Should Read
Engineers and researchers doing uncertainty quantification for LLMs — especially those working with reasoning/chain-of-thought models where uncertainty measurement is harder.
Core Mechanics
- Uncertainty estimation for reasoning models is harder than for standard LLMs because multiple valid reasoning paths can lead to the same answer
- Verbal Confidence (VC) — asking the model to state its confidence — and Self-Consistency (SC) — sampling multiple times and checking agreement — capture complementary aspects of uncertainty
- Combining VC and SC with just 2 samples outperforms either method alone with 8 samples in terms of calibration and discriminative ability
- The combination works because the two signals are complementary: VC reflects epistemic uncertainty (gaps in the model's knowledge), while SC reflects aleatoric-style uncertainty (variability across sampled reasoning paths)
- This 2-sample VC+SC combination is 4x cheaper than 8-sample SC while being more accurate
- The approach is particularly valuable for routing decisions: when to use expensive verification vs. trust the model's answer
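The routing idea above reduces to a simple threshold rule on the combined score. A minimal sketch (the 0.7 threshold is illustrative, matching the usage note in the code example below, not a recommendation from the paper):

```python
def route(scvc_score: float, threshold: float = 0.7) -> str:
    """Route a query based on the combined confidence score in [0, 1].

    High score -> trust the fast greedy answer; low score -> escalate
    to verification (more samples, external tools, or human review).
    """
    return "trust" if scvc_score >= threshold else "verify"
```

In practice the threshold should come from a held-out validation set, as described under "How to Apply".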
Evidence
- ECE (Expected Calibration Error) for 2-sample VC+SC: 0.087 vs. 8-sample SC alone: 0.124 — 30% better calibration with 4x lower cost
- AUROC for detecting wrong answers: VC+SC 0.79 vs. SC-only 0.72 vs. VC-only 0.68
- Tested across 4 reasoning benchmarks (MATH, GSM8K, ARC, HellaSwag) with consistent improvement
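For reference, the ECE metric cited above bins predictions by confidence and averages the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal equal-width-bin sketch (not the paper's exact implementation):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - confidence| per bin.

    confidences: list of floats in [0, 1]; correct: list of 0/1 labels.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi], with 0.0 assigned to the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece
```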
How to Apply
- For each query, generate 2 samples: have the model state its verbal confidence for each, then check if the two samples agree on the final answer. Combine the VC scores and agreement signal into a single uncertainty estimate.
- Use this uncertainty estimate for routing: high-uncertainty queries get additional verification (more samples, external tool use, human review); low-uncertainty queries get fast greedy decoding.
- Calibrate the routing threshold on a held-out validation set — the optimal threshold varies by task and acceptable error rate.
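The threshold calibration step can be sketched as a sweep over candidate thresholds on the validation set, picking the lowest one (i.e., the one that trusts the most answers) whose trusted subset still meets the error budget. `target_error` is an assumed knob, not from the paper:

```python
def calibrate_threshold(scores, correct, target_error=0.05):
    """Return the lowest threshold t such that answers with score >= t
    have an error rate <= target_error on the validation set.

    scores: confidence scores in [0, 1]; correct: 0/1 labels.
    """
    for t in sorted(set(scores)):  # lowest first => most answers trusted
        trusted = [ok for s, ok in zip(scores, correct) if s >= t]
        if not trusted:
            break
        error = 1 - sum(trusted) / len(trusted)
        if error <= target_error:
            return t
    return 1.01  # no trustworthy region: route everything to verification
```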
Code Example
import re
from collections import Counter

def get_scvc_score(responses: list[dict]) -> float:
    """
    responses: [{"answer": "A", "confidence": 85}, ...]
    SCVC = 0.5 * SC + 0.5 * VCavg
    """
    answers = [r["answer"] for r in responses]
    confidences = [r["confidence"] / 100.0 for r in responses]
    K = len(responses)
    # Majority vote answer
    counter = Counter(answers)
    majority_answer = counter.most_common(1)[0][0]
    # Self-Consistency (SC): fraction agreeing with majority
    sc = sum(1 for a in answers if a == majority_answer) / K
    # Verbalized Confidence avg (VC): avg confidence of majority voters
    majority_confidences = [
        confidences[i] for i, a in enumerate(answers) if a == majority_answer
    ]
    vc_avg = sum(majority_confidences) / len(majority_confidences)
    # SCVC hybrid
    scvc = 0.5 * sc + 0.5 * vc_avg
    return scvc
# Prompt template for elicitation
PROMPT_TEMPLATE = """
You are given a multiple choice question.
Solve the problem, showing your reasoning step by step.
After solving, provide your confidence in your answer.
Give a confidence number from 1 to 100 that represents
your overall confidence that the final answer is correct.
{question}
{choices}
Your response must end with exactly two lines:
ANSWER: $LETTER
CONFIDENCE: $NUMBER
"""
def parse_response(text: str) -> dict | None:
    answer_match = re.search(r'ANSWER:\s*([A-Z])', text)
    conf_match = re.search(r'CONFIDENCE:\s*(\d+)', text)
    if answer_match and conf_match:
        return {
            "answer": answer_match.group(1),
            "confidence": int(conf_match.group(1)),
        }
    return None
# Example usage with K=2 samples
K = 2
responses = []
for _ in range(K):
    # raw_output = call_your_llm(PROMPT_TEMPLATE.format(...))
    # parsed = parse_response(raw_output)
    # if parsed: responses.append(parsed)
    pass
# score = get_scvc_score(responses)
# if score > 0.7: trust the answer, else flag for review
Original Abstract
Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to +12 on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.