On Calibration of Large Language Models: From Response To Capability
TL;DR Highlight
A new calibration framework in which LLMs predict not 'is this specific answer correct?' but 'how likely am I to solve this query at all?': capability-level rather than response-level confidence.
Who Should Read
ML engineers using LLM confidence scores for inference budget allocation or model routing. Useful if you want to predict pass@k performance without drawing k samples, or route queries to stronger models only when needed.
Core Mechanics
- Capability calibration (how likely is the model to solve this query, in expectation over sampled responses?) is more reliable than response calibration (is this specific generated answer correct?)
- The framework accurately predicts pass@k performance from a single sample, so there is no need to generate k outputs just to estimate it
- Capability-calibrated confidence correlates more strongly with actual accuracy than token-level probability or verbalized confidence
- The approach enables principled inference budget allocation: spend more compute on queries the model is uncertain about
- Works across model families and task types without task-specific calibration
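The single-sample pass@k claim rests on a simple identity: if p is the calibrated probability that one sampled response to a query is correct, and samples are independent, then pass@k = 1 - (1 - p)^k. A quick sanity check against simulation (the values here are illustrative, not from the paper):

```python
import random

def pass_at_k_closed_form(p: float, k: int) -> float:
    """P(at least one of k independent samples is correct) when each succeeds w.p. p."""
    return 1 - (1 - p) ** k

def pass_at_k_simulated(p: float, k: int, trials: int = 100_000) -> float:
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = random.Random(0)  # fixed seed for reproducibility
    hits = sum(any(rng.random() < p for _ in range(k)) for _ in range(trials))
    return hits / trials

p, k = 0.3, 5
print(pass_at_k_closed_form(p, k))  # about 0.832
print(pass_at_k_simulated(p, k))    # should agree to roughly two decimal places
```

This is why a single calibrated confidence per query suffices: the k-sample success rate is a deterministic function of p.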
Evidence
- ECE (Expected Calibration Error) on math benchmarks: 0.047 for capability calibration vs. 0.183 for response calibration
- pass@5 predicted from a single sample: 91.3% correlation with the actual pass@5 rate
- Routing accuracy (escalate to a stronger model when needed): 84.7% correct routing decisions vs. 71.2% for confidence-threshold routing
- Inference cost reduction with budget allocation: 31% fewer tokens to reach the same accuracy as uniform sampling
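For context on the ECE figures above, a minimal sketch of how ECE is typically computed (10 equal-width bins is a common default; the exact binning protocol is an assumption, not taken from the paper):

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """ECE over equal-width confidence bins: the weighted average of
    |mean confidence - accuracy| per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue  # empty bin contributes nothing
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(1 for i in idx if correct[i]) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - acc)
    return ece

# Perfectly calibrated at 0.8 -> ECE near 0; maximally overconfident -> ECE of 1
print(expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2))
print(expected_calibration_error([1.0], [False]))
```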
How to Apply
- Replace response-level confidence scores with capability-calibrated confidence in your routing or escalation logic, e.g. elicit a per-query capability estimate, or use historical accuracy on similar queries as the confidence signal
- Use capability-calibrated confidence for pass@k prediction: instead of sampling k times, estimate 1 - (1 - p)^k from a single per-query confidence p
- For inference budget allocation, assign compute per query based on the model's calibrated capability confidence: low-confidence queries get more samples or a stronger model
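The routing step above can be sketched as follows (the model names, threshold, and per-query confidence lookup are all illustrative assumptions, not from the paper):

```python
# Calibrated capability confidence per incoming query (made-up values)
capability_confidence = {"q1": 0.92, "q2": 0.41, "q3": 0.78}

def route(query_id: str, threshold: float = 0.7,
          weak: str = "small-llm", strong: str = "large-llm") -> str:
    """Escalate to the stronger model only when capability confidence is low."""
    p = capability_confidence.get(query_id, 0.0)  # unknown queries escalate
    return weak if p >= threshold else strong

for q in ["q1", "q2", "q3"]:
    print(q, "->", route(q))
```

The same shape works for escalation ladders with more than two tiers: sort tiers by cost and pick the cheapest one whose calibrated accuracy clears your target.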
Code Example
# Verbalized Confidence prompt (for API-only models)
prompt = """
Question: {question}
How likely are you to answer the question correctly?
You may refer to the following probabilities P:
- 0.0-0.1: "Almost no chance"
- 0.1-0.2: "Highly unlikely"
- 0.2-0.3: "Chances are slight"
- 0.3-0.4: "Unlikely"
- 0.4-0.5: "Less than even"
- 0.5-0.6: "Better than even"
- 0.6-0.7: "Likely"
- 0.7-0.8: "Very good chance"
- 0.8-0.9: "Highly likely"
- 0.9-1.0: "Almost certain"
Reason about your uncertainty and confidence, then provide a probability P between 0.0 and 1.0 in the format of \\boxed{P}.
"""
# Simulate pass@k with capability-calibrated confidence
def simulate_pass_at_k(confidence_scores: list[float], k: int) -> float:
    """Estimate mean pass@k from per-query capability-calibrated confidences p."""
    pass_at_k_per_instance = [1 - (1 - p) ** k for p in confidence_scores]
    return sum(pass_at_k_per_instance) / len(pass_at_k_per_instance)
# Example: estimated success rate for 10 questions with given confidences at k=5
confidences = [0.9, 0.3, 0.7, 0.5, 0.8, 0.2, 0.6, 0.4, 0.95, 0.1]
print(f"Estimated pass@5: {simulate_pass_at_k(confidences, k=5):.3f}")
# Inference budget greedy allocation (Damani et al. 2024 approach)
import heapq

def greedy_budget_allocation(confidences: list[float], total_budget: int) -> list[int]:
    """Greedily allocate samples across queries based on capability confidence."""
    n = len(confidences)
    allocations = [1] * n  # minimum of 1 sample each
    remaining = total_budget - n
    # Marginal gain of one extra sample at current allocation k: p * (1 - p)^k
    heap = []
    for i, p in enumerate(confidences):
        gain = p * (1 - p) ** 1  # gain of a 2nd sample, given k=1 already allocated
        heapq.heappush(heap, (-gain, i))
    for _ in range(remaining):
        neg_gain, i = heapq.heappop(heap)
        allocations[i] += 1
        k = allocations[i]
        p = confidences[i]
        new_gain = p * (1 - p) ** k
        heapq.heappush(heap, (-new_gain, i))
    return allocations
Original Abstract
Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.