Redefining LLM Calibration: From Single-Response Accuracy to Model Capability Estimation

On Calibration of Large Language Models: From Response To Capability

Feb 14, 2026 • Sin-Han Yang, Cheng-Kuang Wu, Chengxi Wu +4

TL;DR Highlight

A new calibration framework in which the LLM predicts 'Can I solve this question overall?' rather than 'Is this particular answer correct?'

Who Should Read

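ML engineers who want to use LLM confidence scores for inference budget allocation or model routing. Researchers who want to predict pass@k performance without sampling, or to spend test-time compute more efficiently.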

Core Mechanics

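Conventional response calibration uses the correctness of a single sample as the confidence target. LLMs respond stochastically, however, so one wrong answer does not mean the model cannot solve the question. This mismatch is the core problem.
The proposed capability calibration instead defines the confidence target as the expected accuracy over infinitely many samples of the same question, which measures what the model can actually do.
Mathematically, the gap between the two calibrations decomposes into the variance of response correctness: the more erratic the model's answers, the larger the gap.
Training a linear probe on the LLM's hidden states gives the best cost-performance trade-off, achieving capability calibration for less than the cost of decoding a single token.
Capability-calibrated confidence can simulate pass@k without actually sampling: Oracle-CC reaches MSE ≈ 0, whereas the error of Oracle-RC grows sharply as k increases.
With gpt-oss-20b, verbalized confidence alone, which needs only API access, yields a meaningful improvement over uniform inference budget allocation.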
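
The variance decomposition can be checked numerically. The sketch below is not taken from the paper's code. It assumes squared-error (Brier) scoring, a confidence c that depends only on the question, and a known per-question expected accuracy p. The function names and the toy numbers are illustrative.

```python
def response_brier(conf: list[float], p_true: list[float]) -> float:
    """Expected squared error against one sampled correctness Z ~ Bernoulli(p)."""
    # E[(c - Z)^2] = p * (1 - c)^2 + (1 - p) * c^2
    terms = [p * (1 - c) ** 2 + (1 - p) * c ** 2 for c, p in zip(conf, p_true)]
    return sum(terms) / len(terms)


def capability_brier(conf: list[float], p_true: list[float]) -> float:
    """Squared error against the expected accuracy p itself."""
    terms = [(c - p) ** 2 for c, p in zip(conf, p_true)]
    return sum(terms) / len(terms)


def mean_variance(p_true: list[float]) -> float:
    """Average Bernoulli variance p * (1 - p) of response correctness."""
    return sum(p * (1 - p) for p in p_true) / len(p_true)


# Toy values (hypothetical): expected accuracies and some confidence estimates.
p_true = [0.9, 0.5, 0.1, 0.0, 1.0]
conf = [0.8, 0.6, 0.3, 0.1, 0.9]

gap = response_brier(conf, p_true) - capability_brier(conf, p_true)
print(f"gap = {gap:.4f}, mean variance = {mean_variance(p_true):.4f}")
```

Under these assumptions the gap does not depend on the confidence values at all: it is the irreducible noise a response-level target adds on top of the capability-level error.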

Evidence

pass@64 MSE on MATH-500 with OLMo-3-7B: Probe-MATH 0.0148 vs Oracle-RC 0.0935, so the probe's error is about 6x lower than Oracle-RC's.
OLMo-3-7B on TriviaQA: the probe (trained on TriviaQA) reaches a Brier score of 0.1113 vs 0.2745 for the random baseline, roughly 2.5x lower.
For gpt-oss-20b inference budget allocation on MATH-500, both Probe-MATH and verbalized confidence outperform uniform allocation at every budget level.
Scatter plots across all combinations of 3 models × 7 datasets show that the response calibration target C(x,ŷ) and the capability calibration target are clearly separated.
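
To make the pass@k MSE figures concrete, here is a minimal sketch of one way such an error could be computed. It is an assumption, not the paper's evaluation code: it compares the prediction 1-(1-p)^k per question against the standard unbiased pass@k estimator 1 - C(n-c, k)/C(n, k) computed from n real samples with c correct, then averages the squared differences. All function names and the example numbers are hypothetical.

```python
from math import comb


def empirical_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def predicted_pass_at_k(p: float, k: int) -> float:
    """pass@k implied by a capability-calibrated confidence p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_mse(confidences: list[float], correct_counts: list[int],
                  n: int, k: int) -> float:
    """Mean squared error between predicted and sampled pass@k per question."""
    errors = [
        (predicted_pass_at_k(p, k) - empirical_pass_at_k(n, c, k)) ** 2
        for p, c in zip(confidences, correct_counts)
    ]
    return sum(errors) / len(errors)


# Hypothetical data: 3 questions, 20 samples each, with 16, 5, and 1 correct.
print(pass_at_k_mse([0.8, 0.25, 0.05], [16, 5, 1], n=20, k=4))
```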

How to Apply

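For API-based LLMs (no access to internals), elicit query-level confidence with a verbalized confidence prompt: ask 'How likely are you to answer this question correctly?' with an answer between 0 and 1, and use the result for inference budget allocation.
If you run an open-source model (OLMo, Qwen, etc.), train a linear probe on the mean-pooled activations of each transformer layer to extract a capability-calibrated score; the inference cost is less than one token.
When you need to predict pass@k (e.g. to estimate a coding agent's success rate), skip the 100 samples and compute it from the capability-calibrated confidence p as pass@k = 1-(1-p)^k.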
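
The probe step could look roughly like the sketch below. It is a toy illustration under loud assumptions, not the authors' implementation: it uses a single layer, pure-Python gradient descent, a cross-entropy loss on soft targets, and synthetic stand-in data instead of real hidden states. In practice the features would come from the model's activations and the targets from the fraction of correct answers over many samples; `mean_pool`, `train_probe`, and `predict` are hypothetical names.

```python
import math
import random


def mean_pool(token_states: list[list[float]]) -> list[float]:
    """Average per-token hidden vectors into one query-level feature vector."""
    n_tokens = len(token_states)
    return [sum(col) / n_tokens for col in zip(*token_states)]


def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: list[float], bias: float, features: list[float]) -> float:
    """Probe output: estimated expected accuracy in [0, 1]."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_probe(features, targets, lr=1.0, epochs=2000):
    """Fit a logistic probe to soft targets with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(features, targets):
            err = predict(weights, bias, x) - t  # gradient of cross-entropy w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


random.seed(0)
# Synthetic stand-in for hidden states: 40 questions, 6 tokens each, hidden size 3.
hidden = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(6)] for _ in range(40)]
features = [mean_pool(tokens) for tokens in hidden]
# Synthetic stand-in for expected accuracy (normally estimated from repeated sampling).
targets = [sigmoid(3.0 * x[0] - 2.0 * x[1] + 0.5) for x in features]

probe_w, probe_b = train_probe(features, targets)
print(f"confidence for question 0: {predict(probe_w, probe_b, features[0]):.3f}")
```

At inference time the probe needs only one forward pass over the prompt plus a dot product, which is where the low cost comes from.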

Code Example

# Verbalized confidence prompt (for API-only models).
# Fill it with prompt.format(question=...); the doubled braces in the last
# line are str.format escapes and render as \boxed{P}.
prompt = """
Question: {question}

How likely are you to answer the question correctly?
You may refer to the following probabilities P:
- 0.0-0.1: "Almost no chance"
- 0.1-0.2: "Highly unlikely"
- 0.2-0.3: "Chances are slight"
- 0.3-0.4: "Unlikely"
- 0.4-0.5: "Less than even"
- 0.5-0.6: "Better than even"
- 0.6-0.7: "Likely"
- 0.7-0.8: "Very good chance"
- 0.8-0.9: "Highly likely"
- 0.9-1.0: "Almost certain"

Reason about your uncertainty and confidence, then provide a probability P between 0.0 and 1.0 in the format of \\boxed{{P}}.
"""

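# Simulate pass@k from capability-calibrated confidence
def simulate_pass_at_k(confidence_scores: list[float], k: int) -> float:
    """Estimate dataset-level pass@k from per-question confidences p."""
    if not confidence_scores:
        raise ValueError("confidence_scores must not be empty")
    # For each question, P(at least one of k samples is correct) = 1 - (1 - p)^k
    pass_at_k_per_instance = [1 - (1 - p) ** k for p in confidence_scores]
    return sum(pass_at_k_per_instance) / len(pass_at_k_per_instance)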

# Example: 10 questions with their confidences; expected success rate at k=5
confidences = [0.9, 0.3, 0.7, 0.5, 0.8, 0.2, 0.6, 0.4, 0.95, 0.1]
print(f"Estimated pass@5: {simulate_pass_at_k(confidences, k=5):.3f}")

# Greedy inference budget allocation (following Damani et al. 2024)
import heapq

def greedy_budget_allocation(confidences: list[float], total_budget: int) -> list[int]:
    """Greedily allocate samples using capability-calibrated confidences."""
    n = len(confidences)
    if n == 0:
        return []
    if total_budget < n:
        raise ValueError("total_budget must cover at least one sample per question")
    allocations = [1] * n  # every question gets at least one sample
    remaining = total_budget - n

    # Marginal gain of one more sample for a question that already has k:
    # [1 - (1-p)^(k+1)] - [1 - (1-p)^k] = p * (1-p)^k
    # heapq is a min-heap, so gains are stored negated.
    heap = [(-p * (1 - p), i) for i, p in enumerate(confidences)]  # k = 1
    heapq.heapify(heap)

    for _ in range(remaining):
        _, i = heapq.heappop(heap)  # question with the largest marginal gain
        allocations[i] += 1
        p, k = confidences[i], allocations[i]
        heapq.heappush(heap, (-p * (1 - p) ** k, i))

    return allocations
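
The verbalized-confidence prompt above asks the model to end with a boxed probability, so the reply has to be parsed before it can feed pass@k simulation or budget allocation. The helper below is a hedged sketch: the name `parse_boxed_confidence`, the choice to take the last boxed value, and the clamping to [0, 1] are assumptions made here, not details from the paper.

```python
import re
from typing import Optional

# Matches \boxed{0.85}, \boxed{ .3 }, \boxed{1}, and similar.
BOXED_PATTERN = re.compile(r"\\boxed\{\s*([0-9]*\.?[0-9]+)\s*\}")


def parse_boxed_confidence(response_text: str,
                           default: Optional[float] = None) -> Optional[float]:
    """Return the last boxed probability in the reply, clamped to [0, 1]."""
    matches = BOXED_PATTERN.findall(response_text)
    if not matches:
        return default  # the model did not follow the output format
    return min(1.0, max(0.0, float(matches[-1])))


reply = r"The problem looks routine, but I may slip on arithmetic. \boxed{0.85}"
print(parse_boxed_confidence(reply))
```

Returning `default` rather than raising lets the caller decide how to treat malformed replies, for example by falling back to a uniform share of the budget.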

Terminology

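Calibration: A model is well calibrated if, when it says it is 70% confident, it is right about 70% of the time. It measures how closely stated confidence matches actual accuracy.

Brier score: The mean squared error between predicted probabilities and actual correctness. The lower it is, the better confidence matches actual accuracy. The same measure is used to score precipitation-probability forecasts in weather prediction.

Linear probe: A simple linear classifier trained to extract specific information from an LLM's hidden states (the outputs of each layer). It is a thin layer that reads internal signals without modifying the model itself.

pass@k: The probability of getting at least one correct answer when sampling k times on the same question. Widely used in code-generation evaluation. A larger k raises the success probability.

Stochastic decoding: A decoding scheme in which the LLM generates a slightly different response each time. With a non-zero temperature, the same question can yield different answers. The added diversity can improve performance, but it makes any single response less reliable.

Inference budget allocation: Deciding how to divide compute (the number of samples) across multiple questions. The idea is to spend less on easy questions and more on hard ones in order to maximize overall accuracy.

Hidden state: The internal vector representation produced by each transformer layer of an LLM as it processes the input. It is an array of hundreds to thousands of numbers that can be thought of as holding what the model is 'thinking'.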

Related Resources

https://github.com/appier-research/llm-calibration

Original Abstract

Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.