Scaling Analysis of Uncertainty Estimation with Sampling in Reasoning Models

How Uncertainty Estimation Scales with Sampling in Reasoning Models

Mar 19, 2026 • Maksym Del, Markus Kängsepp, Marharyta Domnich +4

TL;DR Highlight

When estimating uncertainty in reasoning models, combining VC and SC over just 2 samples beats either method used alone with 8 samples.

Who Should Read

ML engineers who need to judge the confidence of model responses or implement selective prediction in LLM-based services, especially developers putting reasoning models such as DeepSeek-R1 or Qwen3 into production.

Core Mechanics

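Verbalized Confidence (VC, where the model states its own confidence as a number) is stronger than Self-Consistency (SC, where confidence is measured by agreement across repeated answers) at small sample counts: at K=2 in the math domain, VC reaches 73.4 AUROC versus 70.5 for SC.
The key result is SCVC, a simple 0.5:0.5 sum of VC and SC: with only 2 samples it reaches 84.2 AUROC in the math domain, higher than either VC (81.4) or SC (79.6) used alone with 8 samples.
Combining the two signals is far more cost-effective than drawing more samples, because each sample from a reasoning model is a long chain-of-thought and therefore expensive.
The effect varies strongly by domain: in math, SCVC gains a further +4.2 AUROC from K=2 to K=8, while STEM and humanities gain only +1 to +2 AUROC beyond K=2 and saturate quickly.
Models evaluated: gpt-oss-20b, Qwen3-30B-A3B, and DeepSeek-R1-8B, across 17 tasks (math, STEM, humanities).
The weight λ barely matters: performance is similar anywhere in 0 < λ < 1 and drops only at the extremes (pure VC or pure SC).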
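
The combination described above can be written as a single weighted sum. The sketch below is a minimal illustration, not the paper's code: the function name `scvc` and the convention that `lam` is the weight on SC (so `lam=0` is pure VC and `lam=1` is pure SC) are assumptions made here. The summary itself only states that the two signals are added with equal weights.

```python
def scvc(sc: float, vc: float, lam: float = 0.5) -> float:
    """Weighted combination of self-consistency and verbalized confidence.

    sc  -- fraction of samples agreeing with the majority answer, in [0, 1]
    vc  -- mean verbalized confidence of the majority answer, rescaled to [0, 1]
    lam -- weight on SC (assumed convention): 0 -> pure VC, 1 -> pure SC
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * sc + (1.0 - lam) * vc


# K=2, both samples agree, confidences 90 and 80: SC = 1.0, VC = 0.85
agree = scvc(sc=1.0, vc=0.85)
# K=2, samples disagree, the tie-broken majority sample says 90: SC = 0.5, VC = 0.90
disagree = scvc(sc=0.5, vc=0.90)
print(agree, disagree)
```

With equal weights the two cases come out at roughly 0.925 and 0.70, which shows how a single disagreement pulls the score down even when the stated confidence stays high.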

Evidence

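In the math domain, SCVC at K=2 reaches 84.2 AUROC: +12.9 points over single-sample VC (71.3), and above both VC at K=8 (81.4) and SC at K=8 (79.6).
In STEM and humanities, SCVC at K=2 improves on single-sample VC by +6.4 AUROC (73.8→80.2 and 68.5→74.9), and beats the K=8 single signal by +1.8 and +2.3 respectively.
Scaling VC alone: +10.1 AUROC from K=1 to K=8 in math, versus +4.6 and +4.1 in STEM and humanities, less than half as much.
In math, SCVC still gains +1.5 AUROC from K=5 to K=8, whereas STEM and humanities gain only about +1 to +2 AUROC beyond K=2, effectively saturating.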
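
The AUROC figures above measure how well a confidence score ranks correct answers above incorrect ones. To run the same kind of check on your own data, a minimal pairwise sketch could look like the following. The function name `auroc` and the toy data are illustrative assumptions, not taken from the paper, and the quadratic loop is only suitable for small evaluation sets.

```python
def auroc(scores: list[float], correct: list[bool]) -> float:
    """Probability that a correct answer outscores an incorrect one (ties count 0.5)."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only
scores = [0.9, 0.7, 0.6, 0.5, 0.95]
correct = [True, False, True, False, True]
# Multiply by 100 to match the scale used in this summary
print(100 * auroc(scores, correct))
```

Here 5 of the 6 correct/incorrect pairs are ranked in the right order, so the toy AUROC is 5/6, or about 83 on the 0 to 100 scale.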

How to Apply

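Sample the same question at least twice from a reasoning model (DeepSeek-R1, Qwen3, etc.), extract the CONFIDENCE number from each response, average the confidences of the responses that agree with the majority answer to get VC, and add VC and SC with 0.5:0.5 weights to obtain the confidence score.
In domains with clear-cut answers such as math or coding, raising K to between 5 and 8 still yields additional gains; in humanities, law, or general-knowledge domains most of the gain is already realized at K=2, so fixing K=2 is the practical choice for saving cost (the reported results cover math, STEM, and humanities, so coding and law are extrapolations).
Force the output format in the prompt with 'CONFIDENCE: $NUMBER' (1 to 100) so the model emits its confidence directly, and add error handling that drops any sample whose response fails to parse; this should reproduce the paper's method closely.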
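
Once each question has a confidence score, selective prediction comes down to choosing a threshold and abstaining below it. The following sketch shows one way to inspect the trade-off on a labeled validation set. The function name `selective_report` and the example numbers are hypothetical, and the right threshold depends on your model and domain, so it should be tuned rather than copied.

```python
def selective_report(scores: list[float], correct: list[bool], threshold: float):
    """Return (coverage, accuracy on answered items) when abstaining below threshold."""
    kept = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(kept) / len(scores) if scores else 0.0
    # Accuracy is undefined when every item is rejected
    accuracy = sum(kept) / len(kept) if kept else None
    return coverage, accuracy


# Toy validation data, made up for illustration only
scores = [0.9, 0.7, 0.6, 0.5, 0.95]
correct = [True, False, True, False, True]
for t in (0.65, 0.8):
    print(t, selective_report(scores, correct, t))
```

Raising the threshold from 0.65 to 0.8 in this toy set lowers coverage from 3 of 5 to 2 of 5 questions while accuracy on the answered ones rises from 2/3 to 1.0.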

Code Example

import re
from collections import Counter

def get_scvc_score(responses: list[dict]) -> float:
    """
    Combine self-consistency (SC) and verbalized confidence (VC) into one score.

    responses: [{"answer": "A", "confidence": 85}, ...]
    SCVC = 0.5 * SC + 0.5 * VC_avg
    """
    if not responses:
        raise ValueError("responses must contain at least one parsed sample")

    answers = [r["answer"] for r in responses]
    # Rescale verbalized confidence from the 1-100 range to (0, 1]
    confidences = [r["confidence"] / 100.0 for r in responses]
    k = len(responses)

    # Majority-vote answer; on a tie, Counter keeps the answer seen first
    majority_answer = Counter(answers).most_common(1)[0][0]

    # Self-Consistency (SC): fraction of samples agreeing with the majority
    sc = sum(1 for a in answers if a == majority_answer) / k

    # Verbalized Confidence (VC): mean confidence of the samples that voted
    # for the majority answer
    majority_confidences = [
        c for a, c in zip(answers, confidences) if a == majority_answer
    ]
    vc_avg = sum(majority_confidences) / len(majority_confidences)

    # SCVC hybrid with equal weights
    return 0.5 * sc + 0.5 * vc_avg

# Prompt template for elicitation
PROMPT_TEMPLATE = """
You are given a multiple choice question.
Solve the problem, showing your reasoning step by step.
After solving, provide your confidence in your answer.

Give a confidence number from 1 to 100 that represents
your overall confidence that the final answer is correct.

{question}
{choices}

Your response must end with exactly two lines:
ANSWER: $LETTER
CONFIDENCE: $NUMBER
"""

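def parse_response(text: str) -> dict | None:
    """Extract the final ANSWER/CONFIDENCE pair; return None if parsing fails."""
    # Use the last match of each pattern: the chain-of-thought may mention
    # "ANSWER:" earlier, but the prompt asks for these as the final two lines.
    answers = re.findall(r'ANSWER:\s*([A-Z])\b', text)
    confs = re.findall(r'CONFIDENCE:\s*(\d+)', text)
    if not answers or not confs:
        return None
    confidence = int(confs[-1])
    if not 1 <= confidence <= 100:
        return None  # out-of-range value: treat as a parse failure
    return {"answer": answers[-1], "confidence": confidence}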

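# Example usage with K=2 samples.
# call_your_llm is a placeholder for your own client, not a real API.
K = 2
responses = []
for _ in range(K):
    # raw_output = call_your_llm(PROMPT_TEMPLATE.format(question=..., choices=...))
    # parsed = parse_response(raw_output)
    # if parsed:  # drop samples that fail to parse
    #     responses.append(parsed)
    pass

# if responses:  # every sample may have failed to parse
#     score = get_scvc_score(responses)
#     # 0.7 is an illustrative threshold; tune it on labeled validation data
#     if score > 0.7: trust the answer, else flag for review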

Terminology

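Uncertainty Estimation: Quantifying how confident a model is. Just as a doctor might say they are "90% sure" of a diagnosis, the model attaches a confidence level to its own answer.

AUROC: A metric for how well a confidence score separates correct from incorrect answers. 0.5 is a random coin flip and 1.0 is perfect separation; this page reports it multiplied by 100. Example: an AUROC of 85 means a randomly chosen correct answer receives a higher score than a randomly chosen incorrect one 85% of the time.

Self-Consistency (SC): Estimating confidence by asking the same question several times and measuring how often the answers agree. The idea is that if 4 out of 5 people give the same answer to a problem, that answer is likely to be right.

Verbalized Confidence (VC): Asking the model directly "how sure are you, in percent?" and taking the number it returns. It is available from a single response without extra sampling.

RLVR: Reinforcement Learning with Verifiable Rewards. Reinforcement learning on problems whose answers can be checked, such as math. A core training method for reasoning models such as DeepSeek-R1 and Qwen3.

Reasoning Language Model (RLM): An LLM that goes through a long thinking process (chain-of-thought) before answering. Models such as DeepSeek-R1 or o1 internally explore multiple hypotheses before settling on a solution.

Parallel Sampling: Running the model several times independently on the same input to obtain several responses, much like rolling a die many times to learn its distribution.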
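
Parallel sampling as defined above can be sketched with a thread pool that issues K independent calls on the same prompt. This is an illustrative sketch only: `sample_parallel`, `call_llm`, and `fake_llm` are hypothetical names, and a real client would need to sample with a non-zero temperature so that the K outputs can actually differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def sample_parallel(call_llm: Callable[[str], str], prompt: str, k: int = 2) -> list[str]:
    """Run k independent calls to call_llm on the same prompt and collect the outputs."""
    if k < 1:
        raise ValueError("k must be at least 1")
    with ThreadPoolExecutor(max_workers=k) as pool:
        futures = [pool.submit(call_llm, prompt) for _ in range(k)]
        return [f.result() for f in futures]


# Stand-in for a real model call (hypothetical; replace with your own client)
def fake_llm(prompt: str) -> str:
    return "ANSWER: A\nCONFIDENCE: 80"


outputs = sample_parallel(fake_llm, "Which option is correct?", k=2)
print(len(outputs))
```

Threads fit here because each call mostly waits on network I/O, so the K samples cost roughly the latency of one request rather than K.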

Original Abstract

Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to $+12$ on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.