Early Stopping for Large Reasoning Models via Confidence Dynamics

TL;DR Highlight

Early stopping driven by how the model's confidence evolves cuts unnecessary reasoning and saves 25-50% of tokens.

Who Should Read

ML engineers who deploy reasoning models such as DeepSeek-R1 or Qwen3 in production and are struggling with high inference costs. Developers who want to cut the inference cost of long chain-of-thought generation.

Core Mechanics

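On correct reasoning trajectories, confidence rises quickly early in generation and stabilizes soon after; on incorrect trajectories, confidence fluctuates erratically.
Incorrect trajectories are on average more than twice as long as correct ones (about 25K vs. 12K tokens), so they account for a large share of total compute.
CoDE-Stop combines two signals: (1) a ramping confidence threshold that stops once confidence is high enough, and (2) a degeneration score that detects unstable reasoning.
The degeneration score gives higher weight (log weighting) to early reasoning steps, because early confidence patterns are more useful for telling correct from incorrect trajectories.
Confidence on incorrect trajectories also tends to rise as reasoning gets longer, so looking only at late-stage confidence easily leads to misjudgment. That is why the early signal matters more.
No additional training is needed; the method attaches to an existing model at inference time as-is. It also gives additional gains when combined with prompting techniques such as Chain-of-Draft.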
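
The two signals can be written compactly. The notation below is read off the pseudo-code in this summary's Code Example section rather than taken from the paper, so treat the exact form as an assumption, in particular the linear ramp and the symbols $r_{\min}$ and $S$ (number of ramp steps).

```latex
\begin{aligned}
r_k &= \min\!\Bigl(r_{\max},\; r_{\min} + \tfrac{r_{\max}-r_{\min}}{S}\,k\Bigr), \\
v_i &= \mathbb{1}\bigl[\,2c_i - c_{i-1} < \delta\,\bigr], \qquad
w_i = \log\frac{T_k}{T_i} + 1, \\
D_k &= \sum_{i=2}^{k} w_i\, v_i, \qquad
\text{stop at step } k \text{ if } c_k \ge r_k \ \text{or}\ D_k \ge \tau .
\end{aligned}
```

Here $c_i$ is the confidence of the intermediate answer at step $i$ and $T_i$ is the token position of that step. Since $T_i \le T_k$, every weight satisfies $w_i \ge 1$ and is largest for the earliest steps, which is the log weighting described above.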

Evidence

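On Qwen3-4B, averaged over 4 benchmarks, token usage drops by 25% while accuracy is maintained (8344 → 5956 tokens).
On Qwen3-14B, tokens drop by 49.1% on MATH500 while accuracy holds at 93.0% (4878 → 2529 tokens).
Achieves a better accuracy-compute tradeoff than DEER (the most similar baseline) across all models and benchmarks. Example: on Qwen3-4B AIME, DEER uses 13400 tokens vs. 12800 for CoDE-Stop at similar accuracy.
On GSM8K with Qwen3-4B, reaches a compression rate of 52.4%, the strongest compression (Vanilla 2306 → CoDE-Stop 1233 tokens), with accuracy nearly unchanged (94.8% → 94.6%).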
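
As a quick sanity check, the savings implied by the mean token counts quoted above can be recomputed directly. This is plain arithmetic on the numbers in this summary; the helper name `token_savings` is ours, not the paper's. The ratios of means need not match the reported percentages exactly, presumably because the paper averages per sample or per benchmark (an assumption).

```python
def token_savings(before: float, after: float) -> float:
    """Fraction of tokens saved when going from `before` to `after` tokens."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 1.0 - after / before


# (setting, mean tokens before, mean tokens after), as quoted in the Evidence list
reported = [
    ("Qwen3-4B, 4-benchmark average", 8344, 5956),
    ("Qwen3-14B, MATH500", 4878, 2529),
    ("Qwen3-4B, GSM8K", 2306, 1233),
]

for name, before, after in reported:
    saved = token_savings(before, after)
    remaining = after / before  # compressed length relative to the original
    print(f"{name}: {saved:.1%} saved, {remaining:.1%} of original length")
```

For the three rows this gives roughly 29%, 48%, and 47% saved, in the same range as the reported figures.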

How to Apply

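If you use reasoning models such as Qwen3, DeepSeek-R1, or Nemotron from HuggingFace, take the GitHub code and wrap your inference loop with CoDE-Stop. No additional training is required: set max_new_tokens to 32K and start with δ=0.55 and rmax=0.95.
Each time a reasoning step delimiter such as 'Wait' or '\n\n' appears during reasoning, have the model generate an intermediate answer and measure its confidence; once the degeneration score exceeds the threshold τ, force the final answer at that point.
To tune the cost vs. accuracy tradeoff, change only τ (the degeneration threshold). The paper reports a smooth tradeoff curve: lowering τ stops earlier and saves tokens, while raising it lets the model reason longer and recovers accuracy.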
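
To make the effect of τ concrete, here is a minimal sketch that sweeps τ over one confidence trace and reports the step at which the degeneration signal alone would fire. The trace is synthetic, the function names are ours, and the score follows the pseudo-code in this summary rather than the official implementation; the ramping confidence threshold is deliberately left out.

```python
import math


def degeneration_score(confidences, token_positions, delta=0.55):
    """Log-weighted count of unstable steps, as in this summary's pseudo-code."""
    t_now = token_positions[-1]
    score = 0.0
    for i in range(1, len(confidences)):
        # unstable: confidence plus its latest improvement is still below delta
        if 2 * confidences[i] - confidences[i - 1] < delta:
            score += math.log(t_now / token_positions[i]) + 1.0
    return score


def first_stop_step(confidences, token_positions, tau, delta=0.55):
    """Return the 1-based step at which the score first reaches tau, else None."""
    for k in range(1, len(confidences) + 1):
        if degeneration_score(confidences[:k], token_positions[:k], delta) >= tau:
            return k
    return None


# Synthetic low-confidence trace, one measurement every 500 tokens (made-up data)
confs = [0.30, 0.25, 0.35, 0.20, 0.30, 0.25, 0.30, 0.22]
positions = [500 * (i + 1) for i in range(len(confs))]

for tau in (2.0, 5.0, 8.0):
    print(f"tau={tau}: stop at step {first_stop_step(confs, positions, tau)}")
```

On this trace a smaller τ always stops at an earlier step, which is the token-saving direction of the tradeoff.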

Code Example

# CoDE-Stop core logic (pseudo-code sketch, not the official implementation)
import math

import torch


def compute_confidence(model, tokenizer, context, answer_prompt):
    """Generate an intermediate answer; return the mean probability of its tokens."""
    input_ids = tokenizer(context + answer_prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            max_new_tokens=50,
            do_sample=False,  # greedy decoding for a deterministic estimate
            return_dict_in_generate=True,
            output_scores=True,
        )
    # outputs.scores holds one [batch, vocab] tensor per generated token
    scores = torch.stack(outputs.scores, dim=1)  # [1, gen_len, vocab]
    probs = torch.softmax(scores.float(), dim=-1)
    gen_tokens = outputs.sequences[0, input_ids.shape[1]:]  # generated ids only
    # probability the model assigned to each token it actually generated
    token_probs = probs[0].gather(-1, gen_tokens.unsqueeze(-1)).squeeze(-1)
    return token_probs.mean().item()


def code_stop_check(
    confidences,      # confidence values collected so far [c1, c2, ...]
    token_positions,  # token position of each step [T1, T2, ...], all > 0
    step_k,           # current step index (1-based)
    delta=0.55,       # instability threshold
    tau=7.0,          # degeneration score threshold (tune per benchmark)
    # Confidence-threshold parameters. r_min and ramp_steps are illustrative
    # placeholders; this summary only gives r_max=0.95.
    r_min=0.0, r_max=0.95, ramp_steps=5,
):
    # 1. Ramping confidence threshold: rises linearly from r_min up to r_max
    r_k = min(r_max, r_min + (r_max - r_min) / ramp_steps * step_k)
    c_k = confidences[-1]

    if c_k >= r_k:
        return True, "high_confidence"  # confident enough, stop

    # 2. Degeneration score
    D_k = 0.0
    T_K = token_positions[-1]  # token position of the current (latest) step

    for i in range(1, len(confidences)):
        c_i = confidences[i]
        c_prev = confidences[i - 1]
        T_i = token_positions[i]

        # instability indicator: 1 if confidence is low and not improving
        v_i = 1 if (2 * c_i - c_prev < delta) else 0

        # log weighting: earlier steps get a higher weight
        w_i = math.log(T_K / T_i) + 1

        D_k += w_i * v_i

    if D_k >= tau:
        return True, "degeneration"  # persistently unstable, stop

    return False, "continue"


# Usage example. `model`, `tokenizer`, `prompt`, `reasoning_stream`, and
# `generate_final_answer` are placeholders supplied by the caller.
ANSWER_PROMPT = "\n**Final Answer**\n\nThe final answer is \\boxed{"

confidences = []
token_positions = []
current_context = prompt

# step_k starts at 1 so the ramped threshold is above r_min on the first check
for step_k, (reasoning_chunk, token_pos) in enumerate(reasoning_stream, start=1):
    current_context += reasoning_chunk  # append the newly generated step

    # measure intermediate confidence
    c_k = compute_confidence(model, tokenizer, current_context, ANSWER_PROMPT)
    confidences.append(c_k)
    token_positions.append(token_pos)

    # decide whether to stop
    should_stop, reason = code_stop_check(
        confidences, token_positions, step_k,
        delta=0.55, tau=7.1  # Qwen3-4B AIME setting
    )

    if should_stop:
        print(f"Early stop at step {step_k}, reason: {reason}")
        final_answer = generate_final_answer(model, current_context)
        break
else:
    # the stream ended without an early stop
    final_answer = generate_final_answer(model, current_context)
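
The loop above leaves `reasoning_stream` undefined. As one hedged possibility, the sketch below splits an already generated reasoning trace into steps at the two delimiters named in How to Apply ('\n\n' and 'Wait') and yields `(chunk, cumulative_token_count)` pairs in the shape the loop expects. The function name is ours, whitespace word counting stands in for a real tokenizer, and a production version would hook into streaming generation instead of a finished string.

```python
import re
from typing import Callable, Iterator, Tuple

# Split right after a blank line, or right before the word "Wait".
_STEP_BOUNDARY = re.compile(r"(?<=\n\n)|(?=\bWait\b)")


def iter_reasoning_steps(
    text: str,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> Iterator[Tuple[str, int]]:
    """Yield (chunk, cumulative_token_count) for each reasoning step in `text`."""
    total = 0
    pending = ""
    for piece in _STEP_BOUNDARY.split(text):
        pending += piece
        if not pending.strip():
            continue  # carry whitespace-only pieces into the next step
        total += count_tokens(pending)
        yield pending, total
        pending = ""


trace = "First try x=2.\n\nWait, check again.\n\nSo x=3."
steps = list(iter_reasoning_steps(trace))
for chunk, position in steps:
    print(position, repr(chunk))
```

Passing `count_tokens=lambda s: len(tokenizer(s).input_ids)` with a HuggingFace tokenizer would give token positions closer to what the model actually sees.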

Terminology

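Chain-of-Thought (CoT): The model writes out its intermediate reasoning step by step before giving the final answer, much like a person writing out the working when solving a math problem.

Confidence Score: A number between 0 and 1 expressing how sure the model is of its answer. Computed as the mean probability of the generated tokens.

Degeneration Score: A cumulative score that detects when reasoning is going nowhere. It grows as unstable confidence patterns accumulate, and reasoning stops once it crosses the threshold.

Reasoning Trajectory: The full reasoning flow the model generates on the way to a final answer. Trajectories split into those that reach the correct answer (correct) and those that wander (incorrect).

Overthinking: The model keeps reasoning even after it has already reached the correct answer. This wastes tokens and can even hurt performance.

Inference-time: A method applied at actual usage (inference) time without further training the model. Practical because it can be attached directly with no training.

Compression Rate (CR): How much the reasoning was shortened, expressed as the compressed length relative to the original length. Lower means more compression.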

Original Abstract

Large reasoning models rely on long chain-of-thought generation to solve complex problems, but extended reasoning often incurs substantial computational cost and can even degrade performance due to overthinking. A key challenge is determining when the model should stop reasoning and produce the final answer. In this work, we study the confidence of intermediate answers during reasoning and observe two characteristic behaviors: correct reasoning trajectories often reach high-confidence answers early, while incorrect rollouts tend to produce long, unproductive reasoning traces and exhibit less reliable confidence dynamics. Motivated by these observations, we propose CoDE-Stop (Confidence Dynamics Early Stop), an early stopping method that leverages the dynamics of intermediate answer confidence to decide when to terminate reasoning, requiring no additional training and easily integrating into existing models. We evaluate CoDE-Stop on diverse reasoning and science benchmarks across multiple models. Compared to prior early stopping methods, it achieves a more favorable accuracy-compute tradeoff and reduces total token usage by 25-50% compared to standard full-length reasoning. In addition, we provide analyses of confidence dynamics during reasoning, offering insights into how confidence changes in both correct and incorrect trajectories.