Eliciting Concise Chain-of-Thought Reasoning in LLMs with Self-Training

Self-Training Elicits Concise Reasoning in Large Language Models

Feb 27, 2025 • Tergel Munkhbat, Namgyu Ho, Seohyun Kim +3

TL;DR Highlight

A method that uses self-training to awaken the ability to reason briefly that LLMs already have, cutting output tokens by about 30%.

Who Should Read

ML engineers who want to cut the inference cost of reasoning models, especially teams running CoT (Chain-of-Thought) based math or logical reasoning tasks in production.

Core Mechanics

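LLMs already have the latent ability to produce short, correct reasoning paths: for the same question, correct paths far shorter than average exist within the output distribution.
Zero-shot prompts such as 'Be concise' have almost no effect on math-specialized models (e.g., Qwen2.5-Math), and where they do shorten outputs, the accuracy loss is large.
Concise training data is collected with Best-of-N sampling (keeping the shortest correct answer among N samples), few-shot examples steer sampling toward even shorter reasoning, and the model is then fine-tuned on the result.
FS-BoN (Few-Shot + Best-of-N self-training) achieves a 30% average token reduction while maintaining accuracy, a 2.4x improvement over existing fine-tuning baselines.
Fine-tuning on external data (GPT-4o CoT) shortens outputs but loses substantial accuracy; self-training preserves accuracy far better.
Token savings adjust automatically with problem difficulty: easy problems shrink by 20-40%, hard problems by less.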
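
The first observation above can be made concrete with a toy calculation. The sketch below compares the shortest correct sample for one question with the average sample length. The sample texts, the `Sample` type, and the whitespace word count (a stand-in for a real tokenizer) are all illustrative assumptions, not data or code from the paper.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Sample:
    text: str       # one sampled reasoning path
    correct: bool   # whether its final answer matches the label


def length(sample: Sample) -> int:
    # Whitespace word count, used here only as a proxy for token count.
    return len(sample.text.split())


def shortest_correct_ratio(samples: List[Sample]) -> Optional[float]:
    """Shortest correct length divided by the mean length of all samples."""
    correct = [s for s in samples if s.correct]
    if not correct:
        return None  # no correct path was sampled for this question
    return min(length(s) for s in correct) / mean(length(s) for s in samples)


# Hypothetical samples for a single question.
samples = [
    Sample("First add 15 and 6 step by step . 15 + 6 = 21 . The answer is 21 .", True),
    Sample("15 + 6 = 21 . The answer is 21 .", True),
    Sample("15 + 6 = 22 . The answer is 22 .", False),
]
ratio = shortest_correct_ratio(samples)
print(f"shortest correct / mean length = {ratio:.2f}")
```

A ratio below 1 means a correct path shorter than the average already exists among the samples, which is the signal the self-training methods exploit.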

Evidence

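With FS-GPT4o-BoN on GSM8K, output tokens fall by 35.75% on average (relative length 64.25%), while accuracy stays at 97%, a loss of only 3% relative to the baseline.
Actual wall-clock inference time drops by 15.38% to 52.94% (single H100 GPU, vLLM).
On non-math MMLU-Pro domains (business, chemistry, physics), length also drops by 26.82% on average, with a 16.51% accuracy improvement.
Compared with naive BoN, few-shot conditioning yields much shorter reasoning at the same sample budget: BoN with N=256 and few-shot with N=8 give similar results.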
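
The "relative length" and "token reduction" figures above are two views of the same ratio. As a minimal sketch, assuming relative length is defined as the fine-tuned model's average output length divided by the baseline's, the relationship looks like this. The token counts 400 and 257 are hypothetical, chosen only so that the result reproduces the 64.25% / 35.75% pair.

```python
def relative_length(new_tokens: float, base_tokens: float) -> float:
    """Average output length after fine-tuning, as a percentage of the baseline."""
    return 100.0 * new_tokens / base_tokens


def token_reduction(new_tokens: float, base_tokens: float) -> float:
    """Percentage of tokens saved; the complement of relative length."""
    return 100.0 - relative_length(new_tokens, base_tokens)


# Hypothetical averages: baseline 400 tokens per answer, fine-tuned 257.
base_avg, new_avg = 400, 257
print(f"relative length: {relative_length(new_avg, base_avg):.2f}%")
print(f"token reduction: {token_reduction(new_avg, base_avg):.2f}%")
```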

How to Apply

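From the training data of your target task, sample N=16 reasoning paths per question and use the shortest correct one as fine-tuning data (Naive BoN).
Including 8 concise reasoning examples, written by GPT-4o or by a person, as few-shot context during sampling yields shorter paths; training on these makes the model answer briefly on its own even at zero-shot inference time.
To cover hard questions where few-shot sampling fails to produce a correct answer, also apply an augmentation strategy: pool the few-shot samples with default BoN samples and pick the shortest correct answer from the combined set.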

Code Example

# FS-GPT4o-BoN pipeline overview (pseudo-code)
# Placeholders that are not defined here: model, dataset, extract_answer, default_prompt.

# 1. Prepare few-shot examples (concise reasoning generated with GPT-4o)
few_shot_examples = [
    {"question": "15 + 6 = ?", "solution": "15 + 6 = 21. The answer is 21."},
    # ... prepare 8 examples
]

# 2. Sample short reasoning paths with a few-shot prompt
def build_few_shot_prompt(question, examples):
    prompt = ""
    for ex in examples:
        prompt += f"User: {ex['question']}\nAssistant: {ex['solution']}\n\n"
    prompt += f"User: {question}\nAssistant:"
    return prompt

# 3. Best-of-N: among N samples, keep the shortest one that is correct
def select_shortest_correct(samples, label):
    correct = [s for s in samples if extract_answer(s) == label]
    if not correct:
        return None  # no correct sample for this question
    # Whitespace word count is a rough proxy for token count
    return min(correct, key=lambda s: len(s.split()))

# 4. Build the fine-tuning set from the selected short paths
training_data = []
for question, label in dataset:
    # Few-shot conditioned sampling
    fs_samples = model.generate(
        build_few_shot_prompt(question, few_shot_examples),
        n=16, temperature=0.7
    )
    # Default BoN sampling for augmentation
    bon_samples = model.generate(default_prompt(question), n=16, temperature=0.7)

    all_samples = fs_samples + bon_samples
    best = select_shortest_correct(all_samples, label)
    if best is not None:
        # Train on the bare question so the model is concise at zero-shot time
        training_data.append({"input": question, "output": best})

# Standard SFT fine-tuning on training_data
model.finetune(training_data, epochs=1, lr=1e-5)
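
The pseudo-code above leaves `extract_answer` undefined. As a minimal runnable sketch, assuming solutions end with a phrase of the form "The answer is X" (as in the few-shot example) and that labels are numeric strings, the selection step could look like the following. The regular expression, the `length_fn` parameter, and the sample strings are illustrative assumptions, not the authors' implementation.

```python
import re
from typing import Callable, List, Optional

# Assumed answer format: "The answer is <number>", optionally with "$" or commas.
_ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)", re.IGNORECASE)


def extract_answer(solution: str) -> Optional[str]:
    """Return the last stated numeric answer as a string, or None if absent."""
    matches = _ANSWER_RE.findall(solution)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def select_shortest_correct(
    samples: List[str],
    label: str,
    length_fn: Callable[[str], int] = lambda s: len(s.split()),
) -> Optional[str]:
    """Keep the shortest sample whose extracted answer equals the label.

    length_fn defaults to a whitespace word count; a real tokenizer's
    token count could be passed in instead.
    """
    correct = [s for s in samples if extract_answer(s) == label]
    if not correct:
        return None
    return min(correct, key=length_fn)


# Hypothetical samples for the question "15 + 6 = ?"
samples = [
    "We need to add 15 and 6. Adding them gives 21. The answer is 21.",
    "15 + 6 = 21. The answer is 21.",
    "The answer is 22.",  # shortest overall, but wrong, so it is filtered out
]
best = select_shortest_correct(samples, "21")
print(best)
```

Comparing answers as normalized strings is a simplification; tasks with fractions or symbolic answers would need a task-specific equivalence check.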

Terminology

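Chain-of-Thought (CoT): A technique that has the LLM write out its intermediate reasoning step by step before the final answer, much like showing your work when solving a math problem.

Best-of-N (BoN) sampling: Generating N answers to the same question and picking the best one, like selecting the top scorer among N candidates.

Self-training: Retraining a model on data that the model itself generated, so it improves without external data.

Few-shot conditioning: Showing a few examples in the prompt to steer the model toward a desired format or style, similar to working through sample problems before an exam.

Fine-tuning: Further training an already trained model on data for a specific task so that it specializes in that task.

Inference cost: The compute cost of generating an answer. More tokens mean more cost and more time.

Greedy decoding: Generating an answer by choosing only the highest-probability token at each step. There is no randomness, so the same input always gives the same output.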
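
Best-of-N only works because sampling, unlike greedy decoding, can return different outputs for the same input. The toy sketch below contrasts the two on a single next-token choice. The probability table is made up for illustration, and applying temperature as `p ** (1 / T)` followed by renormalization is one standard way to express temperature scaling over probabilities.

```python
import random
from typing import Dict


def greedy_pick(probs: Dict[str, float]) -> str:
    # Greedy decoding: always take the highest-probability token.
    return max(probs, key=probs.get)


def sample_pick(probs: Dict[str, float], rng: random.Random,
                temperature: float = 1.0) -> str:
    # Temperature sampling: reweight by p ** (1 / T); choices() renormalizes.
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    tokens = list(probs)
    weights = [probs[t] ** (1.0 / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution.
probs = {"21": 0.6, "So": 0.3, "First": 0.1}
rng = random.Random(0)

print(greedy_pick(probs))                                # same token every call
print({sample_pick(probs, rng, 0.7) for _ in range(200)})  # several distinct tokens
```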

Related Resources

GitHub: concise-reasoning

Original Abstract

Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning