Self-Training Elicits Concise Reasoning in Large Language Models
TL;DR Highlight
Self-training unlocks the LLM's existing ability to reason concisely, cutting output tokens by ~30% on average while maintaining accuracy.
Who Should Read
ML engineers wanting to reduce inference costs for reasoning models. Especially teams running CoT-based math/logical reasoning tasks.
Core Mechanics
- LLMs already have the potential to produce short, accurate reasoning paths — much shorter correct paths exist within the output distribution for the same problem
- 'Be concise' zero-shot prompts are nearly ineffective on math-specialized models (Qwen2.5-Math, etc.), and when effective cause accuracy loss
- Best-of-N sampling (select shortest correct answer from N samples) collects concise training data; few-shot examples guide shorter reasoning before fine-tuning
- FS-BoN (Few-Shot + Best-of-N self-training) achieves avg 30% token reduction + maintained accuracy — 2.4x improvement vs existing fine-tuning baseline
- External data (GPT-4o CoT) fine-tuning reduces length but causes significant accuracy loss — self-training is far better at preserving accuracy
- Token savings automatically scale with problem difficulty — easy problems see 20-40% reduction, hard problems less
Evidence
- FS-GPT4o-BoN: avg 35.75% token reduction on GSM8K (relative length 64.25%), accuracy at 97% — only 3% loss vs baseline
- Actual wall-clock inference time reduced 15.38%-52.94% (H100 single GPU, vLLM)
- MMLU-Pro non-math domains (business/chemistry/physics): avg 26.82% length reduction + 16.51% accuracy improvement
- Naive BoN vs few-shot conditioning: few-shot sampling with N=8 gives results similar to naive BoN with N=256, making few-shot conditioning far more sample-efficient
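As a sanity check on the numbers above, relative length and token reduction are complements: a relative length of 64.25% corresponds to a 35.75% reduction. A minimal sketch (the baseline token count of 200 is a made-up illustration, not from the paper):

```python
# Relative length and token reduction are complementary quantities.
relative_length = 0.6425              # reported for FS-GPT4o-BoN on GSM8K
token_reduction = 1 - relative_length
print(f"{token_reduction:.2%}")       # 35.75%

# Illustrative cost impact (baseline of 200 output tokens is hypothetical)
baseline_tokens = 200
concise_tokens = baseline_tokens * relative_length
print(f"{concise_tokens:.1f}")        # 128.5
```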
How to Apply
- For target task training data: sample N=16 reasoning paths per question, select the shortest correct one as fine-tuning data (Naive BoN).
- Include 8 concise reasoning examples from GPT-4o or human experts as few-shot for sampling — produces shorter paths, makes the model automatically answer concisely even in zero-shot.
- To cover hard problems where few-shot fails to produce correct answers: combine few-shot samples + basic BoN samples and pick the shortest correct one.
Code Example
# FS-GPT4o-BoN pipeline overview (pseudo-code: `model`, `dataset`,
# `extract_answer`, and `default_prompt` are placeholders)

# 1. Prepare few-shot examples (concise reasoning generated with GPT-4o)
few_shot_examples = [
    {"question": "15 + 6 = ?", "solution": "15 + 6 = 21. The answer is 21."},
    # ... prepare 8 examples in total
]

# 2. Sample short reasoning paths under a few-shot prompt
def build_few_shot_prompt(question, examples):
    prompt = ""
    for ex in examples:
        prompt += f"User: {ex['question']}\nAssistant: {ex['solution']}\n\n"
    prompt += f"User: {question}\nAssistant:"
    return prompt

# 3. Best-of-N: select the shortest correct answer among N samples
def select_shortest_correct(samples, label):
    correct = [s for s in samples if extract_answer(s) == label]
    if not correct:
        return None  # no sample reached the correct answer
    return min(correct, key=lambda s: len(s.split()))

# 4. Collect the shortest correct reasoning path for each question
training_data = []
for question, label in dataset:
    # Few-shot conditioned sampling (shorter paths on average)
    fs_samples = model.generate(
        build_few_shot_prompt(question, few_shot_examples),
        n=16, temperature=0.7,
    )
    # Plain BoN sampling to cover questions where few-shot sampling fails
    bon_samples = model.generate(default_prompt(question), n=16, temperature=0.7)
    best = select_shortest_correct(fs_samples + bon_samples, label)
    if best is not None:
        training_data.append({"input": question, "output": best})

# 5. Standard SFT fine-tuning on the collected short reasoning paths
model.finetune(training_data, epochs=1, lr=1e-5)
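The pipeline assumes an `extract_answer` helper whose form depends on the dataset's answer format. A minimal sketch (my own, not from the paper) for GSM8K-style solutions that end with "The answer is N.":

```python
import re

def extract_answer(solution: str):
    """Pull the final numeric answer from a GSM8K-style solution string.

    Returns the answer as a string with commas stripped, or None if no
    'The answer is ...' pattern is found.
    """
    match = re.search(r"The answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")

# Matches the few-shot solution format used above
print(extract_answer("15 + 6 = 21. The answer is 21."))  # 21
```

For MATH-style boxed answers (`\boxed{...}`) or multiple choice, this parser would need to be swapped out; what matters for BoN is only that correctness is checked the same way at sampling time and at evaluation time.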
Original Abstract
Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning