Self-Training Elicits Concise Reasoning in Large Language Models
TL;DR Highlight
Self-training unlocks the LLM's existing ability to reason concisely, cutting output tokens by ~30% on average while maintaining accuracy.
Who Should Read
ML engineers wanting to reduce inference costs for reasoning models. Especially teams running CoT-based math/logical reasoning tasks.
Core Mechanics
- LLMs already have the potential to produce short, accurate reasoning paths — much shorter correct paths exist within the output distribution for the same problem
- 'Be concise' zero-shot prompts are nearly ineffective on math-specialized models (Qwen2.5-Math, etc.), and when effective cause accuracy loss
- Best-of-N sampling (select shortest correct answer from N samples) collects concise training data; few-shot examples guide shorter reasoning before fine-tuning
- FS-BoN (Few-Shot + Best-of-N self-training) achieves avg 30% token reduction + maintained accuracy — 2.4x improvement vs existing fine-tuning baseline
- External data (GPT-4o CoT) fine-tuning reduces length but causes significant accuracy loss — self-training is far better at preserving accuracy
- Token savings automatically scale with problem difficulty — easy problems see 20-40% reduction, hard problems less
Evidence
- FS-GPT4o-BoN: avg 35.75% token reduction on GSM8K (relative length 64.25%), accuracy at 97% — only 3% loss vs baseline
- Actual wall-clock inference time reduced 15.38%-52.94% (H100 single GPU, vLLM)
- MMLU-Pro non-math domains (business/chemistry/physics): avg 26.82% length reduction + 16.51% accuracy improvement
- Naive BoN vs few-shot conditioning: few-shot sampling with N=8 gives results similar to naive BoN with N=256, making few-shot conditioning far more sample-efficient
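As a sanity check on the numbers above, relative length and token reduction are complements: a relative length of 64.25% corresponds to a 35.75% reduction. A minimal sketch (the baseline token count of 200 is a made-up illustration, not from the paper):

```python
# Relative length and token reduction are complementary quantities.
relative_length = 0.6425              # reported for FS-GPT4o-BoN on GSM8K
token_reduction = 1 - relative_length
print(f"{token_reduction:.2%}")       # 35.75%

# Illustrative cost impact (baseline of 200 output tokens is hypothetical)
baseline_tokens = 200
concise_tokens = baseline_tokens * relative_length
print(f"{concise_tokens:.1f}")        # 128.5
```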
How to Apply
- For target task training data: sample N=16 reasoning paths per question, select the shortest correct one as fine-tuning data (Naive BoN).
- Include 8 concise reasoning examples from GPT-4o or human experts as few-shot for sampling — produces shorter paths, makes the model automatically answer concisely even in zero-shot.
- To cover hard problems where few-shot fails to produce correct answers: combine few-shot samples + basic BoN samples and pick the shortest correct one.
Code Example
# FS-GPT4o-BoN pipeline overview (pseudo-code: `model`, `dataset`,
# `extract_answer`, and `default_prompt` are placeholders)

# 1. Prepare few-shot examples (concise reasoning generated with GPT-4o)
few_shot_examples = [
    {"question": "15 + 6 = ?", "solution": "15 + 6 = 21. The answer is 21."},
    # ... prepare 8 examples in total
]

# 2. Sample short reasoning paths under a few-shot prompt
def build_few_shot_prompt(question, examples):
    prompt = ""
    for ex in examples:
        prompt += f"User: {ex['question']}\nAssistant: {ex['solution']}\n\n"
    prompt += f"User: {question}\nAssistant:"
    return prompt

# 3. Best-of-N: select the shortest correct answer among N samples
def select_shortest_correct(samples, label):
    correct = [s for s in samples if extract_answer(s) == label]
    if not correct:
        return None  # no sample reached the correct answer
    return min(correct, key=lambda s: len(s.split()))

# 4. Collect the shortest correct reasoning path for each question
training_data = []
for question, label in dataset:
    # Few-shot conditioned sampling (shorter paths on average)
    fs_samples = model.generate(
        build_few_shot_prompt(question, few_shot_examples),
        n=16, temperature=0.7,
    )
    # Plain BoN sampling to cover questions where few-shot sampling fails
    bon_samples = model.generate(default_prompt(question), n=16, temperature=0.7)
    best = select_shortest_correct(fs_samples + bon_samples, label)
    if best is not None:
        training_data.append({"input": question, "output": best})

# 5. Standard SFT fine-tuning on the collected short reasoning paths
model.finetune(training_data, epochs=1, lr=1e-5)
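The pipeline assumes an `extract_answer` helper whose form depends on the dataset's answer format. A minimal sketch (my own, not from the paper) for GSM8K-style solutions that end with "The answer is N.":

```python
import re

def extract_answer(solution: str):
    """Pull the final numeric answer from a GSM8K-style solution string.

    Returns the answer as a string with commas stripped, or None if no
    'The answer is ...' pattern is found.
    """
    match = re.search(r"The answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")

# Matches the few-shot solution format used above
print(extract_answer("15 + 6 = 21. The answer is 21."))  # 21
```

For MATH-style boxed answers (`\boxed{...}`) or multiple choice, this parser would need to be swapped out; what matters for BoN is only that correctness is checked the same way at sampling time and at evaluation time.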
Original Abstract
Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning