Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
TL;DR Highlight
A comprehensive survey of methods to solve the 'overthinking problem' — where reasoning models like DeepSeek-R1 and OpenAI o1 generate unnecessarily long chains of thought.
Who Should Read
ML engineers and backend developers looking to optimize LLM inference costs, especially teams deploying reasoning models like OpenAI o1 and DeepSeek-R1 in production.
Core Mechanics
- Reasoning models like DeepSeek-R1 and QwQ-32B exhibit 'overthinking' — generating over 600 words of reasoning for a question as simple as 'which is larger, 0.9 or 0.11?'. With OpenAI o1 priced at $60 per 1M generated tokens, every redundant reasoning token translates directly into cost.
- Three solution directions: (1) train the model itself to reason more concisely via RL/SFT, (2) dynamically reduce reasoning output at inference time, (3) control length via input prompts.
- Adding a Length Reward to RL (reinforcement learning) can shorten CoT length while maintaining accuracy. Kimi k1.5, O1-Pruner, and L1 all use this approach.
- Fine-tuning with short CoT data via SFT is also effective — including C3oT, which uses GPT-4 as a compressor to shorten long reasoning chains before training, and TokenSkip, which skips tokens based on semantic importance.
- Prompting alone can be effective: instructions like 'Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most' (Chain of Draft) significantly reduce token usage.
- Data efficiency is also a key trend — LIMO outperforms models trained on 100,000 samples using only 817 high-quality examples, and s1-32B surpasses OpenAI o1-preview with just 1,000 samples.
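The TokenSkip idea above can be sketched in a few lines: score each CoT token for importance, then keep only the top fraction in their original order. The scores below are a toy stand-in (the actual method derives importance with an LLM), and `compress_cot` is a hypothetical helper name, not the paper's API:

```python
# Toy sketch of TokenSkip-style CoT compression (hypothetical scoring).
def compress_cot(tokens: list[str], importance: list[float], ratio: float) -> list[str]:
    """Keep the top `ratio` fraction of tokens by importance, preserving order."""
    keep_n = max(1, int(len(tokens) * ratio))
    # indices of the keep_n most important tokens
    top = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)[:keep_n]
    return [tokens[i] for i in sorted(top)]

cot = "so the answer must be nine point nine minus zero".split()
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.9, 0.7, 0.9, 0.2, 0.4]
print(compress_cot(cot, scores, ratio=0.5))
# → ['answer', 'must', 'nine', 'point', 'nine']
```

In practice the compressed chains are then used as SFT targets, so the model learns to emit the shorter reasoning directly.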
Evidence
- Applying strategies to reduce the overthinking score yielded a 30% performance improvement while simultaneously cutting computational costs by 43%.
- A 1B parameter model can outperform a 405B model on MATH-500 with the right Test-Time Scaling strategy.
- LIMO: achieves results surpassing models trained on 100,000+ samples using only 817 samples.
- s1-32B: exceeds OpenAI o1-preview on MATH and AIME24 via SFT on 1,000 samples combined with budget forcing.
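The budget forcing used by s1-32B intervenes at decode time: once a thinking-token budget is exhausted, the end-of-thinking delimiter is forced so the model must produce an answer. A minimal sketch with a stub token stream — `budget_force` and the `</think>` delimiter here are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of s1-style budget forcing over a token stream (illustrative only).
END_OF_THINKING = "</think>"

def budget_force(token_stream, max_thinking_tokens: int) -> list[str]:
    """Pass tokens through until the budget is hit, then force the
    end-of-thinking delimiter so generation moves on to the answer."""
    out = []
    for i, tok in enumerate(token_stream):
        if tok == END_OF_THINKING:
            out.append(tok)  # model stopped thinking on its own
            break
        if i >= max_thinking_tokens:
            out.append(END_OF_THINKING)  # budget exhausted: force the stop
            break
        out.append(tok)
    return out

stream = iter(["step1", "step2", "step3", "step4", END_OF_THINKING])
print(budget_force(stream, max_thinking_tokens=2))
# → ['step1', 'step2', '</think>']
```

The same hook can extend thinking instead: appending a token like 'Wait' when the model stops early forces additional reasoning, which is how s1 scales test-time compute upward.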
How to Apply
- Immediately applicable with a single prompt line: add 'Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer after ####' to your system prompt to instantly reduce token usage.
- Implement query difficulty-based routing: apply the RouteLLM pattern — sending simple questions to a fast general model and routing only complex questions to reasoning models like DeepSeek-R1/o1. This is effective for optimizing cost-performance tradeoffs.
- Add a Length Reward when fine-tuning reasoning models: in your GRPO training loop, combine a Length Reward — which gives higher rewards for correct answers that are shorter — with the existing Accuracy Reward to reduce reasoning length while maintaining performance.
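The combined reward in the last bullet might look like the following sketch — a simplified shaping rule, not any specific paper's formula: correct answers earn an accuracy reward plus a bonus that grows as the response falls below a reference length, while incorrect answers earn nothing, so brevity alone is never rewarded:

```python
# Sketch of an accuracy + length reward for a GRPO-style loop (assumed weighting).
def combined_reward(is_correct: bool, length: int, ref_length: int,
                    alpha: float = 0.5) -> float:
    """Accuracy reward plus a length bonus for correct, shorter-than-reference
    answers. `alpha` trades off accuracy vs. brevity (hypothetical weighting)."""
    if not is_correct:
        return 0.0  # never pay the model for being merely brief
    # bonus in [0, alpha]: larger for shorter responses, zero beyond ref_length
    length_bonus = alpha * max(0.0, 1.0 - length / ref_length)
    return 1.0 + length_bonus

print(combined_reward(True, 250, 1000))   # → 1.375 (short and correct)
print(combined_reward(True, 750, 1000))   # → 1.125 (longer, smaller bonus)
print(combined_reward(False, 100, 1000))  # → 0.0   (wrong: no bonus at all)
```

Gating the bonus on correctness is the key design choice: it keeps the optimizer from collapsing reasoning length at the expense of accuracy.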
Code Example
# Chain of Draft prompt - immediately applicable
system_prompt = """Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end after a separator ####."""
# Token-Budget-Aware prompt
def make_budget_aware_prompt(question: str, budget: int) -> str:
    return f"""Please answer the following question. Let's think step by step and use less than {budget} tokens.
Question: {question}"""
# Length-constrained prompt variants (based on Token Complexity paper)
def make_constrained_prompt(question: str, constraint_type: str, k: int | None = None) -> str:
    constraints = {
        "word_limit": f"use at most {k} words",
        "step_limit": f"use at most {k} steps",
        "token_limit": f"use at most {k} tokens",
        "concise": "Be concise.",  # CCoT approach
        "bullet": "only use bullet points",
    }
    constraint = constraints.get(constraint_type, "Be concise.")
    return f"{question}\n\nConstraint: {constraint}"
# Difficulty-based routing example
def route_query(question: str, confidence_threshold: float = 0.8):
    """
    Simple questions -> fast model
    Complex questions -> reasoning model
    """
    # Step 1: try with a fast model first
    # (call_fast_model / estimate_confidence / call_reasoning_model are
    # placeholders for your own API wrappers)
    quick_response = call_fast_model(question)  # e.g., GPT-4o-mini
    confidence = estimate_confidence(quick_response)
    if confidence >= confidence_threshold:
        return quick_response  # return directly if confident
    else:
        # escalate to reasoning model if uncertain
        return call_reasoning_model(question)  # e.g., DeepSeek-R1
Original Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking. Project website: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs