Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
TL;DR Highlight
A comprehensive survey of methods to solve the 'overthinking problem' — where reasoning models like DeepSeek-R1 and OpenAI o1 generate unnecessarily long chains of thought.
Who Should Read
ML engineers and backend developers looking to optimize LLM inference costs, especially teams deploying reasoning models like OpenAI o1 and DeepSeek-R1 in production.
Core Mechanics
- Reasoning models like DeepSeek-R1 and QwQ-32B exhibit 'overthinking' — generating over 600 words of reasoning for a question as simple as 'which is larger, 0.9 or 0.11?'. With OpenAI o1 priced at $60 per 1M generated tokens, every redundant reasoning token translates directly into cost.
- Three solution directions: (1) train the model itself to reason more concisely via RL/SFT, (2) dynamically reduce reasoning output at inference time, (3) control length via input prompts.
- Adding a Length Reward to RL (reinforcement learning) can shorten CoT length while maintaining accuracy. Kimi k1.5, O1-Pruner, and L1 all use this approach.
- Fine-tuning with short CoT data via SFT is also effective — including C3oT, which uses GPT-4 as a compressor to shorten long reasoning chains before training, and TokenSkip, which skips tokens based on semantic importance.
- Prompting alone can be effective: instructions like 'Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most' (Chain of Draft) significantly reduce token usage.
- Data efficiency is also a key trend — LIMO outperforms models trained on 100,000 samples using only 817 high-quality examples, and s1-32B surpasses OpenAI o1-preview with just 1,000 samples.
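The TokenSkip idea above can be sketched in a few lines: score each CoT token for importance, then keep only the top fraction in their original order. The scores below are a toy stand-in (the actual method derives importance with an LLM), and `compress_cot` is a hypothetical helper name, not the paper's API:

```python
# Toy sketch of TokenSkip-style CoT compression (hypothetical scoring).
def compress_cot(tokens: list[str], importance: list[float], ratio: float) -> list[str]:
    """Keep the top `ratio` fraction of tokens by importance, preserving order."""
    keep_n = max(1, int(len(tokens) * ratio))
    # indices of the keep_n most important tokens
    top = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)[:keep_n]
    return [tokens[i] for i in sorted(top)]

cot = "so the answer must be nine point nine minus zero".split()
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.9, 0.7, 0.9, 0.2, 0.4]
print(compress_cot(cot, scores, ratio=0.5))
# → ['answer', 'must', 'nine', 'point', 'nine']
```

In practice the compressed chains are then used as SFT targets, so the model learns to emit the shorter reasoning directly.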
Evidence
- Applying strategies to reduce the overthinking score yielded a 30% performance improvement while simultaneously cutting computational costs by 43%.
- A 1B parameter model can outperform a 405B model on MATH-500 with the right Test-Time Scaling strategy.
- LIMO: achieves results surpassing models trained on 100,000+ samples using only 817 samples.
- s1-32B: exceeds OpenAI o1-preview on MATH and AIME24 via SFT on 1,000 samples combined with budget forcing.
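The budget forcing used by s1-32B intervenes at decode time: once a thinking-token budget is exhausted, the end-of-thinking delimiter is forced so the model must produce an answer. A minimal sketch with a stub token stream — `budget_force` and the `</think>` delimiter here are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of s1-style budget forcing over a token stream (illustrative only).
END_OF_THINKING = "</think>"

def budget_force(token_stream, max_thinking_tokens: int) -> list[str]:
    """Pass tokens through until the budget is hit, then force the
    end-of-thinking delimiter so generation moves on to the answer."""
    out = []
    for i, tok in enumerate(token_stream):
        if tok == END_OF_THINKING:
            out.append(tok)  # model stopped thinking on its own
            break
        if i >= max_thinking_tokens:
            out.append(END_OF_THINKING)  # budget exhausted: force the stop
            break
        out.append(tok)
    return out

stream = iter(["step1", "step2", "step3", "step4", END_OF_THINKING])
print(budget_force(stream, max_thinking_tokens=2))
# → ['step1', 'step2', '</think>']
```

The same hook can extend thinking instead: appending a token like 'Wait' when the model stops early forces additional reasoning, which is how s1 scales test-time compute upward.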
How to Apply
- Immediately applicable with a single prompt line: add 'Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer after ####' to your system prompt to instantly reduce token usage.
- Implement query difficulty-based routing: apply the RouteLLM pattern — sending simple questions to a fast general model and routing only complex questions to reasoning models like DeepSeek-R1/o1. This is effective for optimizing cost-performance tradeoffs.
- Add a Length Reward when fine-tuning reasoning models: in your GRPO training loop, combine a Length Reward — which gives higher rewards for correct answers that are shorter — with the existing Accuracy Reward to reduce reasoning length while maintaining performance.
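The combined reward in the last bullet might look like the following sketch — a simplified shaping rule, not any specific paper's formula: correct answers earn an accuracy reward plus a bonus that grows as the response falls below a reference length, while incorrect answers earn nothing, so brevity alone is never rewarded:

```python
# Sketch of an accuracy + length reward for a GRPO-style loop (assumed weighting).
def combined_reward(is_correct: bool, length: int, ref_length: int,
                    alpha: float = 0.5) -> float:
    """Accuracy reward plus a length bonus for correct, shorter-than-reference
    answers. `alpha` trades off accuracy vs. brevity (hypothetical weighting)."""
    if not is_correct:
        return 0.0  # never pay the model for being merely brief
    # bonus in [0, alpha]: larger for shorter responses, zero beyond ref_length
    length_bonus = alpha * max(0.0, 1.0 - length / ref_length)
    return 1.0 + length_bonus

print(combined_reward(True, 250, 1000))   # → 1.375 (short and correct)
print(combined_reward(True, 750, 1000))   # → 1.125 (longer, smaller bonus)
print(combined_reward(False, 100, 1000))  # → 0.0   (wrong: no bonus at all)
```

Gating the bonus on correctness is the key design choice: it keeps the optimizer from collapsing reasoning length at the expense of accuracy.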
Code Example
# Chain of Draft prompt - immediately applicable
system_prompt = """Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end after a separator ####."""
# Token-Budget-Aware prompt
def make_budget_aware_prompt(question: str, budget: int) -> str:
    return f"""Please answer the following question. Let's think step by step and use less than {budget} tokens.
Question: {question}"""
# Length-constrained prompt variants (based on Token Complexity paper)
def make_constrained_prompt(question: str, constraint_type: str, k: int | None = None) -> str:
    constraints = {
        "word_limit": f"use at most {k} words",
        "step_limit": f"use at most {k} steps",
        "token_limit": f"use at most {k} tokens",
        "concise": "Be concise.",  # CCoT approach
        "bullet": "only use bullet points",
    }
    constraint = constraints.get(constraint_type, "Be concise.")
    return f"{question}\n\nConstraint: {constraint}"
# Difficulty-based routing example
def route_query(question: str, confidence_threshold: float = 0.8):
    """
    Simple questions -> fast model
    Complex questions -> reasoning model
    """
    # Step 1: try with a fast model first
    # (call_fast_model / estimate_confidence / call_reasoning_model are
    # placeholders for your own API wrappers)
    quick_response = call_fast_model(question)  # e.g., GPT-4o-mini
    confidence = estimate_confidence(quick_response)
    if confidence >= confidence_threshold:
        return quick_response  # return directly if confident
    else:
        # escalate to reasoning model if uncertain
        return call_reasoning_model(question)  # e.g., DeepSeek-R1
Original Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking. Project website: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs