Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Mar 20, 2025 • Yang Sui, Yu-Neng Chuang, Guanchu Wang +8

TL;DR Highlight

A survey that catalogs methods for addressing the "overthinking problem," in which reasoning models such as DeepSeek-R1 and OpenAI o1 produce needlessly long reasoning.

Who Should Read

ML engineers and backend developers looking for ways to cut high LLM inference costs, especially teams planning to put reasoning models such as OpenAI o1 or DeepSeek-R1 into production.

Core Mechanics

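Reasoning models such as DeepSeek-R1 and QwQ-32B exhibit "overthinking": asked a simple question like "Which is larger, 0.9 or 0.11?", they spend more than 600 words reasoning. This translates directly into cost, since OpenAI o1 charges $60 per 1M generated tokens.
Three directions for a fix: (1) train the model itself to reason concisely (RL/SFT), (2) shorten the reasoning output dynamically at inference time, (3) control length through the input prompt.
Adding a length reward to RL (reinforcement learning) can shorten CoT while preserving accuracy. Kimi k1.5, O1-Pruner, and L1 take this approach.
SFT on collected short-CoT data is also effective. Examples include C3oT, which uses GPT-4 as a compressor to shorten long reasoning before training on it, and TokenSkip, which skips tokens based on semantic importance.
Prompting alone also helps: an instruction such as 'Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most' (Chain of Draft) substantially reduces token usage.
Data efficiency is another key trend: LIMO beats models trained on 100,000 examples using only 817 high-quality samples, and s1-32B surpasses OpenAI o1-preview with 1,000 samples.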
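To make the cost point concrete, here is a back-of-the-envelope sketch. It uses the $60 per 1M generated tokens figure quoted above; the two token counts are illustrative assumptions, not measurements from the survey.

```python
def output_cost_usd(num_tokens: int, usd_per_million: float = 60.0) -> float:
    """Cost of generating num_tokens at a flat per-million-token price."""
    return num_tokens * usd_per_million / 1_000_000


# Hypothetical token counts for a single query (assumptions for illustration).
verbose_tokens = 5_000  # long chain of thought
concise_tokens = 500    # shortened chain of thought

per_query_saving = output_cost_usd(verbose_tokens) - output_cost_usd(concise_tokens)
print(f"saving per query: ${per_query_saving:.2f}")
print(f"saving per 1M queries: ${per_query_saving * 1_000_000:,.0f}")
```

The per-query difference looks small, but it scales linearly with traffic, which is why reasoning length matters in production.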

Evidence

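Applying a strategy that lowers the overthinking score improved performance by 30% while cutting compute cost by 43%.
With a suitable test-time scaling strategy, a 1B-parameter model can outperform a 405B model on MATH-500.
LIMO: with only 817 samples, outperforms prior models trained on more than 100,000 examples.
s1-32B: SFT on 1,000 samples plus budget forcing exceeds OpenAI o1-preview on MATH and AIME24.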
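Budget forcing, mentioned in the s1-32B result above, is not spelled out in this summary. As commonly described, it caps thinking by forcing the end-of-thinking delimiter once a token budget is used up, and extends thinking by suppressing an early stop and appending a word like "Wait". The sketch below is a minimal illustration under those assumptions: the `generate` and `count_tokens` callables, the `</think>` delimiter, and the " Wait," text are all hypothetical placeholders, not the s1 authors' code.

```python
from typing import Callable

# Hypothetical interface: generate(prompt, max_tokens) -> (text, finished).
# "finished" is True when the model tried to end its thinking on its own.
Generate = Callable[[str, int], tuple[str, bool]]

END_THINK = "</think>"  # assumed delimiter; the real one depends on the model


def budget_forced_thinking(
    prompt: str,
    generate: Generate,
    count_tokens: Callable[[str], int],
    max_think_tokens: int,
    min_think_tokens: int = 0,
    max_waits: int = 2,
) -> str:
    """Return a thinking trace nudged into the [min, max] token range."""
    trace = ""
    waits = 0
    while True:
        remaining = max_think_tokens - count_tokens(trace)
        if remaining <= 0:
            break  # budget exhausted: cut thinking off here
        text, finished = generate(prompt + trace, remaining)
        trace += text
        if not finished:
            break  # hit the token cap mid-thought
        if count_tokens(trace) >= min_think_tokens or waits >= max_waits:
            break  # model stopped by itself and has thought long enough
        trace += " Wait,"  # suppress the early stop and ask for more thinking
        waits += 1
    return trace + END_THINK  # always close the thinking block explicitly
```

Setting only `max_think_tokens` gives a hard cap on reasoning length; raising `min_think_tokens` trades extra tokens for more deliberate reasoning on hard problems.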

How to Apply

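Apply it immediately with a one-line prompt change: adding 'Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end after a separator ####.' to the system prompt reduces token usage right away.
Implement routing by query difficulty: following the RouteLLM pattern, send simple questions to a fast general-purpose model and only complex questions to a reasoning model such as DeepSeek-R1 or o1. This is effective for optimizing performance per unit cost.
Add a length reward when fine-tuning a reasoning model: in the GRPO training loop, combine the existing accuracy reward with a length reward that pays more for answers that are both correct and short. This can reduce reasoning length while maintaining performance.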
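The length-reward idea in the last item can be sketched as a single scalar reward function. This is an illustrative linear form, not the exact formula used by Kimi k1.5, O1-Pruner, or L1; the function name, the `alpha` weight, and the `max_tokens` normalizer are assumptions.

```python
def length_aware_reward(
    is_correct: bool,
    num_tokens: int,
    max_tokens: int,
    alpha: float = 0.5,
) -> float:
    """Accuracy reward plus a bonus that grows as a correct answer gets shorter."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    accuracy_reward = 1.0 if is_correct else 0.0
    if not is_correct:
        return accuracy_reward  # no length bonus for wrong answers
    # Fraction of the token budget left unused, clamped to [0, 1].
    shortness = 1.0 - min(num_tokens, max_tokens) / max_tokens
    return accuracy_reward + alpha * shortness
```

Giving the bonus only to correct answers keeps the policy from learning to be short and wrong; `alpha` controls how strongly brevity is traded against accuracy.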

Code Example


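from typing import Callable, Optional

# Chain of Draft prompt: usable as-is
system_prompt = """Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end after a separator ####."""

# Token-budget-aware prompt
def make_budget_aware_prompt(question: str, budget: int) -> str:
    return f"""Please answer the following question. Let's think step by step and use less than {budget} tokens.

Question: {question}"""

# Length-control prompt variants (based on the Token Complexity paper)
def make_constrained_prompt(question: str, constraint_type: str, k: Optional[int] = None) -> str:
    # The first three templates need a numeric limit k.
    templates = {
        "word_limit": "use at most {k} words",
        "step_limit": "use at most {k} steps",
        "token_limit": "use at most {k} tokens",
        "concise": "Be concise.",  # CCoT style
        "bullet": "only use bullet points",
    }
    # Unknown constraint types fall back to "Be concise."
    template = templates.get(constraint_type, "Be concise.")
    if "{k}" in template and k is None:
        raise ValueError(f"constraint_type={constraint_type!r} requires k")
    return f"{question}\n\nConstraint: {template.format(k=k)}"

# Difficulty-based routing example
def route_query(
    question: str,
    call_fast_model: Callable[[str], str],
    call_reasoning_model: Callable[[str], str],
    estimate_confidence: Callable[[str], float],
    confidence_threshold: float = 0.8,
) -> str:
    """
    Simple questions -> fast model
    Complex questions -> reasoning model

    The three callables are placeholders for your own model clients and
    confidence estimator; they are not part of any specific library.
    """
    # Step 1: try the fast model first
    quick_response = call_fast_model(question)  # e.g., GPT-4o-mini
    confidence = estimate_confidence(quick_response)

    if confidence >= confidence_threshold:
        return quick_response  # confident enough: return as-is
    # Not confident: escalate to the reasoning model
    return call_reasoning_model(question)  # e.g., DeepSeek-R1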

Terminology

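CoT (Chain-of-Thought): A method in which the model writes out its intermediate reasoning step by step before giving the final answer. It is like showing your work when solving a math problem.

Overthinking: The phenomenon where a model generates thousands of tokens of unnecessary reasoning even for a simple question. In human terms, it is like starting from the foundations of mathematics to compute 1+1.

RL (Reinforcement Learning): A method in which the model learns by trial and error, guided by a reward signal. It is the same principle as a game AI improving by receiving its score as a reward.

SFT (Supervised Fine-Tuning): A method that trains the model to imitate well-crafted example data. It is similar to learning by following a model answer.

GRPO: The RL optimization algorithm used by DeepSeek. A variant of PPO that groups multiple samples together and learns by comparing their relative quality.

LoRA: A fine-tuning technique that trains small low-rank adapter matrices added to the model instead of updating all of its weights. It can be effective while training less than 1% of the total parameters.

Test-Time Scaling: A method that leaves training unchanged and improves performance by spending more computation at inference time. It is similar to thinking longer during an exam.

Process Reward Model (PRM): A reward model that evaluates each intermediate reasoning step, not just the final answer. It plays the role of a teacher who grades the solution process rather than only the result.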
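The "relative quality within a group" idea in the GRPO entry can be illustrated with a few lines of code. This is a minimal sketch of the group-normalized advantage as it is usually described; the function name and the choice of population standard deviation are assumptions, and real implementations differ in details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each sampled answer relative to the other answers for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (an assumption)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (1.0) and two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because each reward is compared only against its own group, no separate value network is needed to provide a baseline.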

Related Resources

Awesome-Efficient-Reasoning-LLMs (GitHub)

Original Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking. Project website: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs