Think Only When You Need with Large Hybrid-Reasoning Models
TL;DR Highlight
Teaching LLMs to answer easy questions directly and activate Chain-of-Thought only for hard ones — 'hybrid reasoning' trained with RL.
Who Should Read
ML engineers wanting to deploy reasoning models like DeepSeek-R1 or o1 in production but worried about token costs and latency. Backend developers looking to cut LLM serving costs while maintaining accuracy on complex math/code problems.
Core Mechanics
- Existing reasoning models (DeepSeek-R1 etc.) waste thousands of tokens with <think> tags even for simple questions like 'hello' — the 'overthinking' problem
- LHRMs auto-select between <think> (deep reasoning) and <no_think> (direct answer) mode based on query context — no human annotation needed
- 2-stage training: (1) Hybrid Fine-Tuning with mixed thinking/no-thinking examples, (2) HGPO (Hybrid Group Policy Optimization) with RL to refine mode selection
- Hybrid Accuracy (mode-selection) reaches 71.9%, a 93.8% relative gain over the HFT-DPO baseline, while maintaining or improving task performance
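The two output formats above can be made concrete with a small sketch of how hybrid SFT samples might be constructed. The function name, field names, and exact tag layout are assumptions for illustration, not the paper's released data format:

```python
# Sketch: wrapping training examples in the two LHRMs response modes.
# Field names and tag layout are assumptions, not the paper's exact format.
from typing import Optional

def format_hybrid_sample(query: str, answer: str,
                         reasoning: Optional[str] = None) -> dict:
    """Wrap an example in <think> or <no_think> tags depending on whether
    a reasoning trace is available (hard query) or not (easy query)."""
    if reasoning is not None:
        # Hard query: chain-of-thought inside <think>, then the answer.
        response = f"<think>{reasoning}</think>{answer}"
    else:
        # Easy query: direct answer wrapped in <no_think>.
        response = f"<no_think>{answer}</no_think>"
    return {"prompt": query, "response": response}

easy = format_hybrid_sample("What's the capital of France?", "Paris.")
hard = format_hybrid_sample(
    "Compute 17 * 23.", "391",
    reasoning="17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.",
)
print(easy["response"])  # <no_think>Paris.</no_think>
print(hard["response"])  # <think>17 * 23 = ...</think>391
```

Mixing many such pairs of both kinds is the essence of the Hybrid Fine-Tuning cold start described below.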
Evidence
- LHRMs-7B scores 66.7 on AIME24, up from HFT-DPO-7B's 58.7 (a 13.6% relative improvement) and DeepSeek-R1-Distill-Qwen-7B's 55.5 (20.2% relative)
- Hybrid Accuracy (HAcc): LHRMs-7B 71.9% vs HFT-DPO-7B 37.1%, a 93.8% relative improvement in mode selection
- General capability benchmarks maintained or improved alongside reasoning improvements
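To make the HAcc numbers above concrete, a minimal sketch of the metric as a mode-match rate follows. This is a simplification: the paper derives the reference mode from reward comparisons rather than hand labels, so treat `reference_modes` as an assumption of this illustration:

```python
# Sketch: Hybrid Accuracy (HAcc) as the fraction of queries whose chosen
# mode matches a reference mode. Simplified: the paper's formal definition
# derives the reference mode from rewards, not hand-assigned labels.

def hybrid_accuracy(chosen_modes, reference_modes):
    assert len(chosen_modes) == len(reference_modes)
    matches = sum(c == r for c, r in zip(chosen_modes, reference_modes))
    return matches / len(chosen_modes)

chosen    = ['think', 'no_think', 'no_think', 'think']
reference = ['think', 'no_think', 'think',    'think']
print(hybrid_accuracy(chosen, reference))  # 0.75
```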
How to Apply
- For custom fine-tuning: mix SFT data with thinking examples (<think> tag) and no-thinking examples (<no_think> tag) — label simple QA as no_think, math/code as think. The paper used 1.7M samples.
- With RL training budget: apply GRPO-based HGPO — a larger margin δ biases the policy toward No-Think mode on easy questions, reducing token waste.
- For inference-only optimization: add routing logic that classifies query difficulty and selects think/no_think mode before generation.
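The role of the margin δ in HGPO can be sketched as an inter-group preference rule: sample rollouts in both modes, score them, and prefer No-Think unless thinking's mean reward beats it by more than δ. The reward values and exact assignment rule here are simplified assumptions; the paper's HGPO assigns per-response advantages on top of this group-level signal:

```python
# Sketch: HGPO-style inter-group mode preference with margin delta.
# Simplified illustration of how delta biases mode selection; the paper's
# actual algorithm computes per-response GRPO-style advantages as well.
from statistics import mean

def preferred_mode(think_rewards, no_think_rewards, delta=0.1):
    """Return the mode to reinforce for one query's group of rollouts."""
    if mean(think_rewards) > mean(no_think_rewards) + delta:
        return 'think'  # thinking helps enough to justify the token cost
    return 'no_think'   # otherwise save tokens with a direct answer

# Easy query: thinking barely helps, so No-Think wins under the margin.
print(preferred_mode([0.9, 0.8], [0.85, 0.8], delta=0.1))   # no_think
# Hard query: thinking helps a lot, so Think wins despite the margin.
print(preferred_mode([0.9, 0.95], [0.4, 0.5], delta=0.1))   # think
```

Raising `delta` makes the no-think branch win more ties, which is exactly the token-saving pressure described above.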
Code Example
# Select mode after classifying query complexity (implementing the paper's idea at the prompt level)

def classify_query(query: str) -> str:
    """Rule-based simple/complex classification (stand-in for the paper's FastText classifier)."""
    complex_keywords = ['prove', 'solve', 'calculate', 'code', 'implement', 'debug', 'explain why']
    if any(kw in query.lower() for kw in complex_keywords) or len(query) > 100:
        return 'think'
    return 'no_think'

def build_prompt(query: str) -> str:
    mode = classify_query(query)
    if mode == 'think':
        # LRM style: encourage explicit reasoning
        return f"{query}\n\nLet's think step by step."
    else:
        # Direct answer
        return f"{query}\n\nAnswer directly and concisely."

# Example of actual LHRMs-style tokens:
# - Complex query: <think>...reasoning process...</think> answer
# - Simple query:  <no_think>answer</no_think>

print(build_prompt("Can you help me please?"))  # → direct-answer mode
print(build_prompt("Solve: find largest |a|+|b|+|c| given |ax²+bx+c|≤1"))  # → think mode
Terminology
Related Resources
Original Abstract
Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is particularly unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform thinking based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model's capability for hybrid thinking. Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type. It outperforms existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended thinking processes and provides a solid starting point for building hybrid thinking systems.