Think Only When You Need with Large Hybrid-Reasoning Models
TL;DR Highlight
Teaching LLMs to answer easy questions directly and activate Chain-of-Thought only for hard ones — 'hybrid reasoning' trained with RL.
Who Should Read
ML engineers wanting to deploy reasoning models like DeepSeek-R1 or o1 in production but worried about token costs and latency. Backend developers looking to cut LLM serving costs while maintaining accuracy on complex math/code problems.
Core Mechanics
- Existing reasoning models (DeepSeek-R1 etc.) waste thousands of tokens with <think> tags even for simple questions like 'hello' — the 'overthinking' problem
- LHRMs auto-select between <think> (deep reasoning) and <no_think> (direct answer) mode based on query context — no human annotation needed
- 2-stage training: (1) Hybrid Fine-Tuning with mixed thinking/no-thinking examples, (2) HGPO (Hybrid Group Policy Optimization) with RL to refine mode selection
- Hybrid Accuracy (mode-selection) reaches 71.9%, a 93.8% relative gain over the HFT-DPO baseline, while maintaining or improving task performance
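The two output formats above can be made concrete with a small sketch of how hybrid SFT samples might be constructed. The function name, field names, and exact tag layout are assumptions for illustration, not the paper's released data format:

```python
# Sketch: wrapping training examples in the two LHRMs response modes.
# Field names and tag layout are assumptions, not the paper's exact format.
from typing import Optional

def format_hybrid_sample(query: str, answer: str,
                         reasoning: Optional[str] = None) -> dict:
    """Wrap an example in <think> or <no_think> tags depending on whether
    a reasoning trace is available (hard query) or not (easy query)."""
    if reasoning is not None:
        # Hard query: chain-of-thought inside <think>, then the answer.
        response = f"<think>{reasoning}</think>{answer}"
    else:
        # Easy query: direct answer wrapped in <no_think>.
        response = f"<no_think>{answer}</no_think>"
    return {"prompt": query, "response": response}

easy = format_hybrid_sample("What's the capital of France?", "Paris.")
hard = format_hybrid_sample(
    "Compute 17 * 23.", "391",
    reasoning="17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.",
)
print(easy["response"])  # <no_think>Paris.</no_think>
print(hard["response"])  # <think>17 * 23 = ...</think>391
```

Mixing many such pairs of both kinds is the essence of the Hybrid Fine-Tuning cold start described below.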
Evidence
- LHRMs-7B scores 66.7 on AIME24, up from HFT-DPO-7B's 58.7 (a 13.6% relative improvement) and DeepSeek-R1-Distill-Qwen-7B's 55.5 (20.2% relative)
- Hybrid Accuracy (HAcc): LHRMs-7B 71.9% vs HFT-DPO-7B 37.1%, a 93.8% relative improvement in mode selection
- General capability benchmarks maintained or improved alongside reasoning improvements
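To make the HAcc numbers above concrete, a minimal sketch of the metric as a mode-match rate follows. This is a simplification: the paper derives the reference mode from reward comparisons rather than hand labels, so treat `reference_modes` as an assumption of this illustration:

```python
# Sketch: Hybrid Accuracy (HAcc) as the fraction of queries whose chosen
# mode matches a reference mode. Simplified: the paper's formal definition
# derives the reference mode from rewards, not hand-assigned labels.

def hybrid_accuracy(chosen_modes, reference_modes):
    assert len(chosen_modes) == len(reference_modes)
    matches = sum(c == r for c, r in zip(chosen_modes, reference_modes))
    return matches / len(chosen_modes)

chosen    = ['think', 'no_think', 'no_think', 'think']
reference = ['think', 'no_think', 'think',    'think']
print(hybrid_accuracy(chosen, reference))  # 0.75
```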
How to Apply
- For custom fine-tuning: mix SFT data with thinking examples (<think> tag) and no-thinking examples (<no_think> tag) — label simple QA as no_think, math/code as think. The paper used 1.7M samples.
- With RL training budget: apply GRPO-based HGPO — a larger margin δ biases the policy toward No-Think mode on easy questions, reducing token waste.
- For inference-only optimization: add routing logic that classifies query difficulty and selects think/no_think mode before generation.
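The role of the margin δ in HGPO can be sketched as an inter-group preference rule: sample rollouts in both modes, score them, and prefer No-Think unless thinking's mean reward beats it by more than δ. The reward values and exact assignment rule here are simplified assumptions; the paper's HGPO assigns per-response advantages on top of this group-level signal:

```python
# Sketch: HGPO-style inter-group mode preference with margin delta.
# Simplified illustration of how delta biases mode selection; the paper's
# actual algorithm computes per-response GRPO-style advantages as well.
from statistics import mean

def preferred_mode(think_rewards, no_think_rewards, delta=0.1):
    """Return the mode to reinforce for one query's group of rollouts."""
    if mean(think_rewards) > mean(no_think_rewards) + delta:
        return 'think'  # thinking helps enough to justify the token cost
    return 'no_think'   # otherwise save tokens with a direct answer

# Easy query: thinking barely helps, so No-Think wins under the margin.
print(preferred_mode([0.9, 0.8], [0.85, 0.8], delta=0.1))   # no_think
# Hard query: thinking helps a lot, so Think wins despite the margin.
print(preferred_mode([0.9, 0.95], [0.4, 0.5], delta=0.1))   # think
```

Raising `delta` makes the no-think branch win more ties, which is exactly the token-saving pressure described above.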
Code Example
# Select mode after classifying query complexity (implementing the paper's idea at the prompt level)

def classify_query(query: str) -> str:
    """Rule-based simple/complex classification (stand-in for the paper's FastText classifier)."""
    complex_keywords = ['prove', 'solve', 'calculate', 'code', 'implement', 'debug', 'explain why']
    if any(kw in query.lower() for kw in complex_keywords) or len(query) > 100:
        return 'think'
    return 'no_think'

def build_prompt(query: str) -> str:
    mode = classify_query(query)
    if mode == 'think':
        # LRM style: encourage explicit reasoning
        return f"{query}\n\nLet's think step by step."
    else:
        # Direct answer
        return f"{query}\n\nAnswer directly and concisely."

# Example of actual LHRMs-style tokens:
# - Complex query: <think>...reasoning process...</think> answer
# - Simple query:  <no_think>answer</no_think>

print(build_prompt("Can you help me please?"))  # → direct-answer mode
print(build_prompt("Solve: find largest |a|+|b|+|c| given |ax²+bx+c|≤1"))  # → think mode
Terminology
Related Resources
Original Abstract
Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is particularly unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform thinking based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model's capability for hybrid thinking. Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type. It outperforms existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended thinking processes and provides a solid starting point for building hybrid thinking systems.