Think Only When You Need with Large Hybrid-Reasoning Models

May 20, 2025 • Lingjie Jiang, Xun Wu, Shaohan Huang +7

TL;DR Highlight

A method that uses RL to train LLMs for "hybrid reasoning": answering easy questions directly and switching on Chain-of-Thought only for hard ones.

Who Should Read

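ML engineers who want to put reasoning models such as DeepSeek-R1 or o1 into production but are held back by token cost and response latency. Also backend developers who want to cut LLM serving costs while keeping accuracy on complex math and coding problems.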

Core Mechanics

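Existing reasoning models (e.g., DeepSeek-R1) suffer from "overthinking": even a trivial query such as "Hello" triggers a <think> block that wastes thousands of tokens.
LHRMs look at the query context and automatically choose between <think> (reason in depth) and <no_think> (answer directly); the user does not have to specify the mode.
Two-stage training: (1) Hybrid Fine-Tuning (HFT) initializes the ability to work in both modes, then (2) Hybrid Group Policy Optimization (HGPO) uses reinforcement learning to teach the model when to think.
For each query, HGPO samples responses in two groups, Thinking and No-Thinking, and learns two signals at once: which mode is better (inter-group) and which response is better within the same mode (intra-group).
After RL training, the larger model (7B) uses No-Thinking more often, while the smaller model (1.5B) tends to use Thinking more, so the behavior calibrates itself to model capability.
RL on the math domain alone transfers the hybrid reasoning pattern to the code domain, which indicates out-of-domain generalization.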
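The two HGPO signals described above can be illustrated with a small sketch. This is not the paper's implementation: the 0/1 reward scheme, the tie-break toward no_think inside the margin, the default `delta=0.2`, and the function names are all assumptions made for illustration. The normalization helper follows the standard GRPO idea of centering and scaling rewards within a sampled group. How the paper combines the two signals into a final advantage is not covered in this summary, so the sketch stops at the per-signal rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: center and scale rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hgpo_rewards(think_scores, no_think_scores, delta=0.2):
    """Assign inter-group and intra-group rewards (assumed 0/1 scheme).

    think_scores / no_think_scores: reward-model scores for the responses
    sampled in each mode for the same query.
    """
    groups = {"think": think_scores, "no_think": no_think_scores}
    gap = mean(think_scores) - mean(no_think_scores)
    # Assumption: thinking must beat direct answering by more than delta to win;
    # otherwise the cheaper no_think mode is preferred.
    best_mode = "think" if gap > delta else "no_think"
    # Inter-group: every response in the winning mode is rewarded.
    inter = {m: [1.0 if m == best_mode else 0.0] * len(s) for m, s in groups.items()}
    # Intra-group: the best-scoring response(s) within each mode are rewarded.
    intra = {m: [1.0 if x == max(s) else 0.0 for x in s] for m, s in groups.items()}
    return best_mode, inter, intra


best_mode, inter, intra = hgpo_rewards([1.0, 0.5], [0.5, 0.0], delta=0.2)
print(best_mode, inter, intra)
print(group_relative_advantages(intra["think"]))
```

Under these assumptions, a larger `delta` makes it harder for the Thinking group to win the inter-group comparison, which matches the summary's note that raising the margin biases the model toward No-Thinking.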

Evidence

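LHRMs-7B scores 66.7 on AIME24, a 13.6% relative improvement over HFT-DPO-7B (58.7) and a 20.2% relative improvement over DeepSeek-R1-Distill-Qwen-7B (55.5), which is built on the same base.
Hybrid Accuracy (HAcc): LHRMs-7B reaches 71.9% versus 37.1% for HFT-DPO-7B, a 93.8% relative improvement in mode-selection accuracy.
On general-capability benchmarks, LHRMs-7B improves over HFT-DPO-7B by 50.2% on AlpacaEval and 93.4% on Arena-Hard.
LHRMs-1.5B improves over HFT-DPO-1.5B by 11.1% on MBPP and 13.9% on MBPP+, even though RL training used only math data.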
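The percentages above are relative gains, not percentage-point differences. A short check, using only the scores quoted in this section (the helper name `relative_gain` is illustrative), reproduces them:

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# AIME24: LHRMs-7B (66.7) vs. HFT-DPO-7B (58.7)
print(round(relative_gain(66.7, 58.7), 1))  # 13.6
# AIME24: LHRMs-7B (66.7) vs. DeepSeek-R1-Distill-Qwen-7B (55.5)
print(round(relative_gain(66.7, 55.5), 1))  # 20.2
# Hybrid Accuracy: LHRMs-7B (71.9) vs. HFT-DPO-7B (37.1)
print(round(relative_gain(71.9, 37.1), 1))  # 93.8
```

For comparison, the absolute gap on AIME24 against HFT-DPO-7B is 8.0 points, and the Hybrid Accuracy gap is 34.8 percentage points.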

How to Apply

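If you fine-tune your own model: build the SFT data as a mix of thinking examples (<think> tags) and no-thinking examples (<no_think> tags). The key is labeling simple QA as no_think and math or code as think, at a scale of about 1.7M examples.
If you have budget for RL training: apply HGPO, which builds on GRPO. Raising the margin δ biases the model toward No-Thinking (prioritizing response speed) and lowering it biases it toward Thinking (prioritizing accuracy), so it can be tuned to service requirements.
To use the idea right away without training a model: approximate it with a routing layer built on a query classifier, which adds an explicit "think step by step" instruction to the prompt for complex queries and asks for a direct answer on simple ones.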
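For the fine-tuning route, the mixed SFT targets could be assembled roughly as follows. This is a minimal sketch: the tag names come from this summary, but the exact template (whitespace, where the answer sits relative to the closing tag) and the helper names `format_hft_target` and `detect_mode` are assumptions, not the paper's data pipeline.

```python
from typing import Optional


def format_hft_target(answer: str, reasoning: Optional[str] = None) -> str:
    """Build one SFT target string in think or no_think format (assumed template)."""
    if reasoning is not None:
        # Thinking example: reasoning trace inside the tags, final answer after them.
        return f"<think>{reasoning}</think> {answer}"
    # No-thinking example: the answer is wrapped directly.
    return f"<no_think>{answer}</no_think>"


def detect_mode(response: str) -> str:
    """Report which mode a generated response used, based on its leading tag."""
    text = response.lstrip()
    if text.startswith("<think>"):
        return "think"
    if text.startswith("<no_think>"):
        return "no_think"
    return "unknown"


simple = format_hft_target("Hello! How can I help?")
hard = format_hft_target("x = 4", reasoning="2x + 1 = 9, so 2x = 8.")
print(detect_mode(simple), detect_mode(hard))  # no_think think
```

A helper like `detect_mode` is also useful at serving time, for example to log the share of requests answered without thinking.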

Code Example

# Classify query complexity, then pick a mode (a prompt-level imitation of the paper's idea)

def classify_query(query: str) -> str:
    """Classify a query as simple or complex (simple rules used in place of a FastText classifier)."""
    complex_keywords = ['prove', 'solve', 'calculate', 'code', 'implement', 'debug', 'explain why']
    # Substring match on keywords, plus a length heuristic for long queries
    if any(kw in query.lower() for kw in complex_keywords) or len(query) > 100:
        return 'think'
    return 'no_think'

def build_prompt(query: str) -> str:
    mode = classify_query(query)
    if mode == 'think':
        # LRM style: encourage step-by-step reasoning
        return f"{query}\n\nLet's think step by step."
    else:
        # Direct answer
        return f"{query}\n\nAnswer directly and concisely."

# Output format of an actual LHRM, for reference
# - Complex query: <think>...reasoning process...</think> answer
# - Simple query: <no_think>answer</no_think>

print(build_prompt("Can you help me please?"))  # → direct-answer mode
print(build_prompt("Solve: find largest |a|+|b|+|c| given |ax²+bx+c|≤1"))  # → think mode

Terminology

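LRM: Large Reasoning Model. A model such as DeepSeek-R1 or o1 that generates a long thinking process before answering. It is much stronger than a regular LLM on math and code, but its drawback is that it also thinks at length on simple questions.

Chain-of-Thought: A technique in which the model outputs intermediate reasoning ("Step 1: ... Step 2: ...") before the final answer, much like a person writing out the working when solving a problem.

GRPO: Group Relative Policy Optimization. A reinforcement learning method that samples several answers to the same question, compares them with each other, and raises the probability of the better ones. It works without a separate value-function model, which lowers training cost.

HGPO: Hybrid Group Policy Optimization, proposed in this paper. It extends GRPO to learn from comparisons between the Thinking and No-Thinking modes (inter-group) and comparisons within the same mode (intra-group) at the same time.

HFT: Hybrid Fine-Tuning. The initialization stage before RL: ordinary supervised fine-tuning on a mix of think and no_think examples. It serves as a warm-up that avoids cold-start instability in RL.

Hybrid Accuracy: The evaluation metric proposed in this paper. It is the fraction of queries for which the model selects the correct mode (think vs. no_think). The correct mode is determined by sampling several responses in both modes and comparing their rewards.

Overthinking: The inefficiency of LRMs generating long reasoning chains even for simple questions. Spending hundreds or even thousands of tokens on "Hello", for example, is unnecessary thinking that increases response latency and cost.

Test-Time Scaling: A strategy that improves performance by spending more computation at inference time (longer CoT, multiple samples, and so on) instead of adding model parameters. The o1 family is the best-known example.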
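The Hybrid Accuracy entry above can be made concrete with a short sketch. It assumes the reference mode for a query is the one whose sampled responses have the higher mean reward, with ties going to no_think; the tolerance `eps`, the tie-break, and the function names are illustrative assumptions rather than the paper's exact procedure.

```python
from statistics import mean


def preferred_mode(think_rewards, no_think_rewards, eps=0.0):
    """Label a query with the mode whose sampled responses score higher on average."""
    gap = mean(think_rewards) - mean(no_think_rewards)
    # Assumption: when thinking does not help by more than eps, prefer no_think.
    return "think" if gap > eps else "no_think"


def hybrid_accuracy(chosen_modes, reward_samples, eps=0.0):
    """Fraction of queries where the model's chosen mode matches the reference mode.

    chosen_modes: the mode the model picked for each query.
    reward_samples: one (think_rewards, no_think_rewards) pair per query.
    """
    labels = [preferred_mode(t, n, eps) for t, n in reward_samples]
    hits = sum(chosen == label for chosen, label in zip(chosen_modes, labels))
    return hits / len(labels)


chosen = ["think", "no_think", "think"]
samples = [
    ([1.0, 1.0], [0.0, 0.0]),  # thinking clearly helps
    ([0.5, 0.5], [0.5, 0.5]),  # tie, so no_think is the reference
    ([0.0, 0.0], [1.0, 1.0]),  # thinking hurts
]
print(hybrid_accuracy(chosen, samples))
```

In this toy example the model picks the reference mode for the first two queries and thinks unnecessarily on the third, so it matches on two of three queries.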

Original Abstract

Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is particularly unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform thinking based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model's capability for hybrid thinking. Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type. It outperforms existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended thinking processes and provides a solid starting point for building hybrid thinking systems.