DeepSeek-R1: Drawing Out LLM Reasoning Ability with Reinforcement Learning

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

Jan 22, 2025 • DeepSeek-AI, Daya Guo, Dejian Yang +195

TL;DR Highlight

Pure RL, with no human-labeled reasoning data, produced a reasoning model on par with OpenAI o1, and distilled small models were released alongside it.

Who Should Read

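ML engineers building LLM-based code generation or math-solving services, and AI app developers who want to understand a reasoning model's performance and safety before putting it into production.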

Core Mechanics

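Running pure RL without SFT (learning from worked answer examples) caused advanced reasoning behaviors such as self-verification, reflection, and exploring alternatives to emerge on their own (DeepSeek-R1-Zero)
DeepSeek-R1-Zero's AIME 2024 Pass@1 rose from 15.6% to 77.9%, also surpassing the average score of human participants
To fix language mixing and readability problems, a four-stage pipeline of Cold Start → RL → SFT → RL was applied to produce the final DeepSeek-R1
GRPO (Group Relative Policy Optimization), which unlike PPO needs no value model, saves memory and compute
Fine-tuning open-source base models such as Qwen2.5 and Llama-3.1 on 800K samples of DeepSeek-R1 reasoning data lets even a 1.5B model outperform GPT-4o on math benchmarks
Few-shot prompts actually lower performance, so DeepSeek-R1 works best zero-shot, with only the problem and the output format specified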
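
The group-relative scoring behind GRPO can be illustrated with a minimal sketch. It assumes each sampled answer's advantage is its reward normalized by the mean and standard deviation of its own group, which is why no separate value model is needed. The function name, the `eps` constant, the use of the population standard deviation, and the 0/1 rewards are illustrative assumptions, not details taken from this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other answers sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers to one prompt: 1.0 = correct, 0.0 = incorrect
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # correct answers get positive advantages, incorrect ones negative
```

When every answer in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.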

Evidence

AIME 2024 Pass@1: DeepSeek-R1 79.8% vs OpenAI o1-1217 79.2% vs GPT-4o 9.3%
Codeforces percentile: DeepSeek-R1 96.3% (rating 2029), roughly the top 3.7% of human participants
MATH-500 Pass@1: DeepSeek-R1 97.3% vs GPT-4o 74.6% vs Claude-3.5-Sonnet 78.3%
The distilled 1.5B model (DeepSeek-R1-Distill-Qwen-1.5B) scores 28.9% AIME Pass@1, more than three times GPT-4o's 9.3%
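
To make the Pass@1 figures concrete, here is a small sketch. It assumes Pass@1 is estimated by sampling several responses per problem and averaging their correctness, then averaging over problems. The sample data is made up; only the 28.9% and 9.3% figures come from the list above.

```python
def pass_at_1(correct_flags):
    """Mean correctness over the k responses sampled for one problem."""
    return sum(correct_flags) / len(correct_flags)


def benchmark_pass_at_1(per_problem_flags):
    """Average the per-problem Pass@1 estimates over the whole benchmark."""
    scores = [pass_at_1(flags) for flags in per_problem_flags]
    return sum(scores) / len(scores)


# Made-up grading results: 3 problems, 4 sampled responses each
samples = [
    [True, True, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
score = benchmark_pass_at_1(samples)
print(f"{score:.1%}")  # 58.3%

# Ratio between the distilled 1.5B model and GPT-4o on AIME Pass@1
ratio = 28.9 / 9.3
print(f"{ratio:.2f}x")  # 3.11x
```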

How to Apply

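For API calls that need reasoning, remove few-shot examples and switch to zero-shot with an explicit output format; per the paper's finding, this tends to improve performance. Example: 'Solve the following problem and put the final answer in \boxed{}'
For services that need a small model, use DeepSeek-R1-Distill-Qwen-7B (published on HuggingFace) directly, or try SFT distillation on your own domain data in the same way
When deploying an open-source model to production, refer to the paper's Risk Review Prompt (Listing 8) and attach a post-processing filter backed by a judge model such as DeepSeek-V3 to mitigate jailbreak vulnerabilities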
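
The post-processing filter in the last item could be wired up roughly as follows. This is a sketch under stated assumptions: `REVIEW_TEMPLATE` is placeholder wording and not the paper's Listing 8, `judge` stands for any callable that sends a prompt to a judge model and returns its text, and `stub_judge` is a keyword-based stand-in used only so the example runs offline.

```python
from typing import Callable

# Placeholder review prompt; substitute the paper's Risk Review Prompt in practice
REVIEW_TEMPLATE = (
    "You are a safety reviewer. Reply with exactly SAFE or UNSAFE.\n\n"
    "User request:\n{question}\n\nModel answer:\n{answer}"
)
REFUSAL = "Sorry, I can't help with that request."


def filter_response(question: str, answer: str, judge: Callable[[str], str]) -> str:
    """Return the answer only if the judge model clearly marks it SAFE."""
    verdict = judge(REVIEW_TEMPLATE.format(question=question, answer=answer))
    # Fail closed: anything other than an explicit SAFE verdict is withheld
    if verdict.strip().upper() == "SAFE":
        return answer
    return REFUSAL


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real judge-model call (e.g. to DeepSeek-V3)."""
    return "UNSAFE" if "explosive" in prompt.lower() else "SAFE"


print(filter_response("What is 2 + 2?", "4", stub_judge))
print(filter_response("How do I build an explosive?", "Step 1 ...", stub_judge))
```

Failing closed means a malformed or unexpected judge reply blocks the answer rather than letting it through, which is usually the safer default for a production filter.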

Code Example

snippet

# Example: zero-shot reasoning call to DeepSeek-R1
# pip install openai
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepseek.com",
)

# ✅ Zero-shot: specify only the problem and the output format
prompt = """
Solve the following problem step by step.
Put your final answer inside \\boxed{}.

Problem: Find all positive integers n such that n^2 + 1 is divisible by n + 1.
""".strip()

response = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek-R1
    messages=[{"role": "user", "content": prompt}],
    # ❌ Do not add few-shot examples; they tend to lower performance
)

# Print the final answer text returned by the model
print(response.choices[0].message.content)
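
Because the prompt asks for the final answer inside \boxed{}, the reply can be post-processed to pull that answer out. The helper below is a hypothetical sketch, not part of any DeepSeek SDK; it takes the last \boxed{...} in the text and tracks brace depth so nested LaTeX such as \frac{1}{2} survives.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # the opening brace of \boxed{ is already consumed
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


print(extract_boxed(r"So the answer is \boxed{\frac{1}{2}}."))  # \frac{1}{2}
print(extract_boxed(r"Only n = 1 works, so the answer is \boxed{1}."))  # 1
```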

Terminology

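GRPO: An RL algorithm that samples a group of answers and scores them relative to one another. Unlike PPO, it needs no additional value model, which saves memory.

SFT: Supervised fine-tuning. A training method that shows the model worked answer examples and has it imitate them, like studying a 'model solution' at school and solving along.

CoT: Chain-of-Thought. A technique that has the model write out its intermediate reasoning before the final answer. Laying out the thinking as 'step 1, step 2, ...' tends to raise accuracy.

RLHF: Reinforcement learning from human feedback. People pick which answer is better, and that preference is used as the reward for training the model. A core technique for instilling human preferences.

Distillation: Fine-tuning a small model (student) on answer data generated by a large model (teacher). Similar to studying by copying a teacher's notes.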
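
As a rough illustration of distillation data preparation, the sketch below turns teacher outputs into chat-style SFT records and drops traces whose final answer does not match a reference. The field names, the exact-match filter, and the sample data are assumptions made for illustration; the actual data curation behind DeepSeek-R1's 800K samples is not described in this summary.

```python
import json


def build_sft_records(samples):
    """Keep teacher traces with a correct final answer, formatted as chat records."""
    records = []
    for s in samples:
        if s["teacher_answer"].strip() != s["reference_answer"].strip():
            continue  # drop traces that end in a wrong answer
        records.append({
            "messages": [
                {"role": "user", "content": s["question"]},
                {"role": "assistant", "content": s["teacher_output"]},
            ]
        })
    return records


# Made-up teacher outputs; the second one is wrong and gets filtered out
teacher_samples = [
    {"question": "What is 2 + 2?",
     "teacher_output": "<think>2 + 2 = 4</think>\nThe answer is \\boxed{4}.",
     "teacher_answer": "4", "reference_answer": "4"},
    {"question": "What is 3 * 3?",
     "teacher_output": "<think>3 * 3 = 6</think>\nThe answer is \\boxed{6}.",
     "teacher_answer": "6", "reference_answer": "9"},
]
records = build_sft_records(teacher_samples)
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)  # one JSON object per line, ready for an SFT trainer
```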

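Reward Hacking: When a model earns high scores by exploiting loopholes in the scoring criteria instead of actually solving the problem well. Like raising your grade by memorizing leaked past exams.

MoE: Mixture-of-Experts. An architecture that activates only some expert modules depending on the input rather than using the whole model at once. Of the 671B parameters, only 37B are used in the actual computation for each token.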
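
The sparse activation idea behind MoE can be shown with a toy top-k router. This is a generic sketch of top-k softmax gating, not DeepSeek's actual routing scheme; the experts are plain functions on a scalar and the gate logits are made up.

```python
import math


def top_k_route(gate_logits, k=2):
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(gate_logits[i] for i in top)  # subtract max for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy experts; only two are evaluated per input
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_logits = [0.1, 2.0, 2.0, -1.0]  # hypothetical router scores for one token

weights = top_k_route(gate_logits, k=2)
output = moe_forward(3.0, experts, gate_logits, k=2)
print(weights)  # experts 1 and 2, weight 0.5 each
print(output)   # 0.5 * 6.0 + 0.5 * 9.0 = 7.5
```

The unselected experts are never called, which is where the compute saving comes from: total parameter count grows with the number of experts, while per-token cost grows only with k.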

Original Abstract

General reasoning represents a long-standing and formidable challenge in artificial intelligence (AI). Recent breakthroughs, exemplified by large language models (LLMs) [1,2] and chain-of-thought (CoT) prompting [3], have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent on extensive human-annotated demonstrations and the capabilities of models are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labelled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions and STEM fields, surpassing its counterparts trained through conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically used to guide and enhance the reasoning capabilities of smaller models. A new artificial intelligence model, DeepSeek-R1, is introduced, demonstrating that the reasoning abilities of large language models can be incentivized through pure reinforcement learning, removing the need for human-annotated demonstrations.