LIMO: Eliciting Reasoning Ability with Little Data

LIMO: Less is More for Reasoning

Feb 5, 2025 • Yixin Ye, Zhen Huang, Yang Xiao +3

TL;DR Highlight

SFT on just 800 examples reaches 63.3% on AIME24, beating o1-preview with about 1% of the data used by prior approaches.

Who Should Read

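ML engineers who run LLM fine-tuning pipelines or want to build math- and reasoning-specialized models, especially teams that want to cut training data collection costs while maintaining performance.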

Core Mechanics

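Fine-tuning Qwen2.5-32B-Instruct on 800 samples reaches 63.3% on AIME24, roughly 10 times the 6.5% obtained by training the same base model on 100k NuminaMath samples.
Quality matters more than quantity: LIMO with 800 samples (78.1% average) beats OpenThoughts with 114k samples (58.3% average) across the benchmark suite, OOD benchmarks included.
High-quality reasoning chains share four traits: (1) detailed step-by-step explanation, (2) self-verification of intermediate results ('check', 'verify'), (3) exploratory wording ('perhaps', 'might'), and (4) logical connectives ('therefore', 'since').
Even 400 samples lift AIME24 from 16.5% (base model) to 57.5%: what matters is selecting for difficulty and quality, not the number of samples.
Pre-training quality is a prerequisite: with the same 800 samples, Qwen1.5-32B reaches 9.2% while Qwen2.5-32B reaches 63.3%. Small-scale SFT only works when the model already has a sufficient base of mathematical knowledge.
Out-of-distribution (OOD) generalization is also strong: on domains not seen in training, such as the Chinese college entrance exam, graduate entrance exams, and GPQA, LIMO outperforms models trained on 100 times more data.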

Evidence

AIME24: LIMO 63.3% vs NuminaMath-100k 6.5% vs OpenAI-o1-preview 44.6% vs QwQ-32B-Preview 50.0%
MATH500: LIMO 95.6% vs NuminaMath-100k 59.2% vs QwQ-32B-Preview 89.8%
Average across all benchmarks (OOD included): LIMO 78.1% vs OpenThoughts-114k 58.3% vs NuminaMath-100k 32.3%
With only 400 samples, AIME24 reaches 57.5%, a 41-point gain over the base model (16.5%); returns begin to diminish beyond 800 samples.
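
The headline ratios can be rechecked from the figures quoted above. The sketch below uses only numbers from this summary; the variable names are illustrative, and treating NuminaMath-100k as the baseline behind the "1% of the data" claim is an assumption.

```python
# Figures quoted in this summary (percent accuracy).
limo_aime24 = 63.3       # LIMO, 800 samples
numina_aime24 = 6.5      # NuminaMath-100k on the same base model
base_aime24 = 16.5       # base model before SFT
limo_400_aime24 = 57.5   # LIMO with only 400 samples
limo_avg = 78.1          # average over all benchmarks, OOD included
numina_avg = 32.3

limo_samples = 800
numina_samples = 100_000  # assumed reference point for the "1% of the data" claim

# Share of the baseline's data that LIMO uses.
data_fraction = limo_samples / numina_samples

# How many times higher LIMO scores than the NuminaMath-100k fine-tune on AIME24.
accuracy_ratio = limo_aime24 / numina_aime24

# Absolute gains in percentage points.
gain_400 = limo_400_aime24 - base_aime24
avg_gain = limo_avg - numina_avg

print(f"data fraction: {data_fraction:.1%}")
print(f"AIME24 ratio: {accuracy_ratio:.1f}x")
print(f"gain with 400 samples: {gain_400:.1f} points")
print(f"average gain over NuminaMath-100k: {avg_gain:.1f} points")
```

The last value matches the 45.8-point absolute improvement cited in the abstract.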

How to Apply

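Two-stage difficulty filtering: first remove easy problems with Qwen2.5-Math-7B-Instruct, then sample each remaining problem 32 times with DeepSeek-R1-Distill-Qwen-32B and keep only those solved 1 to 3 times. This criterion compresses a pool of several hundred thousand problems down to roughly 2,000.
Automatic scoring of reasoning chain quality: combine length (30%), frequency of verification expressions (20%), frequency of exploratory expressions (25%), and logical connectives (25%), with the keyword frequencies normalized by text length, then use only the top 800 as SFT data.
Base model choice matters more than data: the small-data SFT strategy works only with models whose pre-training included enough math content (Qwen2.5, DeepSeek family, etc.). The same 800 samples on a model with weak pre-training yield only about 9%.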
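
The two-stage difficulty filter described above could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `easy_solver` and `strong_attempt` are hypothetical callables standing in for calls to the weaker and stronger models, the single-attempt check in stage 1 is an assumption, and `make_stub` exists only to make the demo deterministic.

```python
from typing import Callable, Dict, Iterable, List

def count_successes(problem: str, attempt_fn: Callable[[str], bool], n_samples: int = 32) -> int:
    """Call attempt_fn n_samples times and count how many attempts were correct."""
    return sum(1 for _ in range(n_samples) if attempt_fn(problem))

def filter_by_difficulty(
    problems: Iterable[str],
    easy_solver: Callable[[str], bool],
    strong_attempt: Callable[[str], bool],
    n_samples: int = 32,
    min_success: int = 1,
    max_success: int = 3,
) -> List[str]:
    kept = []
    for problem in problems:
        # Stage 1: drop problems the weaker model already solves (assumed: one attempt).
        if easy_solver(problem):
            continue
        # Stage 2: keep problems the stronger model solves only 1 to 3 times out of 32.
        successes = count_successes(problem, strong_attempt, n_samples)
        if min_success <= successes <= max_success:
            kept.append(problem)
    return kept

def make_stub(success_counts: Dict[str, int]) -> Callable[[str], bool]:
    """Hypothetical stand-in for a sampled model: succeeds on the first k calls per problem."""
    calls: Dict[str, int] = {}
    def attempt(problem: str) -> bool:
        calls[problem] = calls.get(problem, 0) + 1
        return calls[problem] <= success_counts.get(problem, 0)
    return attempt

# Demo with stubbed models; real use would wrap model inference plus answer checking.
easy = lambda p: p == "easy"
strong = make_stub({"medium": 20, "hard": 2, "unsolved": 0})
kept = filter_by_difficulty(["easy", "medium", "hard", "unsolved"], easy, strong)
print(kept)
```

Only the problem solved 2 times out of 32 survives: the one solved 20 times is too easy for the stronger model, and the one never solved gives no usable reasoning chain.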

Code Example

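# LIMO-style scoring of reasoning chain quality (heuristic sketch)
import re

# 'let me verify' is omitted because 'verify' already counts it.
VERIFY_WORDS = ['check', 'verify', 'confirm', 'validate']
EXPLORE_WORDS = ['perhaps', 'might', 'maybe', 'alternatively', 'let me try']
LOGIC_WORDS = ['therefore', 'since', 'because', 'thus', 'hence', 'so']

def count_phrases(text: str, phrases: list) -> int:
    # Match whole words only, so 'so' does not also count 'also' or 'reasoning'.
    # Inflected forms such as 'checking' are therefore not counted.
    return sum(len(re.findall(r'\b' + re.escape(p) + r'\b', text)) for p in phrases)

def score_reasoning_chain(text: str) -> float:
    length = len(text.split())  # whitespace-separated words, not model tokens
    if length == 0:
        return 0.0
    lowered = text.lower()

    # Self-verification expressions (20%)
    verify_score = count_phrases(lowered, VERIFY_WORDS) / length

    # Exploratory expressions (25%)
    explore_score = count_phrases(lowered, EXPLORE_WORDS) / length

    # Logical connectives (25%)
    logic_score = count_phrases(lowered, LOGIC_WORDS) / length

    # Length score (30%): longer chains tend to be more elaborated
    length_score = min(length / 2000, 1.0)  # saturates at 2000 words

    total = (
        length_score * 0.30 +
        verify_score * 1000 * 0.20 +  # ad hoc scale factor for keyword frequencies
        explore_score * 1000 * 0.25 +
        logic_score * 1000 * 0.25
    )
    return total

# Usage example: `dataset` holds (question, reasoning, answer) tuples.
dataset = []  # placeholder, replace with your own data
scored = [(score_reasoning_chain(r), q, r, a) for q, r, a in dataset]
scored.sort(key=lambda item: item[0], reverse=True)  # sort by score only
top_800 = scored[:800]  # keep only the top 800 as SFT data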

Terminology

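SFT: Supervised Fine-Tuning. A training method that shows the model examples with reference answers and has it imitate them, much like working through problems while following a model solution.

CoT: Chain of Thought. A technique that has the model write out intermediate steps explicitly ("Step 1: ... Step 2: ..."). This generally yields higher accuracy than producing only the final answer.

OOD: Out-of-Distribution. New problem types whose distribution differs from the training data. High OOD performance suggests the model generalized rather than memorized.

pass@1: The probability of getting the correct answer in a single generation attempt. With k attempts the metric is pass@k, the probability that at least one of them is correct. pass@1 is the metric closest to the accuracy users experience in a real service.

AIME: American Invitational Mathematics Examination, a qualifying round for the USA Mathematical Olympiad. Its competition-level problems make it a benchmark for the upper end of LLM reasoning ability.

inference-time scaling: Improving performance by spending more tokens (the thinking process) at inference time instead of increasing model size or training data. o1 and DeepSeek-R1 take this approach.

DeepSpeed ZeRO-3: A technique that partitions parameters, gradients, and optimizer states across multiple GPUs to spread memory use when training large models. It is the infrastructure that makes full fine-tuning of a 32B-parameter model feasible.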
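
The pass@1 and pass@k entries above can be made concrete with the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), computed from n sampled answers of which c are correct. This is a sketch of that general estimator; the summary does not say how LIMO's own pass@1 numbers were computed, so applying it here is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # Probability that a random size-k subset contains no correct sample, inverted.
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem solved 2 times out of 32 samples, as in the difficulty filter's 1-to-3 band.
print(pass_at_k(32, 2, 1))
print(pass_at_k(32, 2, 8))
```

For k = 1 the estimate reduces to c / n, the plain fraction of correct samples.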

Related Resources

https://github.com/GAIR-NLP/LIMO

Original Abstract

We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500, surpassing previous fine-tuned models (6.5% on AIME24, 59.2% on MATH500) while using only 1% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model's pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as "cognitive templates" that guide reasoning.