Triplets Beat Pairs: Stable and Effective Self-Play Fine-Tuning for LLMs

Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs

Jan 13, 2026 • Yibo Wang, Hailong Sun, Qing-Guo Chen +4

TL;DR Highlight

When labeled data is scarce, SPIN lets a model learn by competing against itself, but its training is unstable. T-SPIN addresses this by adding an anchor of past responses.

Who Should Read

ML engineers and researchers weighing LLM fine-tuning strategies when labeled data is limited, especially teams that want better data efficiency than SFT.

Core Mechanics

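Problem with SPIN: as iterations proceed and synthetic responses get closer to the real ones, the training signal fades and performance fluctuates from one iteration to the next.
Second problem with SPIN: the training reward and the actual generation probability (log-likelihood) are out of step, so a response can earn a higher reward while becoming less likely to be generated.
Core idea of T-SPIN: train on triplets of (real response, most recent synthetic response, proto-synthetic response produced by the initial model). Even when the current signal weakens, the "historical advantage" over the initial responses keeps training going.
An entropy constraint makes training possible without a reference policy, so the reward matches the generation probability and the misalignment disappears.
On Zephyr-7B and Mistral-7B-v0.1, GSM8K improves by 14.82 points and IFEval by 28.32 points, with steady gains across all 4 iterations.
With only 50k samples, T-SPIN reaches a 42.56% average, above SFT on the full 200k set (42.01%). In other words, it matches or exceeds SFT with 25% of the data.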
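
The triplet idea above can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the `Triplet` class, the `build_triplets` helper, and the `generate_current` callable are hypothetical names, and the proto-synthetic responses are assumed to have been generated once by the initial model and cached by prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Triplet:
    prompt: str
    real: str        # expert-annotated response
    synthetic: str   # response from the most recent model
    proto: str       # response from the initial model (generated once)


def build_triplets(
    annotated: Dict[str, str],
    proto_cache: Dict[str, str],
    generate_current: Callable[[str], str],
) -> List[Triplet]:
    """Pair each annotated example with a fresh synthetic response and its cached proto response."""
    triplets = []
    for prompt, real in annotated.items():
        triplets.append(
            Triplet(
                prompt=prompt,
                real=real,
                synthetic=generate_current(prompt),  # hypothetical sampling call
                proto=proto_cache[prompt],
            )
        )
    return triplets
```

Only the `synthetic` field changes between iterations in this sketch. The `real` and `proto` fields stay fixed, which is what lets the proto response act as a stable anchor.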

Evidence

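On Zephyr-7B after 4 iterations, T-SPIN averages 43.47%, versus 40.62% for SPIN and 42.01% for SFT (200k).
On GSM8K, SPIN peaks at iteration 3 and then drops (moving up and down between 33.32 and 35.54), while T-SPIN rises monotonically from 36.20 to 40.67.
T-SPIN with 50k samples (42.56%) beats SFT with 200k samples (42.01%), closing a 4x data gap through self-play.
On Mistral-7B, T-SPIN at iteration 4 averages 45.02%, versus 42.32% for SPIN and 44.17% for SFT, so the result reproduces on a different base model.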
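
To make the size of these gaps easier to compare, the short script below recomputes the margins from the averages quoted above. It only does arithmetic on the reported numbers; the `margins` helper and the variable names are illustrative.

```python
# Average scores (%) quoted in the Evidence section above.
zephyr = {"T-SPIN": 43.47, "SPIN": 40.62, "SFT (200k)": 42.01}
mistral = {"T-SPIN": 45.02, "SPIN": 42.32, "SFT": 44.17}


def margins(scores: dict, method: str = "T-SPIN") -> dict:
    """Return the gap, in percentage points, between `method` and every other entry."""
    base = scores[method]
    return {name: round(base - value, 2) for name, value in scores.items() if name != method}


zephyr_margins = margins(zephyr)
mistral_margins = margins(mistral)
data_fraction = 50_000 / 200_000            # T-SPIN samples / full SFT samples
low_data_margin = round(42.56 - 42.01, 2)   # T-SPIN (50k) minus SFT (200k)
```

On Zephyr-7B the margin over SPIN is 2.85 points and over SFT 1.46 points. On Mistral-7B the margins are 2.70 and 0.85 points. The 50k run uses a quarter of the SFT data and leads by 0.55 points.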

How to Apply

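When SFT data is scarce (only about 25-50% of the full set is available), generate proto-synthetic responses once with the initial model and use triplet training in place of plain SFT.
If a DPO/SPIN-based pipeline shows fluctuating performance across iterations, removing the reference policy and switching the loss to the form current_loss + beta * history_loss, as in the Code Example section, can stabilize it.
Sensitivity is low for α in the range 0.1 to 1.0 and β in the range 0.1 to 1.0, so starting from α=1.0, β=0.1 is reasonable.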
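
The steps above can be arranged into an iteration loop along the following lines. This is a sketch under stated assumptions, not the paper's training script: `run_self_play`, `generate`, and `train` are hypothetical names, and the caller supplies the actual sampling and optimization logic.

```python
from typing import Callable, Dict, List, Tuple

Model = object  # placeholder type; any model handle works in this sketch


def run_self_play(
    initial_model: Model,
    annotated: Dict[str, str],
    generate: Callable[[Model, str], str],
    train: Callable[[Model, List[Tuple[str, str, str, str]]], Model],
    num_iterations: int = 4,
) -> Model:
    """Iterate: build (prompt, real, synthetic, proto) tuples, then train on them."""
    # Proto-synthetic responses come from the initial model and are generated only once.
    proto = {prompt: generate(initial_model, prompt) for prompt in annotated}

    model = initial_model
    for _ in range(num_iterations):
        # Synthetic responses are regenerated from the current model every iteration.
        batch = [
            (prompt, real, generate(model, prompt), proto[prompt])
            for prompt, real in annotated.items()
        ]
        model = train(model, batch)
    return model
```

In this sketch both the synthetic and the proto responses in the first iteration come from the initial model, so the historical term only starts to separate them once the model has been updated at least once.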

Code Example

import torch.nn.functional as F

def tspin_loss(alpha, beta, policy_real_logps, policy_generated_logps, policy_proto_logps):
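    """
    alpha: regularization coefficient (suggested starting value 1.0)
    beta:  balance between current and historical advantage (suggested starting value 0.1)
    policy_real_logps:      log prob of the real (annotated) response
    policy_generated_logps: log prob of the synthetic response from the previous iteration
    policy_proto_logps:     log prob of the proto-synthetic response generated by the initial model
    """
    # Current advantage: real response vs. most recent synthetic response
    current_advantage = policy_real_logps - policy_generated_logps
    # Historical advantage: most recent synthetic response vs. initial (proto-synthetic) response
    history_advantage = policy_generated_logps - policy_proto_logps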

    current_rewards = alpha * current_advantage
    history_rewards = alpha * history_advantage

    current_loss = -F.logsigmoid(current_rewards)
    history_loss = -F.logsigmoid(history_rewards)

    losses = current_loss + beta * history_loss
    return losses.mean()
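
Written out as a formula, the function above computes the following per-example loss before averaging over the batch. This is a transcription of the snippet in this summary, not necessarily the exact objective in the paper. The symbols $x$ (prompt), $y$ (real response), $y'$ (most recent synthetic response), $y_0$ (proto-synthetic response), and $\pi_\theta$ (current policy) are notation assumed here.

```latex
\mathcal{L}(\theta) =
  -\log \sigma\!\big(\alpha\,[\log \pi_\theta(y \mid x) - \log \pi_\theta(y' \mid x)]\big)
  \;-\; \beta \,\log \sigma\!\big(\alpha\,[\log \pi_\theta(y' \mid x) - \log \pi_\theta(y_0 \mid x)]\big)
```

The first term corresponds to `current_loss` and the second to `beta * history_loss`. No reference-policy log-probabilities appear in either term.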

Terminology

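SPIN: A technique in which the model trains by competing against its own previous version. Like a Go AI, it trains repeatedly in a "me vs. past me" setup.

Self-Play Fine-Tuning: A training approach in which the model improves itself using data it generates on its own, alongside a small set of expert-annotated examples. Similar to a chess AI getting stronger by playing against itself.

SFT: Supervised fine-tuning, where the model is shown reference answers (labeled data) and trained to imitate them. Comparable to studying model answers at school.

log-likelihood: A score for how probable the model considers a given text. The higher it is, the more often the model generates that response.

Reference Policy: An earlier version of the model that serves as the baseline during training. SPIN relies on it, which leads to the mismatch between reward and generation probability.

Entropy Constraint: A mechanism that keeps the model's responses diverse. It penalizes the model for collapsing onto a single answer.

Proto-Synthetic Response: A response generated by the model at the very start of training (iteration 0). In T-SPIN it serves as the baseline for measuring how far the model has progressed.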
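
Two of the terms above, log-likelihood and entropy, are easy to pin down numerically. The sketch below uses plain Python and hypothetical helper names. It assumes per-token log-probabilities are already available, and that inputs such as `policy_real_logps` in the Code Example hold per-response sums of this kind.

```python
import math
from typing import List


def sequence_logp(token_logps: List[float]) -> float:
    """Log-likelihood of a response: the sum of its per-token log-probabilities."""
    return sum(token_logps)


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A distribution that puts all its mass on one token has zero entropy, while a uniform distribution has the maximum, which is why an entropy term discourages collapsing onto a single answer.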

Original Abstract

Recently, self-play fine-tuning (SPIN) has been proposed to adapt large language models to downstream applications with scarce expert-annotated data, by iteratively generating synthetic responses from the model itself. However, SPIN is designed to optimize the current reward advantages of annotated responses over synthetic responses at hand, which may gradually vanish during iterations, leading to unstable optimization. Moreover, the utilization of reference policy induces a misalignment issue between the reward formulation for training and the metric for generation. To address these limitations, we propose a novel Triplet-based Self-Play fIne-tuNing (T-SPIN) method that integrates two key designs. First, beyond current advantages, T-SPIN additionally incorporates historical advantages between iteratively generated responses and proto-synthetic responses produced by the initial policy. Even if the current advantages diminish, historical advantages remain effective, stabilizing the overall optimization. Second, T-SPIN introduces the entropy constraint into the self-play framework, which is theoretically justified to support reference-free fine-tuning, eliminating the training-generation discrepancy. Empirical results on various tasks demonstrate not only the superior performance of T-SPIN over SPIN, but also its stable evolution during iterations. Remarkably, compared to supervised fine-tuning, T-SPIN achieves comparable or even better performance with only 25% samples, highlighting its effectiveness when faced with scarce annotated data.