QAQ: 양방향 Semantic Coherence로 고품질 합성 코드 데이터 선별하기

QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Mar 12, 2026•Jiayin Lei, Ming Ma, Yunxi Duan +2•View PDF

TL;DR Highlight

합성 코드 학습 데이터의 25%만 골라도 전체 학습과 동일한 성능을 내는 역방향 데이터 선별 기법

Who Should Read

코드 생성 모델 파인튜닝을 위해 합성 데이터 품질 관리 방법을 고민하는 ML 엔지니어. 대규모 instruction-tuning 데이터셋에서 노이즈를 걸러내는 파이프라인을 설계하는 경우에 특히 유용.

Core Mechanics

기존 IFD(답변 생성 난이도 측정)와 반대 방향으로, '답변을 보고 질문을 얼마나 잘 예측할 수 있나(PPL(Q|A))'를 측정하는 RMI(Reverse Mutual Information) 지표 도입
RMI가 너무 낮으면 Q-A 쌍이 의미적으로 무관한 쓰레기 데이터, 너무 높으면 답변이 질문을 단순 복사/패러프레이즈한 결함 데이터 — 중간값(50~75%)이 최적 학습 신호
질문 복잡도 편향을 제거하기 위해 PPL(Q) 기준으로 10개 구간(decile)으로 나눠 구간 내 상대 순위를 사용하는 Stratified RMI 적용
강한 모델(DeepSeek-Coder-6.7B)과 약한 모델(Qwen3-0.6B)의 RMI 점수 차이(Cognitive Gap)가 큰 샘플만 선택 — 강한 모델은 인정하지만 약한 모델은 어려운, 실제 학습 가치 있는 데이터를 잡아냄
WarriorCoder 310K 데이터에서 25%만 선택해도 HumanEval+ 72.56으로 전체 학습(72.56)과 동일 성능, IFD(66.46) 대비 6점 이상 앞섬
RMI 점수는 파인튜닝 전후 Pearson ρ=0.9539로 매우 안정적 — 한 번 계산하면 재계산 없이 재사용 가능

Evidence

RMI 50-75% 구간 25% 데이터로 HumanEval+ 72.56 달성 — 전체 100% 데이터(72.56)와 동일, IFD 25%(66.46) 대비 +6.1점
Disagreement 기반 선택(Diff-High)이 Consensus 기반(Sum-High)보다 HumanEval+ 71.95 vs 68.90으로 +3.05점 우세
RMI와 IFD의 Spearman 상관계수 ρ=0.252로 두 지표가 서로 다른 데이터 품질 측면을 측정함을 확인
강한/약한 모델 Top-50% 선택 결과의 overlap은 76.1%지만, 동일 선택 전략 내 Diff-High vs Sum-High overlap은 13.85%에 불과

How to Apply

합성 코드 데이터셋이 있을 때: 강한 모델(6~7B 코더)과 약한 모델(0.5~1B)로 각 샘플의 PPL(Q)와 PPL(Q|A)를 계산하고, PPL(Q) 기준 10개 구간으로 나눠 구간 내 RMI 순위를 구한 뒤 두 모델의 순위 차이(rs - rw)가 높은 상위 25%만 학습에 사용
PPL(Q|A) 계산 시 단순 텍스트 이어붙이기 대신 'Given an answer, generate the most likely computer science question...' 형태의 역방향 프롬프트를 채팅 템플릿에 넣어야 정확한 RMI 측정 가능
WarriorCoder처럼 seedless 합성 파이프라인(Magpie 등)으로 대량 생성한 데이터에서 노이즈 제거 용도로 바로 적용 가능 — 학습 전 1회만 RMI 점수 계산하면 되므로 추가 비용은 샘플당 forward pass 2회

Code Example

snippet

Terminology

IFDInstruction-Following Difficulty의 약자. '질문을 줬을 때 모델이 답변을 생성하기 얼마나 어려운가'를 perplexity 비율로 측정하는 기존 데이터 품질 지표.

RMIReverse Mutual Information. 이 논문의 핵심 지표로, '답변을 봤을 때 원래 질문을 얼마나 잘 예측할 수 있나'를 측정. 답변이 질문을 잘 설명할수록 값이 높음.

Perplexity언어 모델이 텍스트를 얼마나 '놀랍게' 느끼는지 수치화한 것. 낮을수록 모델이 해당 텍스트를 자연스럽게 예측했다는 의미. PPL로 줄여씀.

Teacher-forcing모델이 자체 생성한 토큰 대신 실제 정답 토큰을 다음 입력으로 주면서 loss를 계산하는 학습/평가 방식. perplexity 계산에 주로 사용.

Seedless 합성 데이터사람이 만든 예시(seed) 없이 LLM에 직접 프롬프트를 넣어 instruction-response 쌍을 대량 생성하는 방식. 다양성은 높지만 노이즈도 많음.

Stratified RMI질문 복잡도(PPL(Q))가 서로 다른 샘플들을 직접 비교하는 편향을 없애기 위해, 비슷한 복잡도끼리 묶어서 구간 내 상대 순위로 비교하는 기법. 체육 수행평가를 학년별로 나눠서 등수 매기는 것과 유사.

Cognitive Gap강한 모델과 약한 모델이 같은 데이터를 다르게 평가하는 차이. 강한 모델만 좋다고 인식하는 샘플은 진짜 학습 가치가 있는 데이터일 가능성이 높다는 아이디어.

HumanEvalOpenAI가 만든 코드 생성 벤치마크. 함수 설명 보고 파이썬 코드 작성하는 164개 문제로 구성. pass@1은 첫 번째 시도에서 통과한 비율.

Related Resources

Original Abstract (Expand)

Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard a model generates an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability can distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.