QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
TL;DR Highlight
A reverse data selection technique that selects only 25% of synthetic code training data and achieves the same performance as training on the full dataset.
Who Should Read
ML engineers managing large synthetic code training datasets, and researchers working on data-efficient training for code LLMs.
Core Mechanics
- Proposed a reverse data selection approach: instead of selecting 'what looks good', filter out what looks bad and keep the rest
- Training on just 25% of synthetic code data selected by this method matches full-dataset performance
- The selection criterion is Reverse Mutual Information (RMI): how well the answer predicts the query; both extremes signal problems (low RMI indicates semantic misalignment, excessively high RMI indicates defect patterns the model recognizes too easily)
- Works well for synthetic data where quality varies significantly across samples
- Reduces training compute by 75% while maintaining downstream code generation quality
- The method is model-agnostic and can be applied to any synthetic code dataset
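The reverse-direction criterion above can be sketched in a few lines. This is a minimal illustration, not the paper's estimator: the helper name and the toy log-probability numbers are hypothetical, and obtaining real values of log P(Q|A) and log P(Q) would require scoring the query tokens with a language model.

```python
def reverse_score(logprob_q_given_a: float, logprob_q: float) -> float:
    """Illustrative RMI score: log P(Q|A) - log P(Q).

    Higher means the answer carries more information about the query.
    (Hypothetical helper; the paper's exact estimator may differ.)
    """
    return logprob_q_given_a - logprob_q

# Toy numbers: a well-aligned answer makes the query far more predictable
# than an unconditioned guess; a mismatched answer barely helps.
aligned = reverse_score(logprob_q_given_a=-12.0, logprob_q=-40.0)     # 28.0
mismatched = reverse_score(logprob_q_given_a=-39.0, logprob_q=-40.0)  # 1.0
```

Both extremes of this score are then treated as suspect, not just the low end.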
Evidence
- A 25% subset selected by stratified RMI achieves performance equivalent to 100%-data training on code generation benchmarks (WarriorCoder dataset)
- 4x reduction in training compute with no measurable quality loss
- Reverse selection outperforms forward-direction methods such as IFD at the same data budget
- Results hold across multiple code LLM architectures and sizes
How to Apply
- Apply to any synthetic code dataset to cut training time by ~75% — especially valuable for datasets generated by expensive frontier models
- Run the RMI scoring pass first, then filter both extremes: drop low-RMI samples (semantically misaligned) and excessively high-RMI samples (trivially recognizable defect patterns), keep a stratified subset of the middle band, and train normally
- Combine with data deduplication for maximum efficiency gains before the reverse selection pass
Code Example
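A minimal sketch of the stratified selection workflow described under "How to Apply". The function name and all thresholds are illustrative assumptions, not the paper's tuned values, and the paper's strong/weak model disagreement step for picking valid-yet-challenging samples is omitted here.

```python
def select_by_stratified_rmi(samples, rmi_scores, keep_fraction=0.25,
                             low_quantile=0.20, high_quantile=0.95):
    """Drop both RMI extremes, then keep the target fraction of the middle band.

    Low RMI suggests semantic misalignment; very high RMI suggests
    trivially recognizable (possibly defective) patterns. Thresholds
    are illustrative, not the paper's tuned values.
    """
    ranked = sorted(range(len(samples)), key=lambda i: rmi_scores[i])
    n = len(ranked)
    # Keep only the middle RMI band, discarding both extremes.
    middle = ranked[int(n * low_quantile): int(n * high_quantile)]
    # From that band, keep the target fraction of the full dataset.
    kept = sorted(middle[: int(n * keep_fraction)])
    return [samples[i] for i in kept]

# Usage: 20 toy samples whose RMI score equals their index.
data = [f"sample_{i}" for i in range(20)]
scores = list(range(20))
subset = select_by_stratified_rmi(data, scores)
# Keeps 25% of 20 = 5 samples, all drawn from the middle RMI band.
```

Deduplication, as noted above, would run before this pass so that the quantile cuts are not skewed by repeated samples.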
Terminology
- RMI (Reverse Mutual Information): the information gain about the query conditioned on the answer (Q|A); the paper's core quality signal.
- IFD (Instruction-Following Difficulty): a prior selection metric that measures how hard it is for a model to generate the answer given the query (A|Q).
- Stratified RMI: selection that discards both RMI extremes, since low RMI signals semantic misalignment and excessively high RMI signals easily recognized defect patterns.
Original Abstract
Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may indicate defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.