QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
TL;DR Highlight
A reverse data selection technique that selects only 25% of synthetic code training data and achieves the same performance as training on the full dataset.
Who Should Read
ML engineers managing large synthetic code training datasets, and researchers working on data-efficient training for code LLMs.
Core Mechanics
- Proposed a reverse data selection approach: instead of selecting 'what looks good', filter out what looks bad and keep the rest
- Training on just 25% of synthetic code data selected by this method matches full-dataset performance
- The selection criterion is Reverse Mutual Information (RMI): how well the answer predicts the query; both extremes signal problems (low RMI indicates semantic misalignment, excessively high RMI indicates defect patterns the model recognizes too easily)
- Works well for synthetic data where quality varies significantly across samples
- Reduces training compute by 75% while maintaining downstream code generation quality
- The method is model-agnostic and can be applied to any synthetic code dataset
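The reverse-direction criterion above can be sketched in a few lines. This is a minimal illustration, not the paper's estimator: the helper name and the toy log-probability numbers are hypothetical, and obtaining real values of log P(Q|A) and log P(Q) would require scoring the query tokens with a language model.

```python
def reverse_score(logprob_q_given_a: float, logprob_q: float) -> float:
    """Illustrative RMI score: log P(Q|A) - log P(Q).

    Higher means the answer carries more information about the query.
    (Hypothetical helper; the paper's exact estimator may differ.)
    """
    return logprob_q_given_a - logprob_q

# Toy numbers: a well-aligned answer makes the query far more predictable
# than an unconditioned guess; a mismatched answer barely helps.
aligned = reverse_score(logprob_q_given_a=-12.0, logprob_q=-40.0)     # 28.0
mismatched = reverse_score(logprob_q_given_a=-39.0, logprob_q=-40.0)  # 1.0
```

Both extremes of this score are then treated as suspect, not just the low end.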
Evidence
- A 25% subset selected by stratified RMI achieves performance equivalent to 100%-data training on code generation benchmarks (WarriorCoder dataset)
- 4x reduction in training compute with no measurable quality loss
- Reverse selection outperforms forward-direction methods such as IFD at the same data budget
- Results hold across multiple code LLM architectures and sizes
How to Apply
- Apply to any synthetic code dataset to cut training time by ~75% — especially valuable for datasets generated by expensive frontier models
- Run the RMI scoring pass first, then filter both extremes: drop low-RMI samples (semantically misaligned) and excessively high-RMI samples (trivially recognizable defect patterns), keep a stratified subset of the middle band, and train normally
- Combine with data deduplication for maximum efficiency gains before the reverse selection pass
Code Example
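A minimal sketch of the stratified selection workflow described under "How to Apply". The function name and all thresholds are illustrative assumptions, not the paper's tuned values, and the paper's strong/weak model disagreement step for picking valid-yet-challenging samples is omitted here.

```python
def select_by_stratified_rmi(samples, rmi_scores, keep_fraction=0.25,
                             low_quantile=0.20, high_quantile=0.95):
    """Drop both RMI extremes, then keep the target fraction of the middle band.

    Low RMI suggests semantic misalignment; very high RMI suggests
    trivially recognizable (possibly defective) patterns. Thresholds
    are illustrative, not the paper's tuned values.
    """
    ranked = sorted(range(len(samples)), key=lambda i: rmi_scores[i])
    n = len(ranked)
    # Keep only the middle RMI band, discarding both extremes.
    middle = ranked[int(n * low_quantile): int(n * high_quantile)]
    # From that band, keep the target fraction of the full dataset.
    kept = sorted(middle[: int(n * keep_fraction)])
    return [samples[i] for i in kept]

# Usage: 20 toy samples whose RMI score equals their index.
data = [f"sample_{i}" for i in range(20)]
scores = list(range(20))
subset = select_by_stratified_rmi(data, scores)
# Keeps 25% of 20 = 5 samples, all drawn from the middle RMI band.
```

Deduplication, as noted above, would run before this pass so that the quantile cuts are not skewed by repeated samples.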
Terminology
- RMI (Reverse Mutual Information): the information gain about the query conditioned on the answer (Q|A); the paper's core quality signal.
- IFD (Instruction-Following Difficulty): a prior selection metric that measures how hard it is for a model to generate the answer given the query (A|Q).
- Stratified RMI: selection that discards both RMI extremes, since low RMI signals semantic misalignment and excessively high RMI signals easily recognized defect patterns.
Original Abstract
Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may indicate defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.