DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning
TL;DR Highlight
A framework that auto-generates specialized fine-tuning data for finance, medicine, math and more from just a task definition — no human labeling needed.
Who Should Read
ML teams fine-tuning LLMs for specialized domains who need training data but can't afford extensive human labeling, and researchers working on automated dataset generation.
Core Mechanics
- Proposed a fully automated framework for generating domain-specific fine-tuning data from a task definition alone
- Works across diverse specialized domains: finance, medicine, mathematics, legal, and more
- Pipeline generates diverse, high-quality instruction-response pairs without human annotation
- Uses a multi-stage process: task decomposition, example generation, quality filtering, deduplication
- Generated data is competitive with human-curated domain data for fine-tuning performance
- Dramatically reduces the cost and time for creating domain-specific training datasets
Evidence
- Models fine-tuned on generated data match or outperform those trained on human-curated data on domain benchmarks
- Data generation cost is orders of magnitude lower than human annotation
- Quality filtering step removes ~30-40% of generated samples, significantly improving training data quality
- Results validated across finance, medical QA, and math reasoning domains
How to Apply
- Write a clear task definition specifying the domain, expected input format, and desired output format
- Run the generation pipeline to produce a large diverse dataset, then apply quality filtering
- Fine-tune your base model on the generated data — works best with iterative generation where the model's own outputs seed the next round
Code Example
# DS²-INSTRUCT Core Prompt Pattern Examples
# Step 1: Initial Keyword Generation
initial_keyword_prompt = """
Task Context: You are an expert in {domain}.
Task Description: {task_description}
Instructions: Generate 50 core keywords that represent the most essential concepts for this task.
Requirements:
- List exactly 50 core concepts separated by commas
- Use underscores for multi-word concepts (e.g., asset_valuation)
- Provide only the comma-separated list without any other text
Core Keywords:
"""
# Step 2: Bidirectional Keyword Expansion
bidirectional_expansion_prompt = """
Task Context: You are an expert in the domain related to: {task_description}
Sample Keywords: {sampled_keywords}
Instructions: Based on the sample keywords, generate new concepts in two directions:
1. Prerequisite Concepts: fundamental concepts learners must understand BEFORE the sample keywords
2. Advanced Concepts: specialized topics that BUILD UPON the sample keywords
Requirements:
- Generate 5 concepts for each direction
- Use underscores for multi-word concepts
- Provide comma-separated lists
"""
# Step 3: Bloom's Taxonomy-based Instruction Generation
bloom_levels = {
"Remembering": "recall of factual knowledge, definitions, basic concepts",
"Understanding": "conceptual understanding, explanation of relationships",
"Applying": "practical use of methods, real-world application",
"Analyzing": "breaking down complex ideas, identifying patterns",
"Evaluating": "critical judgment, validation, justification of decisions",
"Creating": "original thinking, synthesis of ideas, novel applications"
}
instruction_gen_prompt = """
Task Description: {task_description}
Keyword: {keyword}
Question Type: {cognitive_level} - {cognitive_level_description}
Generate a high-quality question that precisely targets the keyword and question type.
Directly output the question only.
Generated Question:
"""
# Step 4: Self-consistency Filtering
# Generate N=5 responses and retain only those with agreement at or above threshold τ=3/5
def self_consistency_filter(instruction, model, N=5, tau=0.6):
responses = [model.generate(instruction) for _ in range(N)]
answers = [extract_answer(r) for r in responses]
from collections import Counter
most_common_answer, count = Counter(answers).most_common(1)[0]
vote_ratio = count / N
if vote_ratio >= tau:
return most_common_answer # Retain as high-quality data
else:
return None # Remove via filteringTerminology
Related Resources
Original Abstract (Expand)
Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.