DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

Mar 13, 2026•Ruiyao Xu, Noelle I. Samia, Han Liu•View PDF

TL;DR Highlight

A framework that auto-generates specialized fine-tuning data for finance, medicine, math and more from just a task definition — no human labeling needed.

Who Should Read

ML teams fine-tuning LLMs for specialized domains who need training data but can't afford extensive human labeling, and researchers working on automated dataset generation.

Core Mechanics

Proposed a fully automated framework for generating domain-specific fine-tuning data from a task definition alone
Works across diverse specialized domains: finance, medicine, mathematics, legal, and more
Pipeline generates diverse, high-quality instruction-response pairs without human annotation
Uses a multi-stage process: task decomposition, example generation, quality filtering, deduplication
Generated data is competitive with human-curated domain data for fine-tuning performance
Dramatically reduces the cost and time for creating domain-specific training datasets

Evidence

Models fine-tuned on generated data match or outperform those trained on human-curated data on domain benchmarks
Data generation cost is orders of magnitude lower than human annotation
Quality filtering step removes ~30-40% of generated samples, significantly improving training data quality
Results validated across finance, medical QA, and math reasoning domains

How to Apply

Write a clear task definition specifying the domain, expected input format, and desired output format
Run the generation pipeline to produce a large diverse dataset, then apply quality filtering
Fine-tune your base model on the generated data — works best with iterative generation where the model's own outputs seed the next round

Code Example

snippet

# DS²-INSTRUCT Core Prompt Pattern Examples

# Step 1: Initial Keyword Generation
initial_keyword_prompt = """
Task Context: You are an expert in {domain}.
Task Description: {task_description}

Instructions: Generate 50 core keywords that represent the most essential concepts for this task.
Requirements:
- List exactly 50 core concepts separated by commas
- Use underscores for multi-word concepts (e.g., asset_valuation)
- Provide only the comma-separated list without any other text
Core Keywords:
"""

# Step 2: Bidirectional Keyword Expansion
bidirectional_expansion_prompt = """
Task Context: You are an expert in the domain related to: {task_description}
Sample Keywords: {sampled_keywords}

Instructions: Based on the sample keywords, generate new concepts in two directions:
1. Prerequisite Concepts: fundamental concepts learners must understand BEFORE the sample keywords
2. Advanced Concepts: specialized topics that BUILD UPON the sample keywords

Requirements:
- Generate 5 concepts for each direction
- Use underscores for multi-word concepts
- Provide comma-separated lists
"""

# Step 3: Bloom's Taxonomy-based Instruction Generation
bloom_levels = {
    "Remembering": "recall of factual knowledge, definitions, basic concepts",
    "Understanding": "conceptual understanding, explanation of relationships",
    "Applying": "practical use of methods, real-world application",
    "Analyzing": "breaking down complex ideas, identifying patterns",
    "Evaluating": "critical judgment, validation, justification of decisions",
    "Creating": "original thinking, synthesis of ideas, novel applications"
}

instruction_gen_prompt = """
Task Description: {task_description}
Keyword: {keyword}
Question Type: {cognitive_level} - {cognitive_level_description}

Generate a high-quality question that precisely targets the keyword and question type.
Directly output the question only.
Generated Question:
"""

# Step 4: Self-consistency Filtering
# Generate N=5 responses and retain only those with agreement at or above threshold τ=3/5
def self_consistency_filter(instruction, model, N=5, tau=0.6):
    responses = [model.generate(instruction) for _ in range(N)]
    answers = [extract_answer(r) for r in responses]
    from collections import Counter
    most_common_answer, count = Counter(answers).most_common(1)[0]
    vote_ratio = count / N
    if vote_ratio >= tau:
        return most_common_answer  # Retain as high-quality data
    else:
        return None  # Remove via filtering

Terminology

Domain-Specific Fine-TuningAdapting a general LLM to perform well on a specific professional domain by training on curated domain data.

Instruction TuningFine-tuning format where training examples consist of a task instruction paired with an expected response.

Quality FilteringAutomated processes to remove low-quality generated examples before they enter the training set.

Synthetic Data GenerationUsing AI models to create training data rather than relying on human-authored examples.

Related Resources

DS²-INSTRUCT GitHub

Original Abstract (Expand)

Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.