Verifiable Format Control for Large Language Model Generations
TL;DR Highlight
7B-class LLMs often fail to follow format instructions such as JSON output — this work addresses the problem with a dataset built on Python verification functions and progressive training.
Who Should Read
Backend developers who need to parse LLM API responses as JSON or specific formats, or ML engineers fine-tuning small open-source LLMs for production deployment.
Core Mechanics
- 7B-class open-source LLMs (Mistral, LLaMA-2/3) follow detailed format instructions far less reliably than GPT-4 — especially when 2-3 constraints are combined
- GPT-4o-based format verification: 70% accuracy, $2.38 per 200 samples. Python function: 100% accuracy, $0, hundreds of times faster
- VFF (Verifiable Format Following) dataset: 60 meta-constraints + 52K question combinations, automatically verifiable by Python bool functions
- Self-improvement approach: model scores its own generated responses via Python functions for SFT + DPO training data — no external LLM API calls needed
- Progressive training from level-1 (1 constraint) through level-2 to level-3 (3 constraints) enables LLaMA-3-8B to exceed GPT-4-turbo at level-3 (38.36% vs 35.31%)
- GPT-4o itself judges inconsistently — returning a different verdict on up to 25-52% of repeated queries for the same format question
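The self-improvement loop from the bullets above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `generate` is a hypothetical stand-in for the model's own sampler, and the pairing logic simply crosses passing and failing samples.

```python
def build_preference_pairs(generate, prompt, verifier, vars, n_samples=8):
    """Sample responses, score each with a Python verification function,
    and build SFT targets (passing samples) plus DPO preference pairs
    (passing vs failing). No external LLM API calls are involved.
    `generate` is a hypothetical stand-in for the model's own sampler."""
    samples = [generate(prompt) for _ in range(n_samples)]
    passed = [s for s in samples if verifier(s, vars)]
    failed = [s for s in samples if not verifier(s, vars)]
    sft_data = [(prompt, s) for s in passed]
    dpo_pairs = [(prompt, chosen, rejected)
                 for chosen in passed for rejected in failed]
    return sft_data, dpo_pairs
```

Because the verifier is a deterministic Python function, the same loop can label arbitrarily many samples at zero cost — which is what makes the synthesized SFT+DPO data cheap to scale.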
Evidence
- LLaMA-3-8B after training: VFF level-3 accuracy rises from 15.81% (base) to 38.36% (trained), exceeding GPT-4-turbo's 35.31%
- Python-based vs GPT-4o judgment: accuracy 100% vs 70%, processing time 0.52s vs 205.10s, cost $0 vs $2.383/200 samples
- GPT-4o format judgment consistency: 48% inconsistency rate at temperature 1.0 (50 repeated queries), averaging 10.15 judgment flips
- Progressive SFT+DPO training consistently outperforms DPO-only: 38.36% vs 17.95% at level-3
How to Apply
- For fine-tuning small LLMs toward JSON output: filter 'Limited Structure' constraint samples from the VFF dataset (huggingface.co/datasets/jinqij/VFF) and use them as SFT training data.
- For custom format constraints: define meta-constraints as 'constraint text + variable candidates + Python verification function', auto-generate DPO pairs by scoring model outputs with Python.
- Already using LLaMA-Factory: apply the paper's training settings (LoRA rank=64, alpha=128, lr=5e-6, AdamW, cosine scheduler, 8 epochs) with SFTTrainer then DPOTrainer.
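A meta-constraint of the kind described above ('constraint text + variable candidates + Python verification function') might be represented like this. The field names and the `instantiate` helper are assumptions for illustration, not the dataset's actual schema:

```python
import random

# Hypothetical sketch of one VFF-style meta-constraint: an instruction
# template, candidate values for its variables, and a Python verifier.
META_CONSTRAINT = {
    "template": "Answer in at most {0} words.",
    "variable_candidates": [[30, 50, 100]],
    "verify": lambda response, vars: len(response.split()) <= int(vars[0]),
}

def instantiate(meta):
    """Pick one candidate per variable slot; return (instruction, verifier, vars)."""
    vars = [random.choice(candidates) for candidates in meta["variable_candidates"]]
    return meta["template"].format(*vars), meta["verify"], vars
```

From here, scoring a model output is just `verify(response, vars)` — the resulting boolean is what selects chosen vs rejected responses when auto-generating DPO pairs.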
Code Example
# Python verification function example - automatically checks JSON format compliance
import json

def verify_json_format(response_text, vars, type=0):
    # Returns True if the response parses as valid JSON
    try:
        json.loads(response_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True

# Word limit verification function
def verify_word_limit(response_text, vars, type=0):
    word_limit = int(vars[0])  # vars[0] = 30, 50, 100, etc.
    word_count = len(response_text.split())
    meets_criteria = word_count <= word_limit
    if type == 0:
        # type=0: hard pass/fail
        return meets_criteria
    else:
        # type!=0: soft score, penalized by how far the response exceeds the limit
        if meets_criteria:
            return 1
        else:
            return 1 - (word_count - word_limit) / word_limit

# Combine multiple constraints with AND (level-c verification)
def verify_all_constraints(response, constraint_fns, vars_list):
    # Must pass all constraints for I=1
    return all(fn(response, v) for fn, v in zip(constraint_fns, vars_list))

# Usage example
response = '{"answer": "Paris"}'
print(verify_json_format(response, []))  # True

response_long = "This is a very long response with many many words"
print(verify_word_limit(response_long, [5]))  # False

# Combined check: valid JSON AND at most 10 words
print(verify_all_constraints(response, [verify_json_format, verify_word_limit], [[], [10]]))  # True
Original Abstract
Recent Large Language Models (LLMs) have demonstrated satisfying general instruction following ability. However, small LLMs with about 7B parameters still struggle with fine-grained format following (e.g., JSON format), which seriously hinders the advancement of their applications. Most existing methods focus on benchmarking general instruction following while overlooking how to improve the specific format following ability of small LLMs. Besides, these methods often rely on evaluations based on advanced LLMs (e.g., GPT-4), which can introduce the intrinsic bias of LLMs and be costly due to the API calls. In this paper, we first curate a fully verifiable format following dataset VFF. In contrast to existing works often adopting external LLMs for instruction-following validation, every sample of VFF can be easily validated with a Python function. Further, we propose to leverage this verifiable feature to synthesize massive data for progressively training small LLMs, in order to improve their format following abilities. Experimental results highlight the prevalent limitations in the format following capabilities of 7B level open-source LLMs and demonstrate the effectiveness of our method in enhancing this essential ability.