Verifiable Format Control for Large Language Model Generations
TL;DR Highlight
7B-class LLMs often fail to follow format instructions such as JSON output — this work addresses the problem with a dataset built on Python verification functions and progressive training.
Who Should Read
Backend developers who need to parse LLM API responses as JSON or specific formats, or ML engineers fine-tuning small open-source LLMs for production deployment.
Core Mechanics
- 7B-class open-source LLMs (Mistral, LLaMA-2/3) follow detailed format instructions far less reliably than GPT-4 — especially when 2-3 constraints are combined
- GPT-4o-based format verification: 70% accuracy, $2.38 per 200 samples. Python function: 100% accuracy, $0, hundreds of times faster
- VFF (Verifiable Format Following) dataset: 60 meta-constraints + 52K question combinations, automatically verifiable by Python bool functions
- Self-improvement approach: model scores its own generated responses via Python functions for SFT + DPO training data — no external LLM API calls needed
- Progressive training from level-1 (1 constraint) through level-2 to level-3 (3 constraints) enables LLaMA-3-8B to exceed GPT-4-turbo at level-3 (38.36% vs 35.31%)
- GPT-4o itself judges inconsistently — returning a different verdict on up to 25-52% of repeated queries for the same format question
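The self-improvement loop from the bullets above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `generate` is a hypothetical stand-in for the model's own sampler, and the pairing logic simply crosses passing and failing samples.

```python
def build_preference_pairs(generate, prompt, verifier, vars, n_samples=8):
    """Sample responses, score each with a Python verification function,
    and build SFT targets (passing samples) plus DPO preference pairs
    (passing vs failing). No external LLM API calls are involved.
    `generate` is a hypothetical stand-in for the model's own sampler."""
    samples = [generate(prompt) for _ in range(n_samples)]
    passed = [s for s in samples if verifier(s, vars)]
    failed = [s for s in samples if not verifier(s, vars)]
    sft_data = [(prompt, s) for s in passed]
    dpo_pairs = [(prompt, chosen, rejected)
                 for chosen in passed for rejected in failed]
    return sft_data, dpo_pairs
```

Because the verifier is a deterministic Python function, the same loop can label arbitrarily many samples at zero cost — which is what makes the synthesized SFT+DPO data cheap to scale.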
Evidence
- LLaMA-3-8B after training: VFF level-3 accuracy rises from 15.81% (base) to 38.36% (trained), exceeding GPT-4-turbo's 35.31%
- Python-based vs GPT-4o judgment: accuracy 100% vs 70%, processing time 0.52s vs 205.10s, cost $0 vs $2.383/200 samples
- GPT-4o format judgment consistency: 48% inconsistency rate at temperature 1.0 (50 repeated queries), averaging 10.15 judgment flips
- Progressive SFT+DPO training consistently outperforms DPO-only: 38.36% vs 17.95% at level-3
How to Apply
- For fine-tuning small LLMs toward JSON output: filter 'Limited Structure' constraint samples from the VFF dataset (huggingface.co/datasets/jinqij/VFF) and use them as SFT training data.
- For custom format constraints: define meta-constraints as 'constraint text + variable candidates + Python verification function', auto-generate DPO pairs by scoring model outputs with Python.
- Already using LLaMA-Factory: apply the paper's training settings (LoRA rank=64, alpha=128, lr=5e-6, AdamW, cosine scheduler, 8 epochs) with SFTTrainer then DPOTrainer.
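A meta-constraint of the kind described above ('constraint text + variable candidates + Python verification function') might be represented like this. The field names and the `instantiate` helper are assumptions for illustration, not the dataset's actual schema:

```python
import random

# Hypothetical sketch of one VFF-style meta-constraint: an instruction
# template, candidate values for its variables, and a Python verifier.
META_CONSTRAINT = {
    "template": "Answer in at most {0} words.",
    "variable_candidates": [[30, 50, 100]],
    "verify": lambda response, vars: len(response.split()) <= int(vars[0]),
}

def instantiate(meta):
    """Pick one candidate per variable slot; return (instruction, verifier, vars)."""
    vars = [random.choice(candidates) for candidates in meta["variable_candidates"]]
    return meta["template"].format(*vars), meta["verify"], vars
```

From here, scoring a model output is just `verify(response, vars)` — the resulting boolean is what selects chosen vs rejected responses when auto-generating DPO pairs.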
Code Example
# Python verification function example - automatically checks JSON format compliance
import json

def verify_json_format(response_text, vars, type=0):
    # Returns True if the response parses as valid JSON
    try:
        json.loads(response_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True

# Word limit verification function
def verify_word_limit(response_text, vars, type=0):
    word_limit = int(vars[0])  # vars[0] = 30, 50, 100, etc.
    word_count = len(response_text.split())
    meets_criteria = word_count <= word_limit
    if type == 0:
        # type=0: hard pass/fail
        return meets_criteria
    else:
        # type!=0: soft score, penalized by how far the response exceeds the limit
        if meets_criteria:
            return 1
        else:
            return 1 - (word_count - word_limit) / word_limit

# Combine multiple constraints with AND (level-c verification)
def verify_all_constraints(response, constraint_fns, vars_list):
    # Must pass all constraints for I=1
    return all(fn(response, v) for fn, v in zip(constraint_fns, vars_list))

# Usage example
response = '{"answer": "Paris"}'
print(verify_json_format(response, []))  # True

response_long = "This is a very long response with many many words"
print(verify_word_limit(response_long, [5]))  # False

# Combined check: valid JSON AND at most 10 words
print(verify_all_constraints(response, [verify_json_format, verify_word_limit], [[], [10]]))  # True
Original Abstract
Recent Large Language Models (LLMs) have demonstrated satisfying general instruction following ability. However, small LLMs with about 7B parameters still struggle with fine-grained format following (e.g., JSON format), which seriously hinders the advancement of their applications. Most existing methods focus on benchmarking general instruction following while overlooking how to improve the specific format following ability of small LLMs. Besides, these methods often rely on evaluations based on advanced LLMs (e.g., GPT-4), which can introduce the intrinsic bias of LLMs and be costly due to the API calls. In this paper, we first curate a fully verifiable format following dataset VFF. In contrast to existing works often adopting external LLMs for instruction-following validation, every sample of VFF can be easily validated with a Python function. Further, we propose to leverage this verifiable feature to synthesize massive data for progressively training small LLMs, in order to improve their format following abilities. Experimental results highlight the prevalent limitations in the format following capabilities of 7B level open-source LLMs and demonstrate the effectiveness of our method in enhancing this essential ability.