FMBench: Adaptive Large Language Model Output Formatting
TL;DR Highlight
A benchmark measuring how well LLMs follow Markdown formatting rules, plus SFT then GRPO fine-tuning to improve format compliance.
Who Should Read
Backend/AI engineers running services that parse or render LLM responses as Markdown. Especially relevant for chatbot, document automation, or tool integration pipelines where malformed Markdown breaks downstream processing.
Core Mechanics
- Most LLMs have significant Markdown compliance gaps — incorrect heading levels, inconsistent list nesting, and broken table formatting are the most common issues
- SFT (supervised fine-tuning) alone improves format compliance but overfits to training Markdown styles
- GRPO (group relative policy optimization) after SFT achieves better generalization — the model learns formatting rules rather than memorizing examples
- Smaller models (7B) can match larger models (70B) on Markdown compliance after GRPO fine-tuning
- Format compliance and content quality are not correlated — a model can be format-perfect but content-poor
Evidence
- Baseline GPT-4 Markdown compliance: 71.3% on the benchmark vs. 94.2% after GRPO fine-tuning on a 7B model
- SFT-only fine-tuning achieves 87.6% compliance but drops to 79.1% on out-of-distribution Markdown styles
- GRPO-tuned 7B model achieves 93.8% compliance, within 0.4%p of the fine-tuned 70B model (94.2%)
- Content quality score (human eval) unchanged before and after fine-tuning: 4.1/5 both conditions
How to Apply
- Run your current LLM responses through the benchmark's compliance checker to identify which Markdown rules it violates most
- If you need a model to reliably produce specific Markdown structures (e.g., always use level-2 headings, never mix list styles), GRPO fine-tuning on format compliance is more effective than prompt engineering alone
- For quick wins without fine-tuning, add explicit Markdown format rules to your system prompt with examples of correct and incorrect formatting
Code Example
# FMBench-style Markdown structure validation example (Python)
import re
def check_markdown_structure(text: str) -> dict:
issues = []
# Check code fence balance
code_fences = re.findall(r'^```', text, re.MULTILINE)
if len(code_fences) % 2 != 0:
issues.append('Unbalanced code fences')
# Check heading hierarchy consistency
headings = re.findall(r'^(#{1,6})\s', text, re.MULTILINE)
levels = [len(h) for h in headings]
for i in range(1, len(levels)):
if levels[i] - levels[i-1] > 1:
issues.append(f'Heading level jump: h{levels[i-1]} -> h{levels[i]}')
# Check list indentation (basic)
list_items = re.findall(r'^(\s*)[\-\*\+]\s', text, re.MULTILINE)
indent_levels = [len(s) for s in list_items]
for i in range(1, len(indent_levels)):
if indent_levels[i] - indent_levels[i-1] > 2:
issues.append('Excessive list indent jump')
return {'valid': len(issues) == 0, 'issues': issues}
# Usage example
result = check_markdown_structure(llm_output)
if not result['valid']:
print('Format issues:', result['issues'])
# Trigger regeneration or post-processingTerminology
Related Resources
Original Abstract (Expand)
Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.