FMBench: Adaptive Large Language Model Output Formatting

Feb 6, 2026•Yaoting Wang, Yun Zhou, Henghui Ding•View PDF

TL;DR Highlight

A benchmark measuring how well LLMs follow Markdown formatting rules, plus SFT then GRPO fine-tuning to improve format compliance.

Who Should Read

Backend/AI engineers running services that parse or render LLM responses as Markdown. Especially relevant for chatbot, document automation, or tool integration pipelines where malformed Markdown breaks downstream processing.

Core Mechanics

Most LLMs have significant Markdown compliance gaps — incorrect heading levels, inconsistent list nesting, and broken table formatting are the most common issues
SFT (supervised fine-tuning) alone improves format compliance but overfits to training Markdown styles
GRPO (group relative policy optimization) after SFT achieves better generalization — the model learns formatting rules rather than memorizing examples
Smaller models (7B) can match larger models (70B) on Markdown compliance after GRPO fine-tuning
Format compliance and content quality are not correlated — a model can be format-perfect but content-poor

Evidence

Baseline GPT-4 Markdown compliance: 71.3% on the benchmark vs. 94.2% after GRPO fine-tuning on a 7B model
SFT-only fine-tuning achieves 87.6% compliance but drops to 79.1% on out-of-distribution Markdown styles
GRPO-tuned 7B model achieves 93.8% compliance, within 0.4%p of the fine-tuned 70B model (94.2%)
Content quality score (human eval) unchanged before and after fine-tuning: 4.1/5 both conditions

How to Apply

Run your current LLM responses through the benchmark's compliance checker to identify which Markdown rules it violates most
If you need a model to reliably produce specific Markdown structures (e.g., always use level-2 headings, never mix list styles), GRPO fine-tuning on format compliance is more effective than prompt engineering alone
For quick wins without fine-tuning, add explicit Markdown format rules to your system prompt with examples of correct and incorrect formatting

Code Example

snippet

# FMBench-style Markdown structure validation example (Python)
import re

def check_markdown_structure(text: str) -> dict:
    issues = []
    
    # Check code fence balance
    code_fences = re.findall(r'^```', text, re.MULTILINE)
    if len(code_fences) % 2 != 0:
        issues.append('Unbalanced code fences')
    
    # Check heading hierarchy consistency
    headings = re.findall(r'^(#{1,6})\s', text, re.MULTILINE)
    levels = [len(h) for h in headings]
    for i in range(1, len(levels)):
        if levels[i] - levels[i-1] > 1:
            issues.append(f'Heading level jump: h{levels[i-1]} -> h{levels[i]}')
    
    # Check list indentation (basic)
    list_items = re.findall(r'^(\s*)[\-\*\+]\s', text, re.MULTILINE)
    indent_levels = [len(s) for s in list_items]
    for i in range(1, len(indent_levels)):
        if indent_levels[i] - indent_levels[i-1] > 2:
            issues.append('Excessive list indent jump')
    
    return {'valid': len(issues) == 0, 'issues': issues}

# Usage example
result = check_markdown_structure(llm_output)
if not result['valid']:
    print('Format issues:', result['issues'])
    # Trigger regeneration or post-processing

Terminology

GRPOGroup Relative Policy Optimization — a reinforcement learning method for fine-tuning LLMs using relative quality comparisons within a group of outputs.

SFTSupervised Fine-Tuning — training a model on labeled examples to produce target outputs.

Markdown complianceHow closely an LLM's output follows standard Markdown syntax rules — correct heading hierarchy, consistent list formatting, valid table structure, etc.

format generalizationThe ability to apply formatting rules correctly to new situations not seen during training.

Related Resources

https://github.com/FudanCVL/FMBench

Original Abstract (Expand)

Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.