Show HN: A new benchmark for testing LLMs for deterministic outputs
TL;DR Highlight
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Who Should Read
Backend and ML engineers developing or operating pipelines that extract structured data from documents, images, and audio using LLMs. Particularly useful for developers handling production environments where the accuracy of JSON output impacts downstream systems.
Core Mechanics
- Existing benchmarks (JSONSchemaBench, StructEval, etc.) only verify that a response is parsable JSON that passes the schema, so perfectly formatted but factually wrong JSON earns a perfect score; they do not measure real-world production reliability.
- SOB evaluates across text (HotpotQA 5,000), image (olmOCR-bench 209), and audio (AMI Meeting Corpus 115) modalities using a unified scoring pipeline, reflecting real-world input environments like OCR, screenshots, and meeting transcripts.
- Images and audio recordings are normalized to text before evaluation, isolating pure structured output capability and excluding vision or ASR (speech recognition) performance.
- SOB reports seven metrics separately: Value Accuracy (exact value match), JSON Pass Rate (parsability), Type Safety (type match), Structure Coverage (structure inclusion), Path Recall (required key inclusion), Faithfulness (source grounding), and Perfect Response (complete record match). Value Accuracy is the most critical metric for production.
- Two gates prevent score inflation: a JSON parsing failure zeroes all downstream semantic metrics, and Value Accuracy is scored against the full set of expected fields, so omitted fields count as wrong rather than being ignored.
- Schema difficulty is tagged as easy (1.0), medium (2.0), and hard (3.0) with corresponding weights applied to the final leaderboard, rewarding models that handle complex nested structures well.
- All evaluations run with temperature 0.0, a 2048-token output cap, and reasoning/thinking modes disabled, to reflect pure structured-output/extraction ability.
- Leaderboard highlights: 1st GPT-5.4 (Overall 0.870, Value Acc 0.798), 2nd GLM-4.7 (0.861, 0.804), 3rd Qwen3.5-35B (0.861, 0.801), 4th Gemini-2.5-Flash (0.860, 0.796), 5th Qwen3-235B (0.857, 0.786). Structural metrics (JSON Pass, Path Recall, etc.) are near ceiling across models, with differences arising in Value Accuracy and Perfect Response.
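The two gates and the difficulty weighting described above can be sketched in a few lines. This is a minimal illustration, not SOB's actual implementation; the function names and the flat-dictionary schema are assumptions.

```python
import json

def score_response(raw: str, expected: dict) -> dict:
    """Score one model response against expected field values.
    Gate 1: a JSON parse failure zeroes all downstream metrics."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"json_pass": 0.0, "path_recall": 0.0, "value_accuracy": 0.0}

    # Path Recall: fraction of required keys the model returned.
    returned = set(parsed) & set(expected)
    path_recall = len(returned) / len(expected) if expected else 1.0

    # Gate 2: Value Accuracy divides by ALL expected fields,
    # so an omitted field counts the same as a wrong value.
    correct = sum(1 for k in returned if parsed[k] == expected[k])
    value_accuracy = correct / len(expected) if expected else 1.0

    return {"json_pass": 1.0, "path_recall": path_recall,
            "value_accuracy": value_accuracy}

def weighted_overall(scores: list, difficulties: list) -> float:
    """Leaderboard aggregation with difficulty weights
    (easy=1.0, medium=2.0, hard=3.0)."""
    return sum(s * w for s, w in zip(scores, difficulties)) / sum(difficulties)
```

Note how the weighting rewards models that do well on hard schemas: a hard (3.0) schema moves the overall score three times as much as an easy (1.0) one.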
Evidence
- "Shared experiences highlight the vulnerability of simultaneously requesting 'input parsing' and 'JSON formatting' in a single LLM call. A two-step approach—performing the task first, then wrapping the result in JSON with a separate LLM call—significantly improves quality, especially in agentic state machines requiring HTML/JS/Python code snippets within JSON."
How to Apply
- If building pipelines to extract JSON from invoices, medical records, or meeting transcripts, select models based on the Value Accuracy and Perfect Response columns of the SOB leaderboard. These two metrics more directly reflect production reliability than the overall score.
- For cost-sensitive, high-volume JSON extraction tasks, consider Qwen3.5-35B as an alternative to GPT-5.4. It potentially offers comparable accuracy at a significantly lower cost.
- If encountering frequent errors when simultaneously parsing input and generating JSON with a single LLM call, experiment with a two-step approach: complete the task as free text first, then convert the result to JSON with a separate LLM call.
- To measure the structured output quality of your own LLM pipeline, adapt SOB’s seven-metric framework (JSON Pass → Structure Coverage → Path Recall → Type Safety → Value Accuracy → Faithfulness → Perfect Response) as a hierarchical framework for internal evaluation.
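The two-step approach from the Evidence section can be sketched as below. `call_llm` is a stand-in for whatever chat-completion client you use (an assumption, not a real API); the prompts and keys are illustrative only.

```python
import json

def extract_two_step(document: str, call_llm) -> dict:
    """Hypothetical two-step extraction: perform the task as free
    text first, then convert to JSON in a separate, simpler call."""
    # Step 1: solve the task with no formatting constraints,
    # letting the model focus on getting the content right.
    answer = call_llm(f"Extract the key facts from:\n{document}")
    # Step 2: a second call only reshapes the answer into JSON,
    # a much easier job than parsing and formatting at once.
    wrapped = call_llm(f"Return ONLY a JSON object encoding:\n{answer}")
    return json.loads(wrapped)
```

In production you would add a retry on `json.JSONDecodeError`; the point of the pattern is that each call has a single responsibility, which the Evidence quote reports is especially valuable when the JSON must carry HTML/JS/Python code snippets.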
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically verifying that when LLMs write TLA+ specifications, the specs pass syntax checks well but reach only about 46% behavioral conformance with the real system, exposing the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language. It marks a new advance in interpretability research into what an AI is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model passed 95%+ of tests on only 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
When a task is split into three tickets, Claude/GPT will write code containing security vulnerabilities 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Claude.ai unavailable and elevated errors on the API
Anthropic’s entire service suite (Claude.ai, the API, Claude Code) became inaccessible for 1 hour and 18 minutes (17:34-18:52 UTC), raising reliability concerns among enterprise users.