Show HN: A new benchmark for testing LLMs for deterministic outputs
TL;DR Highlight
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Who Should Read
Backend and ML engineers developing or operating pipelines that extract structured data from documents, images, and audio using LLMs. Particularly useful for developers handling production environments where the accuracy of JSON output impacts downstream systems.
Core Mechanics
- Existing benchmarks (JSONSchemaBench, StructEval, etc.) only verify that a response is parsable JSON that passes the schema, so perfectly formatted but factually wrong JSON earns a perfect score; they do not measure real-world production reliability.
- SOB evaluates across text (HotpotQA 5,000), image (olmOCR-bench 209), and audio (AMI Meeting Corpus 115) modalities using a unified scoring pipeline, reflecting real-world input environments like OCR, screenshots, and meeting transcripts.
- Images and audio recordings are normalized to text before evaluation, isolating pure structured output capability and excluding vision or ASR (speech recognition) performance.
- SOB reports seven metrics separately: Value Accuracy (exact value match), JSON Pass Rate (parsability), Type Safety (type match), Structure Coverage (structure inclusion), Path Recall (required key inclusion), Faithfulness (source grounding), and Perfect Response (complete record match). Value Accuracy is the most critical metric for production.
- Two gates prevent score inflation: a JSON parsing failure zeroes all downstream semantic metrics, and Value Accuracy is scored against the full set of expected fields, so omitted fields count as wrong rather than being ignored.
- Schema difficulty is tagged as easy (1.0), medium (2.0), and hard (3.0) with corresponding weights applied to the final leaderboard, rewarding models that handle complex nested structures well.
- All evaluations run with temperature 0.0, a 2048-token output cap, and reasoning/thinking modes disabled, to reflect pure structured-output/extraction ability.
- Leaderboard highlights: 1st GPT-5.4 (Overall 0.870, Value Acc 0.798), 2nd GLM-4.7 (0.861, 0.804), 3rd Qwen3.5-35B (0.861, 0.801), 4th Gemini-2.5-Flash (0.860, 0.796), 5th Qwen3-235B (0.857, 0.786). Structural metrics (JSON Pass, Path Recall, etc.) are near ceiling across models, with differences arising in Value Accuracy and Perfect Response.
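The two gates and the difficulty weighting described above can be sketched in a few lines. This is a minimal illustration, not SOB's actual implementation; the function names and the flat-dictionary schema are assumptions.

```python
import json

def score_response(raw: str, expected: dict) -> dict:
    """Score one model response against expected field values.
    Gate 1: a JSON parse failure zeroes all downstream metrics."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"json_pass": 0.0, "path_recall": 0.0, "value_accuracy": 0.0}

    # Path Recall: fraction of required keys the model returned.
    returned = set(parsed) & set(expected)
    path_recall = len(returned) / len(expected) if expected else 1.0

    # Gate 2: Value Accuracy divides by ALL expected fields,
    # so an omitted field counts the same as a wrong value.
    correct = sum(1 for k in returned if parsed[k] == expected[k])
    value_accuracy = correct / len(expected) if expected else 1.0

    return {"json_pass": 1.0, "path_recall": path_recall,
            "value_accuracy": value_accuracy}

def weighted_overall(scores: list, difficulties: list) -> float:
    """Leaderboard aggregation with difficulty weights
    (easy=1.0, medium=2.0, hard=3.0)."""
    return sum(s * w for s, w in zip(scores, difficulties)) / sum(difficulties)
```

Note how the weighting rewards models that do well on hard schemas: a hard (3.0) schema moves the overall score three times as much as an easy (1.0) one.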
Evidence
- "Shared experiences highlight the vulnerability of simultaneously requesting 'input parsing' and 'JSON formatting' in a single LLM call. A two-step approach—performing the task first, then wrapping the result in JSON with a separate LLM call—significantly improves quality, especially in agentic state machines requiring HTML/JS/Python code snippets within JSON."
How to Apply
- If building pipelines to extract JSON from invoices, medical records, or meeting transcripts, select models based on the Value Accuracy and Perfect Response columns of the SOB leaderboard. These two metrics more directly reflect production reliability than the overall score.
- For cost-sensitive, high-volume JSON extraction tasks, consider Qwen3.5-35B as an alternative to GPT-5.4. It potentially offers comparable accuracy at a significantly lower cost.
- If encountering frequent errors when simultaneously parsing input and generating JSON with a single LLM call, experiment with a two-step approach: complete the task as free text first, then convert the result to JSON with a separate LLM call.
- To measure the structured output quality of your own LLM pipeline, adapt SOB’s seven-metric framework (JSON Pass → Structure Coverage → Path Recall → Type Safety → Value Accuracy → Faithfulness → Perfect Response) as a hierarchical framework for internal evaluation.
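The two-step approach from the Evidence section can be sketched as below. `call_llm` is a stand-in for whatever chat-completion client you use (an assumption, not a real API); the prompts and keys are illustrative only.

```python
import json

def extract_two_step(document: str, call_llm) -> dict:
    """Hypothetical two-step extraction: perform the task as free
    text first, then convert to JSON in a separate, simpler call."""
    # Step 1: solve the task with no formatting constraints,
    # letting the model focus on getting the content right.
    answer = call_llm(f"Extract the key facts from:\n{document}")
    # Step 2: a second call only reshapes the answer into JSON,
    # a much easier job than parsing and formatting at once.
    wrapped = call_llm(f"Return ONLY a JSON object encoding:\n{answer}")
    return json.loads(wrapped)
```

In production you would add a retry on `json.JSONDecodeError`; the point of the pattern is that each call has a single responsibility, which the Evidence quote reports is especially valuable when the JSON must carry HTML/JS/Python code snippets.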
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically verifying that when LLMs write TLA+ specifications, the specs pass syntax checks well but reach only about 46% behavioral conformance with the real system, exposing the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language. It marks a new advance in interpretability research into what an AI is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model passed 95%+ of tests on only 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
When a task is split into three tickets, Claude/GPT will write code containing security vulnerabilities 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Claude.ai unavailable and elevated errors on the API
Anthropic’s entire service suite (Claude.ai, the API, Claude Code) became inaccessible for 1 hour and 18 minutes (17:34-18:52 UTC), raising reliability concerns among enterprise users.