Diagnosing CFG Interpretation in LLMs
TL;DR Highlight
LLMs frequently lose semantic meaning despite syntactically correct output when exposed to novel grammar rules.
Who Should Read
Backend/ML engineers curious about how reliably LLMs can handle structured outputs like JSON schemas, function signatures, and DSLs. Specifically, developers concerned with the stability of tool-calling or code generation pipelines.
Core Mechanics
- LLM performance degrades hierarchically: Syntax → Behavior → Semantics. While LLMs can somewhat follow grammar rules, they frequently fail to maintain logical consistency.
- Performance plummets non-linearly with increasing nesting depth. With Mimo-V2-flash, semantic accuracy (SCR) dropped from 39% at depth=2 to 0.5% at depth=20.
- Model size doesn't guarantee better performance. GPT-5-mini outperformed the larger GPT-5.2 (90% vs. 60% BER) on Goal-Conditioned Generation Tasks.
- Using 'Alien' lexicons (random tokens like v_xkqm instead of meaningful keywords like loop) drastically reduces performance, indicating LLMs rely on keyword semantics rather than pure grammar inference.
- Chain-of-Thought (CoT) reasoning is essential but not a panacea. Without CoT, SVR falls below 10% even at depth=2, but even with CoT, SCR collapses to 0.5% at depth=20.
- Increasing few-shot examples doesn't help with deep recursion and can even be detrimental. 1-shot performance is often worse than 0-shot, and 5-shot frequently fails to surpass 0-shot.
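Since the degradation above is driven by nesting depth, it helps to measure depth on generated programs. Below is a minimal sketch that counts brace nesting in a whitespace-tokenized program using the `{`/`}` terminals from the ROBOGRID-style grammar; the function name and tokenization are illustrative, not from the paper.

```python
def max_nesting_depth(program: str, open_tok: str = "{", close_tok: str = "}") -> int:
    """Return the maximum brace-nesting depth of a whitespace-tokenized program."""
    depth = max_depth = 0
    for tok in program.split():
        if tok == open_tok:
            depth += 1
            max_depth = max(max_depth, depth)
        elif tok == close_tok:
            depth -= 1
    return max_depth

# A loop containing a conditional: nesting depth 2
print(max_nesting_depth("repeat 3 times { when holding key then { exec move stop } }"))  # 2
```

A guard like this in a generation pipeline can reject or re-plan outputs whose depth exceeds a budget (e.g. 5), before semantics are ever checked.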
Evidence
- DeepSeek-V3.2, the strongest open-source model, achieved only 39.5% Instruction-to-Code SCR with an 'Alien' lexicon and depth=10. Qwen3-8B scored 0% SCR.
- GPT-5-mini exhibited 65% SVR and 9.23% CSCR (the proportion of syntactically and semantically correct outputs) on Task 3, indicating poor semantic alignment.
- Recursion depth ablation: SVR dropped from 42% at depth=2 to 10% at depth=20, and SCR from 39% to 0.5% (Mimo-V2-flash).
- Adding an else branch to every if statement (raising Else-branch probability to 1.0) reduced SVR from 21.5% to 13.0%. Switching to S-expr style halved SCR from 9% to 4.5%.
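The metrics quoted above can be related in a small sketch. The names (SVR, SCR, CSCR) follow this summary's usage, assuming SVR and SCR are per-sample pass rates and CSCR requires both to hold; the paper's exact definitions may differ.

```python
def rates(samples):
    """samples: list of (syntactically_valid, semantically_correct) booleans."""
    n = len(samples)
    svr = sum(s for s, _ in samples) / n          # Syntactic Validity Rate
    scr = sum(c for _, c in samples) / n          # Semantic Correctness Rate
    cscr = sum(s and c for s, c in samples) / n   # both criteria at once
    return svr, scr, cscr

print(rates([(True, True), (True, False), (False, False), (True, True)]))
# (0.75, 0.5, 0.5)
```

Note that CSCR is bounded above by both SVR and SCR, which is why a model can post a decent SVR (65%) while CSCR stays in single digits (9.23%).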
How to Apply
- When designing LLM interactions with dynamically defined JSON schemas or function signatures, minimize nesting depth. Structural error probability increases non-linearly with depth, making multi-stage calls preferable to complex schemas.
- Always enable Chain-of-Thought (CoT) when prompting LLM tool-calling agents with custom DSLs or new grammars. Without CoT, SVR drops below 10% even at shallow depths; include 'think step by step' instructions for structured output requests.
- When defining keywords for new grammars, prioritize words the model already knows (loop, if, return). Using entirely new tokens can lead to performance degradation as the model fails to infer keyword semantics (Natural vs Alien: SVR 24.5% vs 21.5%, BER 21.5% vs 17.5%).
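To enforce the first recommendation, the nesting depth of a tool schema can be checked before it is handed to the model. The sketch below uses a hypothetical helper, `schema_depth`, that walks a JSON-Schema-like dict via its `properties` and `items` keys; real schemas have more nesting keywords (`anyOf`, `$ref`, etc.), so this is a simplification.

```python
def schema_depth(schema, depth=1):
    """Max nesting depth of a JSON-Schema-like dict (simplified: properties/items only)."""
    if not isinstance(schema, dict):
        return depth
    children = [schema_depth(v, depth + 1)
                for v in schema.get("properties", {}).values()]
    if "items" in schema:
        children.append(schema_depth(schema["items"], depth + 1))
    return max(children, default=depth)

nested = {"type": "object", "properties": {
    "a": {"type": "object", "properties": {"b": {"type": "string"}}}}}
print(schema_depth(nested))  # 3
```

If the reported depth exceeds a small budget, prefer splitting the interaction into multiple shallow tool calls over one deeply nested schema.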
Code Example
# Prompt pattern for making LLMs follow a new grammar in ROBOGRID style
# CoT activation + explicit EBNF provision are key
prompt = """
You are a strict code generator. Think step by step before generating code.
You are given a NEW programming language definition (EBNF):
start: stmt+
stmt: action_stmt | loop | if_stmt
action_stmt: DO action END
loop: LOOP INT TIMES LBR stmt+ RBR
if_stmt: IF cond THEN LBR stmt+ RBR (ELSE LBR stmt+ RBR)?
action: MOVE MOVE_DIR INT? | TURN TURN_DIR | GRAB ITEM
cond: HOLDING ITEM
Terminal mappings:
DO: "exec" END: "stop"
LOOP: "repeat" TIMES: "times"
IF: "when" THEN: "then"
LBR: "{" RBR: "}"
Instructions:
{natural_language_instruction}
First, break down the instruction into a tree structure (AST).
Then, translate each node into the grammar terminals above.
Finally, output the complete valid program.
"""
# Key design principles:
# 1. Explicitly state EBNF at the beginning of the prompt
# 2. Induce CoT ('Think step by step' or step-by-step instructions)
# 3. Limit nesting depth to a maximum of 5 (depth>10 results in almost 0 SCR)
# 4. Keep keywords close to natural language (avoid 'Alien' tokens)
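Model output produced with a prompt like the one above should still be validated before execution, since SVR is far from 100%. Below is a simplified recursive-descent checker for the toy grammar sketched in the prompt; it omits the else branch (whose terminal mapping is not given above) and accepts any run of non-keyword tokens as an action or condition, so it is a syntax-level (SVR-style) gate only, not the paper's evaluator.

```python
def valid_program(tokens):
    """Syntax check for the toy grammar: exec/stop, repeat/times, when/then, braces."""
    KEYWORDS = {"exec", "stop", "repeat", "times", "when", "then", "{", "}"}
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def words():  # one or more non-keyword tokens (action / cond body)
        if peek() is None or peek() in KEYWORDS:
            raise SyntaxError("expected action/condition tokens")
        while peek() is not None and peek() not in KEYWORDS:
            eat()

    def block():  # "{" stmt+ "}"
        eat("{")
        stmt()
        while peek() not in (None, "}"):
            stmt()
        eat("}")

    def stmt():
        tok = peek()
        if tok == "exec":
            eat("exec"); words(); eat("stop")
        elif tok == "repeat":
            eat("repeat")
            if not eat().isdigit():
                raise SyntaxError("expected INT after 'repeat'")
            eat("times"); block()
        elif tok == "when":
            eat("when"); words(); eat("then"); block()
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    try:
        stmt()
        while pos < len(tokens):
            stmt()
        return True
    except SyntaxError:
        return False

print(valid_program("repeat 2 times { exec move up stop }".split()))  # True
print(valid_program("repeat 2 times { exec move up }".split()))       # False
```

A gate like this catches syntax failures cheaply; semantic correctness (SCR) still requires executing or simulating the program against the instruction's goal.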
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study that systematically verifies that while LLMs' TLA+ specifications pass syntax checks well, their behavioral conformance with the real system reaches only about 46%, illustrating the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic published NLA, a technique that converts an LLM's internal numeric vectors (activations) into directly readable natural language, a new advance in interpretability research into what the model is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and the PHP interpreter from scratch using only documentation; even the best model passes 95% or more of the tests on only 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
If a request is split across three tickets, even Claude/GPT will simply write code containing security vulnerabilities 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Original Abstract
As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.