Diagnosing CFG Interpretation in LLMs
TL;DR Highlight
LLMs frequently lose semantic meaning despite syntactically correct output when exposed to novel grammar rules.
Who Should Read
Backend/ML engineers curious about how reliably LLMs can handle structured outputs like JSON schemas, function signatures, and DSLs. Specifically, developers concerned with the stability of tool-calling or code generation pipelines.
Core Mechanics
- LLM performance degrades hierarchically: Syntax → Behavior → Semantics. While LLMs can somewhat follow grammar rules, they frequently fail to maintain logical consistency.
- Performance plummets non-linearly with increasing nesting depth. With Mimo-V2-flash, semantic accuracy (SCR) dropped from 39% at depth=2 to 0.5% at depth=20.
- Model size doesn't guarantee better performance. GPT-5-mini outperformed the larger GPT-5.2 (90% vs. 60% BER) on Goal-Conditioned Generation Tasks.
- Using 'Alien' lexicons (random tokens like v_xkqm instead of meaningful keywords like loop) drastically reduces performance, indicating LLMs rely on keyword semantics rather than pure grammar inference.
- Chain-of-Thought (CoT) reasoning is essential but not a panacea. Without CoT, SVR falls below 10% even at depth=2, but even with CoT, SCR collapses to 0.5% at depth=20.
- Increasing few-shot examples doesn't help with deep recursion and can even be detrimental. 1-shot performance is often worse than 0-shot, and 5-shot frequently fails to surpass 0-shot.
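Since the degradation above is driven by nesting depth, it helps to measure depth on generated programs. Below is a minimal sketch that counts brace nesting in a whitespace-tokenized program using the `{`/`}` terminals from the ROBOGRID-style grammar; the function name and tokenization are illustrative, not from the paper.

```python
def max_nesting_depth(program: str, open_tok: str = "{", close_tok: str = "}") -> int:
    """Return the maximum brace-nesting depth of a whitespace-tokenized program."""
    depth = max_depth = 0
    for tok in program.split():
        if tok == open_tok:
            depth += 1
            max_depth = max(max_depth, depth)
        elif tok == close_tok:
            depth -= 1
    return max_depth

# A loop containing a conditional: nesting depth 2
print(max_nesting_depth("repeat 3 times { when holding key then { exec move stop } }"))  # 2
```

A guard like this in a generation pipeline can reject or re-plan outputs whose depth exceeds a budget (e.g. 5), before semantics are ever checked.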
Evidence
- DeepSeek-V3.2, the strongest open-source model, achieved only 39.5% Instruction-to-Code SCR with an 'Alien' lexicon and depth=10. Qwen3-8B scored 0% SCR.
- GPT-5-mini exhibited 65% SVR and 9.23% CSCR (the proportion of syntactically and semantically correct outputs) on Task 3, indicating poor semantic alignment.
- Recursion depth ablation: SVR dropped from 42% at depth=2 to 10% at depth=20, and SCR from 39% to 0.5% (Mimo-V2-flash).
- Adding an else branch to every if statement (raising Else-branch probability to 1.0) reduced SVR from 21.5% to 13.0%. Switching to S-expr style halved SCR from 9% to 4.5%.
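The metrics quoted above can be related in a small sketch. The names (SVR, SCR, CSCR) follow this summary's usage, assuming SVR and SCR are per-sample pass rates and CSCR requires both to hold; the paper's exact definitions may differ.

```python
def rates(samples):
    """samples: list of (syntactically_valid, semantically_correct) booleans."""
    n = len(samples)
    svr = sum(s for s, _ in samples) / n          # Syntactic Validity Rate
    scr = sum(c for _, c in samples) / n          # Semantic Correctness Rate
    cscr = sum(s and c for s, c in samples) / n   # both criteria at once
    return svr, scr, cscr

print(rates([(True, True), (True, False), (False, False), (True, True)]))
# (0.75, 0.5, 0.5)
```

Note that CSCR is bounded above by both SVR and SCR, which is why a model can post a decent SVR (65%) while CSCR stays in single digits (9.23%).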
How to Apply
- When designing LLM interactions with dynamically defined JSON schemas or function signatures, minimize nesting depth. Structural error probability increases non-linearly with depth, making multi-stage calls preferable to complex schemas.
- Always enable Chain-of-Thought (CoT) when prompting LLM tool-calling agents with custom DSLs or new grammars. Without CoT, SVR drops below 10% even at shallow depths; include 'think step by step' instructions for structured output requests.
- When defining keywords for new grammars, prioritize words the model already knows (loop, if, return). Using entirely new tokens can lead to performance degradation as the model fails to infer keyword semantics (Natural vs Alien: SVR 24.5% vs 21.5%, BER 21.5% vs 17.5%).
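To enforce the first recommendation, the nesting depth of a tool schema can be checked before it is handed to the model. The sketch below uses a hypothetical helper, `schema_depth`, that walks a JSON-Schema-like dict via its `properties` and `items` keys; real schemas have more nesting keywords (`anyOf`, `$ref`, etc.), so this is a simplification.

```python
def schema_depth(schema, depth=1):
    """Max nesting depth of a JSON-Schema-like dict (simplified: properties/items only)."""
    if not isinstance(schema, dict):
        return depth
    children = [schema_depth(v, depth + 1)
                for v in schema.get("properties", {}).values()]
    if "items" in schema:
        children.append(schema_depth(schema["items"], depth + 1))
    return max(children, default=depth)

nested = {"type": "object", "properties": {
    "a": {"type": "object", "properties": {"b": {"type": "string"}}}}}
print(schema_depth(nested))  # 3
```

If the reported depth exceeds a small budget, prefer splitting the interaction into multiple shallow tool calls over one deeply nested schema.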
Code Example
# Prompt pattern for making LLMs follow a new grammar in ROBOGRID style
# CoT activation + explicit EBNF provision are key
prompt = """
You are a strict code generator. Think step by step before generating code.
You are given a NEW programming language definition (EBNF):
start: stmt+
stmt: action_stmt | loop | if_stmt
action_stmt: DO action END
loop: LOOP INT TIMES LBR stmt+ RBR
if_stmt: IF cond THEN LBR stmt+ RBR (ELSE LBR stmt+ RBR)?
action: MOVE MOVE_DIR INT? | TURN TURN_DIR | GRAB ITEM
cond: HOLDING ITEM
Terminal mappings:
DO: "exec" END: "stop"
LOOP: "repeat" TIMES: "times"
IF: "when" THEN: "then"
LBR: "{" RBR: "}"
Instructions:
{natural_language_instruction}
First, break down the instruction into a tree structure (AST).
Then, translate each node into the grammar terminals above.
Finally, output the complete valid program.
"""
# Key design principles:
# 1. Explicitly state EBNF at the beginning of the prompt
# 2. Induce CoT ('Think step by step' or step-by-step instructions)
# 3. Limit nesting depth to a maximum of 5 (depth>10 results in almost 0 SCR)
# 4. Keep keywords close to natural language (avoid 'Alien' tokens)
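Model output produced with a prompt like the one above should still be validated before execution, since SVR is far from 100%. Below is a simplified recursive-descent checker for the toy grammar sketched in the prompt; it omits the else branch (whose terminal mapping is not given above) and accepts any run of non-keyword tokens as an action or condition, so it is a syntax-level (SVR-style) gate only, not the paper's evaluator.

```python
def valid_program(tokens):
    """Syntax check for the toy grammar: exec/stop, repeat/times, when/then, braces."""
    KEYWORDS = {"exec", "stop", "repeat", "times", "when", "then", "{", "}"}
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def words():  # one or more non-keyword tokens (action / cond body)
        if peek() is None or peek() in KEYWORDS:
            raise SyntaxError("expected action/condition tokens")
        while peek() is not None and peek() not in KEYWORDS:
            eat()

    def block():  # "{" stmt+ "}"
        eat("{")
        stmt()
        while peek() not in (None, "}"):
            stmt()
        eat("}")

    def stmt():
        tok = peek()
        if tok == "exec":
            eat("exec"); words(); eat("stop")
        elif tok == "repeat":
            eat("repeat")
            if not eat().isdigit():
                raise SyntaxError("expected INT after 'repeat'")
            eat("times"); block()
        elif tok == "when":
            eat("when"); words(); eat("then"); block()
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    try:
        stmt()
        while pos < len(tokens):
            stmt()
        return True
    except SyntaxError:
        return False

print(valid_program("repeat 2 times { exec move up stop }".split()))  # True
print(valid_program("repeat 2 times { exec move up }".split()))       # False
```

A gate like this catches syntax failures cheaply; semantic correctness (SCR) still requires executing or simulating the program against the instruction's goal.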
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study that systematically verifies that while LLMs' TLA+ specifications pass syntax checks well, their behavioral conformance with the real system reaches only about 46%, illustrating the practical limits of AI-driven formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic published NLA, a technique that converts an LLM's internal numeric vectors (activations) into directly readable natural language, a new advance in interpretability research into what the model is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and the PHP interpreter from scratch using only documentation; even the best model passes 95% or more of the tests on only 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
If a request is split across three tickets, even Claude/GPT will simply write code containing security vulnerabilities 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Original Abstract
As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.