Why LLMs can't really build software
TL;DR Highlight
Zed editor's CEO argues that LLMs lack the core software engineering skill of maintaining mental models, and analyzes the realistic limits and proper use of AI coding tools.
Who Should Read
Developers using or evaluating AI coding tools (Cursor, Claude Code, Copilot, etc.) who need criteria for deciding what to delegate to LLMs vs. handle personally.
Core Mechanics
- The core of software engineering is simultaneously maintaining two mental models — 'what the requirements are' and 'what the code actually does' — and iteratively closing the gap. LLMs can't hold and compare both models at once.
- LLMs generate code well but can't judge whether to fix the code or fix the test when tests fail. When frustrated, they tend to rewrite everything from scratch.
- Breaking work into small units is key — LLMs struggle with large, interconnected changes but handle focused, well-scoped tasks much better.
- TDD (Test-Driven Development) with explicit Red-Green-Refactor stages in prompts helps LLMs stay on track.
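The small-unit delegation pattern above can be sketched as a simple loop in which the human, not the model, holds the mental model: each focused task is generated in isolation and gated on a review against the requirements before the next one starts. This is a minimal Python sketch, not any particular tool's workflow; `generate` and `human_review` are hypothetical stand-ins for your LLM call and your own review step.

```python
# Minimal sketch of small-unit delegation. The human keeps the mental model:
# each task is one focused change, reviewed against requirements before the
# next begins. `generate` and `human_review` are hypothetical callables.
from dataclasses import dataclass, field


@dataclass
class Task:
    description: str
    done: bool = False


def delegate(tasks, generate, human_review):
    """Run tasks one at a time; stop at the first change that fails review."""
    for task in tasks:
        patch = generate(task.description)            # LLM writes one scoped change
        if not human_review(task.description, patch): # human checks it against requirements
            return task                               # surface the mismatch instead of iterating blindly
        task.done = True
    return None                                       # all tasks passed review
```

Returning the failing task (rather than letting the model retry) reflects the article's point: deciding whether the code or the expectation is wrong is the judgment call the LLM cannot make.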
Evidence
- A developer using Cline + Sonnet 3.7 for Rails TDD countered: 'Writing tests first and reviewing in small units works surprisingly well — at least junior engineer level.' But admitted 'not perfect, some bugs it can't solve.'
- A GPT-5 user building a WebGPU/wgpu renderer hit a wall with runtime errors — the model kept making changes without understanding the problem, eventually breaking more than it fixed.
- Consensus: LLMs are powerful code generators but poor software engineers — the gap is in judgment, not generation.
How to Apply
- When delegating complex tasks to LLMs, break them into small units — this avoids mental-model maintenance failures and dramatically reduces wasted iterations.
- For TDD-based LLM usage, specify the Red-Green-Refactor stage in your prompt: 'You are in the GREEN stage — write only the minimal code to pass this test.' This prevents code/test confusion.
- Use LLMs for code generation but maintain the mental model yourself — review generated code against requirements rather than trusting it to understand the system holistically.
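The stage-scoped prompting advice above can be sketched as a small prompt builder that pins the model to exactly one Red-Green-Refactor stage per request. This is an illustrative sketch, not a specific tool's API; the `build_tdd_prompt` helper and the stage wording are assumptions you would adapt to your own client.

```python
# Sketch of stage-scoped TDD prompting: each prompt names exactly one
# Red-Green-Refactor stage and forbids touching the other side of the
# code/test boundary. Stage wording here is illustrative, not canonical.
STAGE_INSTRUCTIONS = {
    "red": "You are in the RED stage: write ONE failing test for the "
           "requirement below. Do not write implementation code.",
    "green": "You are in the GREEN stage: write only the minimal code to "
             "pass the failing test. Do not modify the tests.",
    "refactor": "You are in the REFACTOR stage: tests pass, so improve "
                "structure without changing behavior. Do not modify the tests.",
}


def build_tdd_prompt(stage: str, requirement: str, context: str = "") -> str:
    """Build a prompt that pins the model to a single TDD stage."""
    if stage not in STAGE_INSTRUCTIONS:
        raise ValueError(f"unknown stage: {stage!r}")
    parts = [STAGE_INSTRUCTIONS[stage], f"Requirement:\n{requirement}"]
    if context:
        parts.append(f"Relevant code:\n{context}")
    return "\n\n".join(parts)
```

Keeping the stage rule at the top of every prompt is what prevents the code/test confusion described above: the model never has to decide whether to fix the code or the test, because each stage only permits one of the two.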
Terminology
Mental Model: An internal map of 'how the system works' built in your head. Lets you predict 'if this input comes in, that component processes it like so' without looking at code.
Context Window: The maximum text an LLM can read and reference at once. Like desk space — the wider it is, the more documents you can spread out simultaneously.
TDD: Test-Driven Development. Write the test first, then write code to make it pass, then refactor. Forces small, verifiable increments.