Evaluating LLM-Based Test Generation Under Software Evolution

Mar 24, 2026 • Sabaat Haroon, Mohammad Taha Khan, Muhammad Ali Gulzar

TL;DR Highlight

Large-scale study with 8 LLMs and 22,374 program variants: LLM-generated tests degrade sharply after code changes, and over 99% of the tests that fail after semantic-altering changes still pass on the original code

Who Should Read

Developers adopting LLM-based automated test generation in CI/CD; engineers assessing the reliability of AI coding tools

Core Mechanics

Baseline on original code: 79.2% line coverage and 76.1% branch coverage with fully passing test suites; both pass rate and coverage drop once the code changes
After semantic-altering changes (SAC): pass rate 66.5% (▼33.4pp), branch coverage 60.6% (▼15.5pp). Over 99% of the failing tests pass on the original code while executing the modified region, so they stay aligned with the old behavior (residual alignment)
Even after semantic-preserving changes (SPC): pass rate 78.9% (▼21pp), branch coverage about 69%. Syntactic noise alone degrades the generated tests even though behavior is unchanged
Python proved much harder than Java: Gemini 2.5 Flash needed 323 attempts for Java vs 2,155 for Python
Poor regression awareness: test suite turnover reaches 70% under SAC and 82% under SPC, with existing high-quality tests discarded and replaced by lower-quality ones
Most resilient: GPT-5 Mini (SAC 82.9%), Claude 4.6 Sonnet. GPT-5.2 shows memorization overfitting (drops to 56.2% under SAC)
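
The residual-alignment finding can be made concrete with a small check: run a failing test against both the original and the changed code, and flag it when it fails only on the changed version. The sketch below is illustrative and is not the study's framework. The functions `price_original`, `price_changed`, and `stale_test` are hypothetical, and the check omits the study's extra condition that the test must execute the modified region.

```python
def price_original(qty):
    # Hypothetical original behavior: discount applies at 10 or more units.
    return qty * 9 if qty >= 10 else qty * 10


def price_changed(qty):
    # Hypothetical boundary-shift SAC: discount now requires more than 10 units.
    return qty * 9 if qty > 10 else qty * 10


def stale_test(price):
    # A test that still encodes the ORIGINAL expectation at the boundary.
    assert price(10) == 90


def passes(test, impl):
    """Run a test against an implementation and report whether it passed."""
    try:
        test(impl)
        return True
    except AssertionError:
        return False


def is_residually_aligned(test, original, changed):
    """True when the test fails on the changed code but passes on the original."""
    return not passes(test, changed) and passes(test, original)


flagged = is_residually_aligned(stale_test, price_original, price_changed)
```

A test flagged this way describes the old behavior, so it is evidence that the generator did not adapt to the change, not evidence of a bug in the new code.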

Evidence

8 LLMs (GPT-5, GPT-5.2, Claude 4.5 Haiku, Claude 4.6 Sonnet, Gemini 2.5 Flash, Gemini 3.1 Pro, GPT-OSS, Nemotron-3-Nano), 22,374 variants, 346M tokens consumed
5 SAC types (boundary shift, boolean logic, arithmetic, argument swap, variable role) + 9 SPC types (including redundant else, void loop, unused parameter, misleading variable names, Mandarin comments) systematically applied
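
To illustrate the two change families, here is a minimal sketch of one SAC (boundary shift) and one SPC (redundant else) applied to a toy function. It is not the authors' mutation framework: `is_adult`, `BoundaryShift`, and the helper names are assumptions made for the example, and the SPC variant is written by hand rather than generated.

```python
import ast

# Hypothetical function under test; not taken from the study's benchmark.
ORIGINAL = '''
def is_adult(age):
    if age >= 18:
        return True
    return False
'''

# SPC (redundant else): the structure changes, the behavior does not.
SPC_VARIANT = '''
def is_adult(age):
    if age >= 18:
        return True
    else:
        return False
'''


class BoundaryShift(ast.NodeTransformer):
    """SAC (boundary shift): swap inclusive and exclusive comparisons."""

    SWAP = {ast.GtE: ast.Gt, ast.Gt: ast.GtE, ast.LtE: ast.Lt, ast.Lt: ast.LtE}

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [self.SWAP.get(type(op), type(op))() for op in node.ops]
        return node


def apply_boundary_shift(source):
    """Return the source with every comparison boundary shifted."""
    tree = BoundaryShift().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


def load(source, name="is_adult"):
    """Execute the source in a fresh namespace and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def same_behavior(f, g, inputs):
    """Compare two implementations on a finite set of inputs."""
    return all(f(x) == g(x) for x in inputs)


SAC_VARIANT = apply_boundary_shift(ORIGINAL)
ages = range(0, 40)
spc_preserves = same_behavior(load(ORIGINAL), load(SPC_VARIANT), ages)
sac_preserves = same_behavior(load(ORIGINAL), load(SAC_VARIANT), ages)
```

Comparing outputs on a finite input range only approximates semantic equivalence, but it is enough to show the distinction: the SPC variant agrees with the original everywhere, while the SAC variant diverges at the boundary value 18.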

How to Apply

After every code change, regenerate and cross-validate LLM-generated tests; never reuse them as regression tests without re-verification
Include the code diff and intent in the prompt to reduce residual alignment
LLM test generation is less reliable for Python than for Java, so increase manual verification for Python projects
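
One way to put the diff and the intent into the prompt is to build a unified diff with Python's standard `difflib` and embed it next to a one-line description of the change. This is a sketch under assumptions: `build_regeneration_prompt`, its parameters, and the prompt wording are hypothetical, and the study does not prescribe a prompt format.

```python
import difflib


def build_regeneration_prompt(old_source, new_source, intent, path="module.py"):
    """Assemble a test-regeneration prompt that states what changed and why."""
    diff = "".join(
        difflib.unified_diff(
            old_source.splitlines(keepends=True),
            new_source.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
    return (
        "The code under test has changed.\n"
        f"Intent of the change: {intent}\n\n"
        "Unified diff:\n"
        f"{diff}\n"
        "Updated source:\n"
        f"{new_source}\n"
        "Write unit tests for the UPDATED behavior. Keep existing tests that "
        "are still valid and flag any whose expected values change."
    )


# Hypothetical before/after sources for a boundary change.
old = "def qualifies(x):\n    return x >= 10\n"
new = "def qualifies(x):\n    return x > 10\n"
prompt = build_regeneration_prompt(old, new, "Threshold is now exclusive")
```

Tests produced from such a prompt still need to be run against both the old and the new code before they are trusted as regression tests.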

Original Abstract

Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, indicating residual alignment with the original behavior rather than adaptation to updated semantics. Performance also declines under SPCs despite unchanged functionality: pass rates fall to 79% and branch coverage to 69%. Although SPC edits preserve semantics, they often introduce larger syntactic changes, leading to instability in generated test suites. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes rather than true semantic impact. Overall, our results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve.