Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step
TL;DR Highlight
LDB is a framework that automatically finds and fixes bugs in LLM-generated code by tracking intermediate variable values block by block during execution, mimicking a human developer's breakpoint workflow.
Who Should Read
Developers building LLM-based code generation or auto-debugging pipelines, and developers of AI coding tools who want to improve the quality of code generated by GPT-4 or CodeLlama.
Core Mechanics
- Existing methods treat the program as one indivisible chunk and rely only on test-failure messages as feedback; LDB instead tracks actual intermediate variable values block by block during execution for precise bug localization
- Decomposes programs into Basic Blocks based on CFG (Control Flow Graph), showing the LLM variable states before and after each block execution for validation
- When execution traces grow long from loops or recursion, Selective Debugging samples only the first 5 + last 5 blocks to avoid context-length overflow
- Batch Debugging queries all block verification results in a single batch, reducing iterative debugging token costs from O(B²×N) to O(B×N)
- Applicable to high-quality code generated by GPT-4 or Reflexion — catches bugs even strong generators miss (Reflexion + LDB = HumanEval 95.1%)
- ~80% of bugs were semantic errors (syntactically correct code that does not behave as intended); because LDB sees execution information, it is especially good at catching these
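The Selective Debugging idea above (first 5 + last 5 blocks) is easy to sketch. This is a minimal illustration, not the paper's reference implementation; `select_blocks` and the record shape are hypothetical names.

```python
# Hypothetical sketch of LDB-style Selective Debugging: when an execution
# trace has too many blocks to fit the context window, keep only the first
# `head` and last `tail` blocks and mark the gap in between.
def select_blocks(trace, head=5, tail=5):
    """`trace` is a list of per-block records (e.g., dicts with a
    'block' id and variable states); the shape is illustrative."""
    if len(trace) <= head + tail:
        return trace  # short enough: keep the whole trace
    omitted = len(trace) - head - tail
    placeholder = {"block": "...", "note": f"{omitted} blocks omitted"}
    return trace[:head] + [placeholder] + trace[-tail:]
```

For a 20-block trace this yields 11 records: the first five blocks, one placeholder noting the gap, and the last five, which bounds prompt size regardless of loop depth.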
Evidence
- GPT-3.5: HumanEval +9.1%, TransCoder +5.4%, MBPP +8.4% improvement over existing Self-Debugging — consistently better
- CodeLlama-34B: TransCoder +9.8% — highest improvement across all experiments
- Reflexion-generated code + LDB (GPT-3.5) combination: HumanEval 95.1%; GPT-4o backbone: 98.2%
- Bug Localization accuracy: HumanEval 93.7%, MBPP 95.3%, TransCoder 86.7% (GPT-4 auto-verified)
How to Apply
- After LLM code generation, when tests fail: collect before/after variable snapshots for each basic block (e.g., via Python's sys.settrace() or AST analysis), put them in a prompt, and have the LLM return block-by-block pass/fail judgments as JSON — then loop: regenerate, re-test, re-debug
- If already using GPT-4 or Claude for code generation, adding a pipeline that post-processes generated code with an LDB-style GPT-3.5 debugger lowers cost while improving quality
- For code with long loops, use 'first N + last N blocks' sampling instead of full execution trace to construct a debugging prompt without context length overflow
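Consuming the JSON verdicts from the debugging loop above could look like this. A minimal sketch under stated assumptions: `first_wrong_block` and the exact response shape (`block`/`correct`/`explanation` keys) are illustrative, not an interface defined by the paper.

```python
import json

# Hypothetical sketch: parse the LLM's block-by-block JSON verdicts and
# return the earliest block judged incorrect -- the likely bug location
# to highlight in the regeneration prompt.
def first_wrong_block(llm_response: str):
    """`llm_response` is assumed to be a JSON array like
    [{"block": 0, "correct": true, "explanation": "..."}, ...]."""
    verdicts = json.loads(llm_response)
    for v in verdicts:
        if not v.get("correct", True):
            return v["block"], v.get("explanation", "")
    return None  # all blocks judged correct
```

In a real pipeline you would also guard against malformed model output (e.g., wrap `json.loads` in a retry with a reformatting request).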
Code Example
# LDB-style debugging prompt example (Chat mode)
system_prompt = "You are an expert programming assistant."
# Step 1: Code generation
user_msg_1 = """
Complete the following task in Python. Please respond with code only.
def is_sorted(lst):
    '''
    Given a list of numbers, return whether or not they are
    sorted in ascending order. If list has more than 1 duplicate
    of the same number, return False.
    '''
# Step 2: Debugging request with failed test case and block-by-block execution trace
debugging_prompt = """
The code above fails the given unit test:
assert is_sorted([1, 2, 2, 3, 3, 4]) == True # Real Execution Output: False
Here is the code execution trace block by block with intermediate variable values.
For EACH BLOCK, answer whether it is correct or not.
Return a JSON with keys: `block`, `correct`, `explanation`.
[BLOCK-0]
# lst=[1, 2, 2, 3, 3, 4]
for i in range(len(lst) - 1):
    if lst[i] > lst[i + 1]:
# i=0, lst=[1, 2, 2, 3, 3, 4]
[BLOCK-5]
# i=4, lst=[1, 2, 2, 3, 3, 4]
return not any(lst.count(x) > 1 for x in lst)
# i=4, lst=[1, 2, 2, 3, 3, 4], _ret=False
"""
# Step 3: Code regeneration based on debugging results
regeneration_prompt = "Please fix the Python code based on the debugging feedback above."
# Simple example of tracking variable states per basic block in Python
import sys
def trace_blocks(func, test_input):
    """Capture the local variable state at each executed line of `func`."""
    states = []

    def tracer(frame, event, arg):
        if event == 'line':
            states.append({
                'line': frame.f_lineno,
                'locals': dict(frame.f_locals),
            })
        return tracer  # keep tracing lines in this and nested frames

    sys.settrace(tracer)
    try:
        result = func(test_input)
    finally:
        sys.settrace(None)  # always detach the tracer
    return states, result
# Usage example:
# states, result = trace_blocks(is_sorted, [1, 2, 2, 3, 3, 4])
# → Group states by block unit and insert into the LLM prompt
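The grouping step mentioned above can be sketched as follows. This is a simplified illustration: `group_by_block` and the `block_ranges` mapping are hypothetical, and deriving real block boundaries from a CFG is out of scope here.

```python
# Hypothetical sketch: bucket the per-line states captured by a
# sys.settrace-based tracer into basic blocks, given each block's line span.
def group_by_block(states, block_ranges):
    """`states` is a list of {'line': int, 'locals': dict} records;
    `block_ranges` maps a block id to its (first_line, last_line) span.
    In practice the spans would come from a CFG of the target function."""
    blocks = {b: [] for b in block_ranges}
    for s in states:
        for b, (lo, hi) in block_ranges.items():
            if lo <= s["line"] <= hi:
                blocks[b].append(s)
                break  # each state belongs to exactly one block
    return blocks
```

The grouped snapshots can then be rendered block by block (first/last locals per block) into the debugging prompt shown earlier.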
Original Abstract
Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works consider the generated programs as an indivisible entity, which falls short for LLMs in debugging the programs, especially when the programs contain complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce Large Language Model Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with the runtime execution information. Specifically, LDB segments the programs into basic blocks and tracks the values of intermediate variables after each block throughout the runtime execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their correctness against the task description block by block, and efficiently pinpoint any potential errors. Experiments demonstrate that LDB consistently enhances the baseline performance by up to 9.8% across the HumanEval, MBPP, and TransCoder benchmarks, achieving new state-of-the-art performance in code debugging for various LLM selections.