Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step
TL;DR Highlight
LDB is a framework that automatically finds and fixes bugs in LLM-generated code by tracking intermediate variable values block by block during execution, mimicking a human developer's breakpoint workflow.
Who Should Read
Developers building LLM-based code generation or auto-debugging pipelines, and developers of AI coding tools who want to improve the quality of code generated by GPT-4 or CodeLlama.
Core Mechanics
- Existing methods treat the program as one indivisible chunk and rely only on test-failure messages as feedback; LDB instead tracks actual intermediate variable values block by block during execution for precise bug localization
- Decomposes programs into Basic Blocks based on CFG (Control Flow Graph), showing the LLM variable states before and after each block execution for validation
- When execution traces grow long from loops or recursion, Selective Debugging samples only the first 5 + last 5 blocks to avoid context-length overflow
- Batch Debugging queries all block verification results in a single batch, reducing iterative debugging token costs from O(B²×N) to O(B×N)
- Applicable to high-quality code generated by GPT-4 or Reflexion — catches bugs even strong generators miss (Reflexion + LDB = HumanEval 95.1%)
- ~80% of bugs were semantic errors (syntactically correct code that does not behave as intended); because LDB sees execution information, it is especially good at catching these
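The Selective Debugging idea above (first 5 + last 5 blocks) is easy to sketch. This is a minimal illustration, not the paper's reference implementation; `select_blocks` and the record shape are hypothetical names.

```python
# Hypothetical sketch of LDB-style Selective Debugging: when an execution
# trace has too many blocks to fit the context window, keep only the first
# `head` and last `tail` blocks and mark the gap in between.
def select_blocks(trace, head=5, tail=5):
    """`trace` is a list of per-block records (e.g., dicts with a
    'block' id and variable states); the shape is illustrative."""
    if len(trace) <= head + tail:
        return trace  # short enough: keep the whole trace
    omitted = len(trace) - head - tail
    placeholder = {"block": "...", "note": f"{omitted} blocks omitted"}
    return trace[:head] + [placeholder] + trace[-tail:]
```

For a 20-block trace this yields 11 records: the first five blocks, one placeholder noting the gap, and the last five, which bounds prompt size regardless of loop depth.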
Evidence
- GPT-3.5: HumanEval +9.1%, TransCoder +5.4%, MBPP +8.4% improvement over existing Self-Debugging — consistently better
- CodeLlama-34B: TransCoder +9.8% — highest improvement across all experiments
- Reflexion-generated code + LDB (GPT-3.5) combination: HumanEval 95.1%; GPT-4o backbone: 98.2%
- Bug Localization accuracy: HumanEval 93.7%, MBPP 95.3%, TransCoder 86.7% (GPT-4 auto-verified)
How to Apply
- After LLM code generation, when tests fail: collect before/after variable snapshots for each basic block (e.g., via Python's sys.settrace() or AST analysis), put them in a prompt, and have the LLM return block-by-block pass/fail judgments as JSON — then loop: regenerate, re-test, re-debug
- If already using GPT-4 or Claude for code generation, adding a pipeline that post-processes generated code with an LDB-style GPT-3.5 debugger lowers cost while improving quality
- For code with long loops, use 'first N + last N blocks' sampling instead of full execution trace to construct a debugging prompt without context length overflow
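Consuming the JSON verdicts from the debugging loop above could look like this. A minimal sketch under stated assumptions: `first_wrong_block` and the exact response shape (`block`/`correct`/`explanation` keys) are illustrative, not an interface defined by the paper.

```python
import json

# Hypothetical sketch: parse the LLM's block-by-block JSON verdicts and
# return the earliest block judged incorrect -- the likely bug location
# to highlight in the regeneration prompt.
def first_wrong_block(llm_response: str):
    """`llm_response` is assumed to be a JSON array like
    [{"block": 0, "correct": true, "explanation": "..."}, ...]."""
    verdicts = json.loads(llm_response)
    for v in verdicts:
        if not v.get("correct", True):
            return v["block"], v.get("explanation", "")
    return None  # all blocks judged correct
```

In a real pipeline you would also guard against malformed model output (e.g., wrap `json.loads` in a retry with a reformatting request).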
Code Example
# LDB-style debugging prompt example (Chat mode)
system_prompt = "You are an expert programming assistant."
# Step 1: Code generation
user_msg_1 = """
Complete the following task in Python. Please respond with code only.
def is_sorted(lst):
    '''
    Given a list of numbers, return whether or not they are
    sorted in ascending order. If list has more than 1 duplicate
    of the same number, return False.
    '''
# Step 2: Debugging request with failed test case and block-by-block execution trace
debugging_prompt = """
The code above fails the given unit test:
assert is_sorted([1, 2, 2, 3, 3, 4]) == True # Real Execution Output: False
Here is the code execution trace block by block with intermediate variable values.
For EACH BLOCK, answer whether it is correct or not.
Return a JSON with keys: `block`, `correct`, `explanation`.
[BLOCK-0]
# lst=[1, 2, 2, 3, 3, 4]
for i in range(len(lst) - 1):
    if lst[i] > lst[i + 1]:
# i=0, lst=[1, 2, 2, 3, 3, 4]
[BLOCK-5]
# i=4, lst=[1, 2, 2, 3, 3, 4]
return not any(lst.count(x) > 1 for x in lst)
# i=4, lst=[1, 2, 2, 3, 3, 4], _ret=False
"""
# Step 3: Code regeneration based on debugging results
regeneration_prompt = "Please fix the Python code based on the debugging feedback above."
# Simple example of tracking variable states per basic block in Python
import sys
def trace_blocks(func, test_input):
    """Capture the local variable state at each executed line of `func`."""
    states = []

    def tracer(frame, event, arg):
        if event == 'line':
            states.append({
                'line': frame.f_lineno,
                'locals': dict(frame.f_locals),
            })
        return tracer  # keep tracing lines in this and nested frames

    sys.settrace(tracer)
    try:
        result = func(test_input)
    finally:
        sys.settrace(None)  # always detach the tracer
    return states, result
# Usage example:
# states, result = trace_blocks(is_sorted, [1, 2, 2, 3, 3, 4])
# → Group states by block unit and insert into the LLM prompt
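The grouping step mentioned above can be sketched as follows. This is a simplified illustration: `group_by_block` and the `block_ranges` mapping are hypothetical, and deriving real block boundaries from a CFG is out of scope here.

```python
# Hypothetical sketch: bucket the per-line states captured by a
# sys.settrace-based tracer into basic blocks, given each block's line span.
def group_by_block(states, block_ranges):
    """`states` is a list of {'line': int, 'locals': dict} records;
    `block_ranges` maps a block id to its (first_line, last_line) span.
    In practice the spans would come from a CFG of the target function."""
    blocks = {b: [] for b in block_ranges}
    for s in states:
        for b, (lo, hi) in block_ranges.items():
            if lo <= s["line"] <= hi:
                blocks[b].append(s)
                break  # each state belongs to exactly one block
    return blocks
```

The grouped snapshots can then be rendered block by block (first/last locals per block) into the debugging prompt shown earlier.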
Original Abstract
Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works consider the generated programs as an indivisible entity, which falls short for LLMs in debugging the programs, especially when the programs contain complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce Large Language Model Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with the runtime execution information. Specifically, LDB segments the programs into basic blocks and tracks the values of intermediate variables after each block throughout the runtime execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their correctness against the task description block by block, and efficiently pinpoint any potential errors. Experiments demonstrate that LDB consistently enhances the baseline performance by up to 9.8% across the HumanEval, MBPP, and TransCoder benchmarks, achieving new state-of-the-art performance in code debugging for various LLM selections.