MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization
TL;DR Highlight
When an LLM automatically optimizes CUDA kernels and C++ code, MaxCode converts execution feedback into natural language critiques and uses max-reward RL to guide the search, achieving up to 27% additional speedup over existing methods.
Who Should Read
ML infrastructure/systems engineers looking to automate CUDA kernel optimization for PyTorch models or high-performance C++ code improvement. AI engineers exploring how to attach execution feedback loops to LLM-based code generation agents.
Core Mechanics
- Reframes code optimization as 'max-reward RL' (RL that targets the highest reward ever achieved rather than the cumulative reward), always keeping the best-performing code found so far in context
- Instead of providing only numeric execution results, a separate critique model (Claude-3.7-Sonnet) generates natural language diagnoses such as 'memory bandwidth bottleneck' or 'incorrect operation ordering' and passes them to the LLM
- A reward-to-go model (fine-tuned Qwen2.5-7B-Instruct) predicts the 'expected future maximum speedup' along a search path, enabling selection of promising candidates without running them
- Plugging existing search methods (CUDA-LLM, Effi-Learner) into this framework yields immediate performance gains without retraining — operates as a plug-in
- The same operation implemented as a 'sequential sub-kernel chain' vs. a 'fully fused kernel' can run at nearly identical speeds, so simple fast/slow feedback alone cannot indicate which optimization direction to pursue — this is the gap the critique model fills
- Confirmed scaling effect: as inference budget (search depth) increases, MaxCode improves performance faster than existing methods
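The max-reward objective in the first bullet can be illustrated with a minimal sketch (not from the paper's code): the return along a search trajectory is the running maximum of per-step speedups, not their sum.

```python
# Illustrative contrast between the usual cumulative-return objective and the
# max-reward objective MaxCode optimizes. Names and numbers are hypothetical.

def cumulative_return(rewards):
    # Standard RL: total reward accumulated along the trajectory.
    return sum(rewards)

def max_reward_return(rewards):
    # Max-reward RL: only the best speedup ever observed matters,
    # since we ship the single fastest kernel found, not the sum of attempts.
    return max(rewards)

# Hypothetical per-step speedups from a search trajectory
speedups = [1.0, 2.4, 1.1, 3.17, 0.9]
print(max_reward_return(speedups))  # the best kernel found so far: 3.17
```

A regression at a later step (0.9x above) lowers the cumulative return but leaves the max-reward return untouched, which is why keeping the best-so-far code in context is the natural policy input for this objective.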
Evidence
- CUDA-LLM + MaxCode combination improves KernelBench Level 1 from 2.49x → 3.17x (27.3% relative improvement) and Level 2 from 1.45x → 1.61x (11.0% improvement)
- On the PIE (C++ optimization) benchmark, CUDA-LLM + MaxCode improves from 1.42x → 1.74x (22.5% relative improvement), with average ranking also improving from 2.05 → 1.74
- Full MaxCode (Traj Critique Best Perf) consistently achieves higher max speedup than individual components (Critique-only or Best Perf-only) — the combination effect is key
- The reward-to-go model shows ranking improvements on KernelBench L2 and PIE (1.57→1.33, 1.55→1.43), though performance degrades on L1 due to distribution mismatch
How to Apply
- Add a critique step to existing LLM code optimization loops: instead of appending raw execution results directly to the prompt, first pass them to a separate LLM (Claude-3.7-Sonnet with extended thinking enabled) asking it to 'diagnose the bottlenecks in this code and suggest improvements', then inject that natural language critique into the main prompt
- Always include 'the fastest version of the code found so far + its feedback' in the prompt during iterative optimization loops: this lets the LLM understand how the current attempt compares to the best known result, reducing meaningless regressions
- When GPU execution costs are prohibitive due to a large number of search candidates, fine-tune a small model like Qwen2.5-7B with LoRA as a reward predictor — use it solely to pre-filter low-potential candidates before actual execution
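The pre-filtering step in the last bullet can be sketched as follows. This is a minimal illustration, assuming a `predict_reward_to_go` callable that stands in for the LoRA-fine-tuned Qwen2.5-7B reward predictor; the function name and `keep` parameter are hypothetical.

```python
# Hypothetical sketch: rank candidates by predicted reward-to-go and keep only
# the top few for actual GPU compilation/benchmarking.

def prefilter(candidates, predict_reward_to_go, keep=4):
    # Score every candidate with the cheap learned predictor (no GPU needed),
    # then keep only the most promising ones for expensive real execution.
    scored = sorted(candidates, key=predict_reward_to_go, reverse=True)
    return scored[:keep]
```

Used this way, the predictor never replaces measurement; it only prunes low-potential candidates so the GPU budget is spent on the survivors.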
Code Example
# MaxCode core prompt structure (Traj Critique Best Perf version)
GENERATOR_PROMPT = """
You write custom CUDA kernels to replace the pytorch operators in the given architecture to get speedups.
You are provided with:
1. The pytorch architecture to optimize
2. Your BEST-PERFORMING optimization so far and its execution feedback
3. Your TRAJECTORY of previous attempts with execution feedback
4. NATURAL LANGUAGE CRITIQUES for each attempt
Given this information, refine your optimization:
- If compiled=False: fix compilation errors (refer to best-performing solution for cues)
- If correctness=False: fix logic errors
- If correct: reduce runtime below the best-performing solution so far
IMPLEMENT CUDA OPERATORS using:
from torch.utils.cpp_extension import load_inline
"""
CRITIQUE_PROMPT = """
Given the optimization attempt and execution feedback:
1. Diagnose: What are the performance bottlenecks? (memory bandwidth? compute utilization? algorithmic inefficiency?)
2. Suggest: Provide actionable steps to fix or improve performance
- If compile error: explain why it fails and how to fix
- If correct but slow: identify specific bottleneck and optimization strategy
"""
# Usage example (pseudo-code)
def maxcode_loop(initial_code, max_depth=8, k_candidates=8):
    best_code, best_speedup = initial_code, 1.0
    trajectory = []
    for depth in range(max_depth):
        # 1. Generate k candidates
        candidates = [llm.generate(GENERATOR_PROMPT,
                                   arch=initial_code,
                                   best=(best_code, best_speedup),
                                   trajectory=trajectory)
                      for _ in range(k_candidates)]
        # 2. Execute and measure speedup
        results = [execute_and_measure(c) for c in candidates]
        # 3. Generate critiques
        critiques = [llm.generate(CRITIQUE_PROMPT, code=c, feedback=r)
                     for c, r in zip(candidates, results)]
        # 4. Select the fastest candidate; update best if it improves
        best_idx = max(range(k_candidates), key=lambda i: results[i].speedup)
        if results[best_idx].speedup > best_speedup:
            best_code = candidates[best_idx]
            best_speedup = results[best_idx].speedup
        trajectory.append((candidates[best_idx], results[best_idx], critiques[best_idx]))
    return best_code, best_speedup
Original Abstract
Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) the complexity of writing optimized code (such as performant CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms and specific languages and (ii) requires interpretation of performance metrics like timing and device utilization beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we integrate a natural language critique model that converts raw execution feedback into diagnostic insights about errors and performance bottlenecks, and the best-discounted reward seen so far. Together, these provide richer input to the code proposal function. To improve exploration during search, we train a generative reward-to-go model using action values from rollouts to rerank potential solutions. Testing on the KernelBench (CUDA) and PIE (C++) optimization benchmarks shows that MaxCode improves optimized code performance compared to baselines, achieving 20.3% and 10.1% relative improvements in absolute speedup value and relative speedup ranking, respectively.