MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization
TL;DR Highlight
When an LLM automatically optimizes CUDA kernels and C++ code, MaxCode converts execution feedback into natural language critiques and uses max-reward RL to guide the search, achieving up to 27% additional speedup over existing methods.
Who Should Read
ML infrastructure/systems engineers looking to automate CUDA kernel optimization for PyTorch models or high-performance C++ code improvement. AI engineers exploring how to attach execution feedback loops to LLM-based code generation agents.
Core Mechanics
- Reframes code optimization as 'max-reward RL' (RL that targets the highest reward ever achieved rather than the cumulative reward), always keeping the best-performing code found so far in context
- Instead of providing only numeric execution results, a separate critique model (Claude-3.7-Sonnet) generates natural language diagnoses such as 'memory bandwidth bottleneck' or 'incorrect operation ordering' and passes them to the LLM
- A reward-to-go model (fine-tuned Qwen2.5-7B-Instruct) predicts the 'expected future maximum speedup' along a search path, enabling selection of promising candidates without running them
- Plugging existing search methods (CUDA-LLM, Effi-Learner) into this framework yields immediate performance gains without retraining — operates as a plug-in
- The same operation implemented as a 'sequential sub-kernel chain' vs. a 'fully fused kernel' can run at nearly identical speeds, so simple fast/slow feedback alone cannot indicate which optimization direction to pursue — this is the gap the critique model fills
- Confirmed scaling effect: as inference budget (search depth) increases, MaxCode improves performance faster than existing methods
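The max-reward objective in the first bullet can be illustrated with a minimal sketch (not from the paper's code): the return along a search trajectory is the running maximum of per-step speedups, not their sum.

```python
# Illustrative contrast between the usual cumulative-return objective and the
# max-reward objective MaxCode optimizes. Names and numbers are hypothetical.

def cumulative_return(rewards):
    # Standard RL: total reward accumulated along the trajectory.
    return sum(rewards)

def max_reward_return(rewards):
    # Max-reward RL: only the best speedup ever observed matters,
    # since we ship the single fastest kernel found, not the sum of attempts.
    return max(rewards)

# Hypothetical per-step speedups from a search trajectory
speedups = [1.0, 2.4, 1.1, 3.17, 0.9]
print(max_reward_return(speedups))  # the best kernel found so far: 3.17
```

A regression at a later step (0.9x above) lowers the cumulative return but leaves the max-reward return untouched, which is why keeping the best-so-far code in context is the natural policy input for this objective.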
Evidence
- CUDA-LLM + MaxCode combination improves KernelBench Level 1 from 2.49x → 3.17x (27.3% relative improvement) and Level 2 from 1.45x → 1.61x (11.0% improvement)
- On the PIE (C++ optimization) benchmark, CUDA-LLM + MaxCode improves from 1.42x → 1.74x (22.5% relative improvement), with average ranking also improving from 2.05 → 1.74
- Full MaxCode (Traj Critique Best Perf) consistently achieves higher max speedup than individual components (Critique-only or Best Perf-only) — the combination effect is key
- The reward-to-go model shows ranking improvements on KernelBench L2 and PIE (1.57→1.33, 1.55→1.43), though performance degrades on L1 due to distribution mismatch
How to Apply
- Add a critique step to existing LLM code optimization loops: instead of appending raw execution results directly to the prompt, first pass them to a separate LLM (Claude-3.7-Sonnet with extended thinking enabled) asking it to 'diagnose the bottlenecks in this code and suggest improvements', then inject that natural language critique into the main prompt
- Always include 'the fastest version of the code found so far + its feedback' in the prompt during iterative optimization loops: this lets the LLM understand how the current attempt compares to the best known result, reducing meaningless regressions
- When GPU execution costs are prohibitive due to a large number of search candidates, fine-tune a small model like Qwen2.5-7B with LoRA as a reward predictor — use it solely to pre-filter low-potential candidates before actual execution
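The pre-filtering step in the last bullet can be sketched as follows. This is a minimal illustration, assuming a `predict_reward_to_go` callable that stands in for the LoRA-fine-tuned Qwen2.5-7B reward predictor; the function name and `keep` parameter are hypothetical.

```python
# Hypothetical sketch: rank candidates by predicted reward-to-go and keep only
# the top few for actual GPU compilation/benchmarking.

def prefilter(candidates, predict_reward_to_go, keep=4):
    # Score every candidate with the cheap learned predictor (no GPU needed),
    # then keep only the most promising ones for expensive real execution.
    scored = sorted(candidates, key=predict_reward_to_go, reverse=True)
    return scored[:keep]
```

Used this way, the predictor never replaces measurement; it only prunes low-potential candidates so the GPU budget is spent on the survivors.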
Code Example
# MaxCode core prompt structure (Traj Critique Best Perf version)
GENERATOR_PROMPT = """
You write custom CUDA kernels to replace the pytorch operators in the given architecture to get speedups.
You are provided with:
1. The pytorch architecture to optimize
2. Your BEST-PERFORMING optimization so far and its execution feedback
3. Your TRAJECTORY of previous attempts with execution feedback
4. NATURAL LANGUAGE CRITIQUES for each attempt
Given this information, refine your optimization:
- If compiled=False: fix compilation errors (refer to best-performing solution for cues)
- If correctness=False: fix logic errors
- If correct: reduce runtime below the best-performing solution so far
IMPLEMENT CUDA OPERATORS using:
from torch.utils.cpp_extension import load_inline
"""
CRITIQUE_PROMPT = """
Given the optimization attempt and execution feedback:
1. Diagnose: What are the performance bottlenecks? (memory bandwidth? compute utilization? algorithmic inefficiency?)
2. Suggest: Provide actionable steps to fix or improve performance
- If compile error: explain why it fails and how to fix
- If correct but slow: identify specific bottleneck and optimization strategy
"""
# Usage example (pseudo-code)
def maxcode_loop(initial_code, max_depth=8, k_candidates=8):
    best_code, best_speedup = initial_code, 1.0
    trajectory = []
    for depth in range(max_depth):
        # 1. Generate k candidates
        candidates = [llm.generate(GENERATOR_PROMPT,
                                   arch=initial_code,
                                   best=(best_code, best_speedup),
                                   trajectory=trajectory)
                      for _ in range(k_candidates)]
        # 2. Execute and measure speedup
        results = [execute_and_measure(c) for c in candidates]
        # 3. Generate critiques
        critiques = [llm.generate(CRITIQUE_PROMPT, code=c, feedback=r)
                     for c, r in zip(candidates, results)]
        # 4. Select the fastest candidate; update best if it improves
        best_idx = max(range(k_candidates), key=lambda i: results[i].speedup)
        if results[best_idx].speedup > best_speedup:
            best_code = candidates[best_idx]
            best_speedup = results[best_idx].speedup
        trajectory.append((candidates[best_idx], results[best_idx], critiques[best_idx]))
    return best_code, best_speedup
Original Abstract
Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) the complexity of writing optimized code (such as performant CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms and specific languages and (ii) requires interpretation of performance metrics like timing and device utilization beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we integrate a natural language critique model that converts raw execution feedback into diagnostic insights about errors and performance bottlenecks, and the best-discounted reward seen so far. Together, these provide richer input to the code proposal function. To improve exploration during search, we train a generative reward-to-go model using action values from rollouts to rerank potential solutions. Testing on the KernelBench (CUDA) and PIE (C++) optimization benchmarks shows that MaxCode improves optimized code performance compared to baselines, achieving 20.3% and 10.1% relative improvements in absolute speedup value and relative speedup ranking, respectively.