Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling
TL;DR Highlight
Using optimization solvers like Gurobi as verifiers for RL training — a 32B model beats DeepSeek-V3 and OpenAI-o3.
Who Should Read
ML engineers building systems that auto-convert natural language problems into mathematical optimization models for supply chain, logistics, or finance. AI researchers looking to build RLVR pipelines with external tool verification.
Core Mechanics
- Full pipeline trained with RL: natural language problem → mathematical model → executable Gurobi code, using solver execution results (executability, optimal value, .lp file structure) directly as rewards
- Partial KL: a new surrogate objective that exempts reasoning tokens from the KL penalty and applies it only to the formulation and code-generation tokens, preventing reward hacking without constraining the reasoning process
- Ensemble inference uses .lp file structural features (variable types, constraint counts) instead of simple majority voting for answer selection
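The Partial KL idea can be sketched as a token-level mask over the KL term: tokens in the reasoning span contribute nothing, while formulation/code tokens keep the penalty. A minimal sketch (the function name, span labels, and beta value are illustrative assumptions, not the paper's code):

```python
def partial_kl_penalty(token_tags, per_token_kl, beta=0.1):
    """Apply the KL penalty only to formulation/code tokens.

    token_tags:   span label per token, e.g. 'think', 'model', 'python'
    per_token_kl: per-token KL(pi || pi_ref) values
    Returns the summed penalty; 'think' tokens contribute zero.
    """
    penalized = {'model', 'python'}  # assumption: penalize the non-reasoning spans
    return beta * sum(kl for tag, kl in zip(token_tags, per_token_kl)
                      if tag in penalized)

tags = ['think', 'think', 'model', 'python']
kls = [0.8, 0.5, 0.2, 0.1]
penalty = partial_kl_penalty(tags, kls)  # only 0.2 and 0.1 are penalized
```

In a real RLHF-style loss this masked penalty would be subtracted from the per-token advantage before the policy-gradient update; the sketch only shows the masking itself.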
Evidence
- SIRL-Qwen2.5-32B Macro Average 68.3% vs DeepSeek-V3.1 65.0%, DeepSeek-R1 64.6%, OpenAI-o3 57.1% (5-benchmark average)
- Partial KL vs Without KL: IndustryOR execution success 96.0% vs 87.0%, OptMATH 92.2% vs 80.1%
- The Partial KL ablation shows consistent gains in both execution reliability (code runs without solver errors) and solution accuracy (correct optimal value) across benchmarks
How to Apply
- With a Gurobi/CPLEX license: auto-execute LLM-generated optimization code and define rewards as execution success + objective function accuracy to build an RLVR pipeline. Code and prompt templates available on the paper's GitHub.
- For ensemble inference, compare .lp file structural features (variable types, constraint counts) instead of simple answer majority voting for more robust selection.
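The structural features above (optimization direction, binary/integer variable counts, constraint counts) can be read off a saved `.lp` file. With a licensed gurobipy install, `gurobipy.read(...)` exposes them as model attributes (`ModelSense`, `NumBinVars`, `NumIntVars`, `NumConstrs`); the sketch below instead parses the LP text directly so it runs without a solver. It is a coarse reader, not a full LP-format parser (e.g. a constraint wrapped across lines would be counted per line):

```python
def lp_features(lp_text):
    """Extract coarse structural features from LP-format model text."""
    feats = {'direction': None, 'n_binary': 0, 'n_integer': 0, 'n_constraints': 0}
    headers = {
        'maximize': 'max', 'minimize': 'min',
        'subject to': 'constraints', 'st': 'constraints', 's.t.': 'constraints',
        'binaries': 'binary', 'binary': 'binary', 'bin': 'binary',
        'generals': 'integer', 'general': 'integer', 'gen': 'integer',
        'bounds': 'bounds', 'end': 'end',
    }
    section = None
    for raw in lp_text.splitlines():
        line = raw.strip()
        if not line or line.startswith('\\'):  # skip blanks and LP comments
            continue
        key = headers.get(line.lower())
        if key in ('max', 'min'):
            feats['direction'] = key
            section = 'objective'
        elif key is not None:
            section = key
        elif section == 'constraints':
            feats['n_constraints'] += 1
        elif section == 'binary':
            feats['n_binary'] += len(line.split())
        elif section == 'integer':
            feats['n_integer'] += len(line.split())
    return feats
```

Two candidate models that disagree on any of these features almost certainly formalized the problem differently, which is the signal the instance-enhanced voting exploits.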
Code Example
# SIRL system prompt structure (as in the paper)
SYSTEM_PROMPT = """
You are a helpful Assistant with expertise in operations research and the Gurobi solver.
When the User provides an OR question, you will analyze it, build a detailed mathematical model,
and provide the Gurobi code to solve it.
Your response should follow these steps:
1. <think> Carefully analyze the problem to identify decision variables, objective, and constraints.</think>
2. <model> Develop a complete mathematical model, explicitly defining:
* Sets * Parameters * Decision Variables (and their types) * Objective Function * Constraints </model>
3. <python> Provide the corresponding Gurobi Python code to implement the model. </python>
The output must be in Markdown format, with each step enclosed in the specified tags.
"""
# Instance-Enhanced Self-Consistency scoring example
from collections import Counter

def instance_enhanced_score(roles_results):
    """
    roles_results: list of dicts with keys:
      - obj_value: float (optimal objective function value)
      - direction: str ('max' or 'min')
      - n_binary: int (number of binary variables)
      - n_integer: int (number of integer variables)
    Returns the rollout whose instance-level features agree most with the
    others (paper setting: weights w1 = w2 = w3 = w4 = 1, exponent p = 0.5).
    """
    obj_counter = Counter(r['obj_value'] for r in roles_results)
    dir_counter = Counter(r['direction'] for r in roles_results)
    bin_counter = Counter(r['n_binary'] for r in roles_results)
    int_counter = Counter(r['n_integer'] for r in roles_results)
    scores = [
        obj_counter[r['obj_value']] ** 0.5
        + dir_counter[r['direction']] ** 0.5
        + bin_counter[r['n_binary']] ** 0.5
        + int_counter[r['n_integer']] ** 0.5
        for r in roles_results
    ]
    return roles_results[scores.index(max(scores))]
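A self-contained usage sketch of the scoring rule (rollout values invented for illustration; the scorer is restated compactly so the snippet runs on its own). The two rollouts that agree on objective value and model structure outrank the outlier:

```python
from collections import Counter

def instance_enhanced_score(roles_results):
    # Compact restatement of the scorer above, so this demo is standalone.
    keys = ('obj_value', 'direction', 'n_binary', 'n_integer')
    counters = {k: Counter(r[k] for r in roles_results) for k in keys}
    scores = [sum(counters[k][r[k]] ** 0.5 for k in keys) for r in roles_results]
    return roles_results[scores.index(max(scores))]

rollouts = [
    {'obj_value': 42.0, 'direction': 'max', 'n_binary': 3, 'n_integer': 0},
    {'obj_value': 42.0, 'direction': 'max', 'n_binary': 3, 'n_integer': 0},
    {'obj_value': 17.5, 'direction': 'max', 'n_binary': 1, 'n_integer': 2},
]
best = instance_enhanced_score(rollouts)  # one of the two agreeing rollouts
```

The square-root exponent (p = 0.5) dampens the reward for sheer agreement, so a large cluster cannot fully drown out structural disagreement on variable types.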
# Reward function structure (2 stages)
def compute_reward(response, ground_truth, stage=1):
    # is_executable, extract_objective, and uses_advanced_modeling are
    # placeholders for the solver-side checks described in the paper
    # (actual code execution and .lp file analysis).
    r_format = 0.5 if all(tag in response for tag in ('<think>', '<model>', '<python>')) else 0.0
    r_exec = 1.0 if is_executable(response) else 0.0  # actual code execution
    predicted = extract_objective(response)  # optimal value parsed from solver output
    denom = max(abs(ground_truth), 1e-9)     # guard against a zero optimum
    r_accur = 2.0 if predicted is not None and abs(predicted - ground_truth) / denom < 1e-6 else 0.0
    if stage == 1:
        return r_format + r_exec + r_accur
    # stage 2: bonus for advanced modeling constructs, detected via .lp file analysis
    r_bonus = 1.0 if uses_advanced_modeling(response) else 0.0
    return r_format + r_exec + r_accur + r_bonus
Terminology
Related Resources
Original Abstract
Optimization modeling is fundamental to decision-making across diverse domains. Despite progress in automating optimization formulation from natural language descriptions, Large Language Models (LLMs) often struggle to generate formally correct and usable models against hallucinations, posing a challenge for reliable automation. Inspired by the success of Reinforcement Learning (RL) in enhancing Large Reasoning Models, we present Solver-Informed Reinforcement Learning (SIRL), a novel framework that significantly improves the authenticity of LLMs for optimization modeling using Reinforcement Learning with Verifiable Reward by leveraging external optimization solvers as verifiers. These verifiers automatically assess the executable code and the instance-level mathematical model represented by the associated LP file, yielding precise and comprehensive feedback signals -- including syntax, feasibility, and solution quality, serving as direct rewards for the RL process. This automated verification process, particularly from classic optimization solvers, also underpins our instance-enhanced self-consistency method to synthesize high-quality training data. Extensive experiments on diverse public benchmarks demonstrate that SIRL achieves state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models. Our code is publicly available at https://github.com/Cardinal-Operations/SIRL.