Solver-Informed RL: An LLM Grounding Framework for Optimization Modeling

Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling

May 17, 2025 • Yitian Chen, Jingfan Xia, Siyu Shao +2

TL;DR Highlight

Using an optimization solver such as Gurobi as the verifier and training an LLM with RL lets a 32B model beat DeepSeek-V3 and OpenAI-o3.

Who Should Read

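ML engineers building systems that automatically convert natural-language problems into mathematical optimization models in domains such as supply chain, logistics, and finance. Also AI researchers who want to build RLVR pipelines that feed external tool-verification feedback back into an LLM.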

Core Mechanics

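The whole pipeline, from natural-language problem to mathematical model to executable Gurobi code, is trained with RL. Solver outputs (whether the code executes, the optimal value, and the structure of the .lp file) are used directly as the reward.
Partial KL, a new surrogate function design: no KL penalty on the reasoning segment, with the KL penalty applied only to the formulation and code-generation segments. Removing KL entirely drops the code execution success rate into the 80% range (87.0% on IndustryOR, 80.1% on OptMATH).
Instance-Enhanced Self-Consistency: instead of voting on the answer value alone, the objective direction (maximize/minimize), the number of binary variables, and the number of integer variables are extracted from the .lp file and included in the ensemble vote. On OptMATH this is a 128% relative improvement over val_sc for the 7B model.
Two-stage curriculum: stage 1 uses basic executability and accuracy rewards; stage 2 adds a bonus reward for using advanced techniques such as Big-M and nonlinear modeling.
SIRL-Qwen2.5-32B reaches a Macro Average of 68.3%, ahead of DeepSeek-V3.1 (65.0%), DeepSeek-R1 (64.6%), and OpenAI-o3 (57.1%). A 32B model beats 671B models.
RL clearly outperforms SFT: on MAMO Complex, SFT 38.0% → RL 51.7% (+13.7 percentage points); on OptMATH, SFT 20.8% → RL 30.5% (+9.7 percentage points).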
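The Partial KL idea in the list above can be pictured as a token mask: the KL penalty is counted only for tokens in the `<model>` and `<python>` segments and skipped for `<think>` tokens. The sketch below is a plain-Python illustration, not the paper's implementation. The function names, the per-token KL inputs, the segment labels, and the choice to average over the kept tokens are all assumptions made for the example.

```python
from typing import List


def partial_kl_mask(segments: List[str]) -> List[float]:
    """Return 1.0 where the KL penalty applies and 0.0 where it is skipped.

    segments: one label per token ('think', 'model', or 'python'); these
    labels are hypothetical and would be derived from the response tags.
    """
    return [0.0 if seg == "think" else 1.0 for seg in segments]


def partial_kl_penalty(per_token_kl: List[float], segments: List[str]) -> float:
    """Average the per-token KL over the non-reasoning tokens only (assumed reduction)."""
    mask = partial_kl_mask(segments)
    kept = sum(mask)
    if kept == 0:
        return 0.0  # no formulation/code tokens, so nothing to penalize
    return sum(k * m for k, m in zip(per_token_kl, mask)) / kept


# Toy example: two reasoning tokens, one formulation token, one code token
segments = ["think", "think", "model", "python"]
per_token_kl = [0.9, 0.7, 0.2, 0.4]
penalty = partial_kl_penalty(per_token_kl, segments)
print(round(penalty, 3))
```

In this toy example the large KL values on the reasoning tokens do not contribute at all; only the two formulation/code tokens are averaged.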

Evidence

SIRL-Qwen2.5-32B Macro Average 68.3% vs. DeepSeek-V3.1 65.0%, DeepSeek-R1 64.6%, OpenAI-o3 57.1% (average over 5 benchmarks)
Partial KL vs. Without KL: execution success rate of 96.0% vs. 87.0% on IndustryOR and 92.2% vs. 80.1% on OptMATH (without KL, cases where the code does not run at all increase sharply)
Instance-Enhanced Self-Consistency (inst_sc@5) vs. plain value voting (val_sc@5) on OptMATH: 5.7% → 13.0% for the 7B model (128% relative improvement), 22.3% → 27.5% for the 32B model (23% relative improvement)
Versus pure text reasoning (no solver), with Qwen2.5-7B-Instruct as the base: NL4OPT 24.5% → 96.3% with SIRL and OptMATH 5.7% → 30.5%, gains of roughly 4x and more than 5x respectively
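The figures above mix relative gains and raw accuracy ratios, which are easy to confuse with percentage-point differences. The short check below recomputes them from the reported accuracies. The helper name is only for illustration.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0


# Accuracies (in %) as reported in the Evidence section
gain_7b = relative_gain(5.7, 13.0)    # val_sc@5 -> inst_sc@5, 7B, OptMATH
gain_32b = relative_gain(22.3, 27.5)  # val_sc@5 -> inst_sc@5, 32B, OptMATH
ratio_nl4opt = 96.3 / 24.5            # text-only reasoning -> SIRL on NL4OPT
ratio_optmath = 30.5 / 5.7            # text-only reasoning -> SIRL on OptMATH

print(f"{gain_7b:.0f}% {gain_32b:.0f}% {ratio_nl4opt:.2f}x {ratio_optmath:.2f}x")
```

With these inputs the NL4OPT ratio comes out just under 4x, while the OptMATH ratio is above 5x.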

How to Apply

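In an environment with a Gurobi/CPLEX license, you can build an RLVR pipeline that automatically executes LLM-generated optimization code and defines the reward from execution success plus the accuracy of the objective value. The code and prompt templates released on the paper's GitHub can be used as-is.
When ensemble inference is needed, replace a simple majority vote on the answer value with a weighted vote that also uses variable type and count information extracted from the .lp file. This is especially effective on complex MILP problems that mix integer and binary variables.
When building fine-tuning data, apply a "remove samples that are too easy" strategy: excluding problems that the reference model (Qwen-32B) answers correctly at least 8 times out of 10 shrinks the dataset from 70,000 to 10,000 samples while performance goes up.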
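The "remove samples that are too easy" step in the last item can be sketched as a pass-rate filter. This is an illustration under assumptions: `attempts` holds 10 correctness flags per problem from the reference model, and the record layout and function name are hypothetical rather than taken from the paper's code.

```python
from typing import Dict, List


def filter_easy_samples(
    attempts: Dict[str, List[bool]],
    max_correct: int = 7,
) -> List[str]:
    """Keep problem IDs that the reference model solves at most `max_correct` times.

    With 10 attempts per problem, max_correct=7 drops every problem solved
    8 or more times, matching the "8 out of 10" rule described above.
    """
    return [pid for pid, flags in attempts.items() if sum(flags) <= max_correct]


attempts = {
    "easy": [True] * 9 + [False],            # 9/10 correct, dropped
    "borderline": [True] * 8 + [False] * 2,  # 8/10 correct, dropped
    "hard": [True] * 3 + [False] * 7,        # 3/10 correct, kept
}
kept = filter_easy_samples(attempts)
print(kept)
```

Keeping the threshold as a parameter makes it easy to experiment with stricter or looser cut-offs on a different reference model.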

Code Example

# SIRL system prompt structure (as given in the paper)
SYSTEM_PROMPT = """
You are a helpful Assistant with expertise in operations research and the Gurobi solver.
When the User provides an OR question, you will analyze it, build a detailed mathematical model,
and provide the Gurobi code to solve it.

Your response should follow these steps:
1. <think> Carefully analyze the problem to identify decision variables, objective, and constraints.</think>
2. <model> Develop a complete mathematical model, explicitly defining:
   * Sets * Parameters * Decision Variables (and their types) * Objective Function * Constraints </model>
3. <python> Provide the corresponding Gurobi Python code to implement the model. </python>

The output must be in Markdown format, with each step enclosed in the specified tags.
"""

# Instance-Enhanced Self-Consistency scoring example
from collections import Counter

def instance_enhanced_score(roles_results):
    """
    roles_results: list of dict with keys:
      - obj_value: float (optimal objective value)
      - direction: str ('max' or 'min')
      - n_binary: int (number of binary variables)
      - n_integer: int (number of integer variables)
    Returns the candidate whose features agree most with the other candidates.
    """
    # Count how often each feature value occurs across all candidates
    obj_counter = Counter(r['obj_value'] for r in roles_results)
    dir_counter = Counter(r['direction'] for r in roles_results)
    bin_counter = Counter(r['n_binary'] for r in roles_results)
    int_counter = Counter(r['n_integer'] for r in roles_results)

    scores = []
    for r in roles_results:
        # w1=w2=w3=w4=1 (the paper's setting)
        score = (
            obj_counter[r['obj_value']] ** 0.5 +  # p=0.5 (sqrt)
            dir_counter[r['direction']] ** 0.5 +
            bin_counter[r['n_binary']] ** 0.5 +
            int_counter[r['n_integer']] ** 0.5
        )
        scores.append(score)

    # Ties are broken in favor of the earliest candidate
    best_idx = scores.index(max(scores))
    return roles_results[best_idx]

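# Reward function structure (two stages)
# Assumption: is_executable, extract_objective, and uses_advanced_modeling are
# placeholder helpers (not defined here) that run the code and inspect the .lp file.
def compute_reward(response, ground_truth, stage=1):
    tags = ['<think>', '<model>', '<python>']
    r_format = 0.5 if all(tag in response for tag in tags) else 0.0
    executable = is_executable(response)  # actually runs the generated code
    r_exec = 1.0 if executable else 0.0

    # An objective value is only available when the code ran successfully
    predicted = extract_objective(response) if executable else None
    denom = max(abs(ground_truth), 1e-12)  # guard against division by zero
    correct = predicted is not None and abs(predicted - ground_truth) / denom < 1e-6
    r_accur = 2.0 if correct else 0.0

    reward = r_format + r_exec + r_accur
    if stage == 2:
        # Bonus for advanced modeling techniques, detected from the .lp file
        reward += 1.0 if uses_advanced_modeling(response) else 0.0
    return reward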
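The reward function above calls `is_executable` without defining it. One possible way to implement such a check is to pull the code out of the `<python>` tags and run it in a subprocess with a timeout. This is a sketch under assumptions, not the paper's implementation: the helper names, the timeout value, and the choice to treat a zero exit code as success are all illustrative, and untrusted model output should be sandboxed in practice.

```python
import re
import subprocess
import sys
from typing import Optional


def extract_python_block(response: str) -> Optional[str]:
    """Return the text between <python> and </python>, or None if absent."""
    match = re.search(r"<python>(.*?)</python>", response, flags=re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip()


def is_executable(response: str, timeout_s: float = 30.0) -> bool:
    """Run the extracted code in a fresh interpreter; True if it exits with code 0."""
    code = extract_python_block(response)
    if not code:
        return False
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hung solver run as a failed execution
    return result.returncode == 0


ok = is_executable("<python>print(1 + 1)</python>")
bad = is_executable("<python>raise SystemExit(1)</python>")
```

Running the code in a separate process keeps a crash or infinite loop in generated code from taking down the training loop.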

Terminology

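RLVR: Reinforcement Learning with Verifiable Reward. A method for training an LLM with reinforcement learning using feedback from external tools (compilers, solvers, and so on) that can verify whether an answer is right or wrong. No human scoring is needed.

MILP: Mixed-Integer Linear Programming. An optimization problem in which some variables must take integer values (such as 0 or 1). Common in real-world problems with on/off decisions, such as whether to run a factory.

.lp file: A file that stores an optimization problem in a standard text format. It records the variables, constraints, and objective function in a form a solver can read. The paper uses this file to verify the model structure.

KL Divergence: A measure of how different two probability distributions are. In RL training it is used to keep the model from drifting too far from its original version.

Partial KL: A technique newly proposed in this paper. No KL constraint is applied to the reasoning segment (so the model can think freely), and the KL constraint is applied only to the code and formula generation segments (so the format is preserved), which gives both stability and exploration.

Self-Consistency: An ensemble technique that solves the same problem several times and picks the answer by majority vote. This paper goes beyond voting on the answer value and also includes model-structure information in the vote.

Curriculum Learning: Training in order from easy to hard. Just as students learn addition before calculus, the model learns from the basics up.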
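The scoring example earlier expects `direction`, `n_binary`, and `n_integer` for each candidate, and the .lp file entry above is where those features come from. The sketch below reads them from LP-format text. It is a simplified parser written for illustration: it assumes section keywords appear alone on their own line, handles only the common sections, and its function name is hypothetical. A production pipeline would more likely query the solver's own model object.

```python
from typing import Dict, Union

# Section keywords that end a Binaries/Generals block (simplified subset)
_OTHER_SECTIONS = {
    "subject to", "such that", "st", "s.t.", "bounds", "end",
    "semi-continuous", "semis", "sos", "general constraints",
}


def lp_features(lp_text: str) -> Dict[str, Union[str, int]]:
    """Extract objective direction and binary/integer variable counts from LP text."""
    direction = "unknown"
    counts = {"binary": 0, "integer": 0}
    section = None
    for raw in lp_text.splitlines():
        line = raw.split("\\", 1)[0].strip()  # LP comments start with a backslash
        if not line:
            continue
        key = line.lower()
        if key in ("maximize", "maximum", "max"):
            direction, section = "max", None
        elif key in ("minimize", "minimum", "min"):
            direction, section = "min", None
        elif key in ("binary", "binaries", "bin"):
            section = "binary"
        elif key in ("general", "generals", "gen"):
            section = "integer"
        elif key in _OTHER_SECTIONS:
            section = None
        elif section is not None:
            counts[section] += len(line.split())  # whitespace-separated names
    return {
        "direction": direction,
        "n_binary": counts["binary"],
        "n_integer": counts["integer"],
    }


sample = "\n".join([
    "Maximize",
    "  3 x + 2 y + z",
    "Subject To",
    "  c1: x + y + z <= 10",
    "Bounds",
    "  z <= 5",
    "Binaries",
    "  x y",
    "Generals",
    "  z",
    "End",
])
features = lp_features(sample)
print(features)
```

The returned keys match the ones used by `instance_enhanced_score`, so each candidate only needs its `obj_value` added before voting.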

Original Abstract

Optimization modeling is fundamental to decision-making across diverse domains. Despite progress in automating optimization formulation from natural language descriptions, Large Language Models (LLMs) often struggle to generate formally correct and usable models against hallucinations, posing a challenge for reliable automation. Inspired by the success of Reinforcement Learning (RL) in enhancing Large Reasoning Models, we present Solver-Informed Reinforcement Learning (SIRL), a novel framework that significantly improves the authenticity of LLMs for optimization modeling using Reinforcement Learning with Verifiable Reward by leveraging external optimization solvers as verifiers. These verifiers automatically assess the executable code and the instance-level mathematical model represented by the associated LP file, yielding precise and comprehensive feedback signals -- including syntax, feasibility, and solution quality, serving as direct rewards for the RL process. This automated verification process, particularly from classic optimization solvers, also underpins our instance-enhanced self-consistency method to synthesize high-quality training data. Extensive experiments on diverse public benchmarks demonstrate that SIRL achieves state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models. Our code is publicly available at https://github.com/Cardinal-Operations/SIRL.