형식화하되 최적화하지 마라: LLM이 생성하는 Combinatorial Solver의 Heuristic Trap

TL;DR Highlight

LLM에게 조합 최적화 문제의 solver를 만들게 할 때, 'Python + OR-Tools'가 가장 정확하고 '효율 최적화' 프롬프트는 오히려 정확도를 망친다.

Who Should Read

LLM을 활용해 최적화 문제(스케줄링, 배정, 탐색 등)를 자동으로 풀게 하는 시스템을 개발하는 백엔드/AI 엔지니어. 특히 LLM이 코드를 생성해서 CP(Constraint Programming) solver를 구동하는 파이프라인을 고민하는 개발자.

Core Mechanics

100개 조합 최적화 문제(4,577 인스턴스)로 구성된 CP-SynC-XL 벤치마크를 새로 공개했고, GPT-5.3-CODEX, GEMINI 3.1 PRO, DEEPSEEK-V3.2 세 LLM으로 실험했다.
세 가지 solver 생성 방식(순수 Python, Python + OR-Tools API, MiniZinc 선언형 모델링) 중 Python + OR-Tools가 모든 LLM에서 정확도 1위였다. 같은 OR-Tools 백엔드를 쓰는 MiniZinc보다 높은 이유는 LLM이 Python API에 더 익숙하기 때문.
MiniZinc + OR-Tools는 절대적 정확도는 낮지만 '답을 낼 때만큼은' 가장 신뢰도(fidelity)가 높다(≥83%). 반면 순수 Python은 답을 내도 틀린 경우가 가장 많다(fidelity 최저 51.9%).
'효율적으로 짜라'는 heuristic 프롬프트는 중간 속도 향상이 겨우 1.03~1.12배에 불과하고, DEEPSEEK-V3.2에서는 오히려 정확도가 최대 7.4pp, fidelity가 16.1pp 급락했다.
heuristic 프롬프트가 유발하는 6가지 실패 패턴(A~F)을 코드 수준에서 분류했다: 검증 안 된 tight bound 삽입(A), propagation을 약화시키는 Big-M 재작성(B), 하드코딩된 초기해(C), 완전탐색을 불완전 탐색으로 대체(D), COP를 SAT로 붕괴(E), 과도한 중복 제약 추가(F).
LLM이 문제를 '공식화(formalize)'하는 단계에서 발생하는 기본 오류도 4가지 패턴으로 분류했다: 문제 이름만 보고 템플릿 적용(M1), 긴 명세에서 일부 제약 누락(M2), 출력 스키마 불일치(M3), 센티널 값 오독(M4). M2와 M3이 전체 오류의 약 74%를 차지한다.

Evidence

Python + OR-Tools baseline 정확도: DEEPSEEK-V3.2 43.9%, GPT-5.3-CODEX 70.2%, GEMINI 3.1 PRO 71.6%로 모든 LLM에서 세 패러다임 중 최고치 기록.
heuristic 프롬프트가 DEEPSEEK-V3.2 Python + OR-Tools에서 정확도 −7.4pp, fidelity −16.1pp로 가장 큰 손실 발생. 반면 GPT-5.3-CODEX는 모든 패러다임에서 ±3.6pp 이내로 안정적.
Mode E(MiniZinc에서 최적화 문제를 만족 문제로 잘못 변환)는 발생 빈도 0.7%로 드물지만, 발생 시 평균 정확도 −33.4pp로 가장 치명적. GPT-5.3-CODEX problem 14에서 정확도 0.75 → 0.083으로 추락.
Mode D의 최악 사례: 순수 Python에서 GPT-5.3-CODEX가 anti-magic square 문제에서 완전탐색을 편향된 로컬서치로 교체하자 정확도 0.826 → 0.087로 급락. 벤치마크 전체에서 단일 문제 최대 하락폭.

How to Apply

LLM으로 최적화 solver를 자동 생성할 때는 'Python + OR-Tools CP-SAT API 방식'을 기본으로 선택하라. 같은 OR-Tools 백엔드를 쓰더라도 MiniZinc보다 LLM의 Python API 숙련도가 훨씬 높아 정확도가 일관되게 높다.
효율 향상을 위한 프롬프트(예: '빠르게 풀어라', '가지치기를 추가해라')는 강한 LLM(GPT-5.3-CODEX)에만 제한적으로 사용하고, DEEPSEEK-V3.2 같은 약한 모델에는 기본 정확도 프롬프트만 사용하라. heuristic 프롬프트 후에는 반드시 독립적인 검증 단계를 추가하라.
LLM이 생성한 solver에서 다음 코드 패턴이 보이면 즉시 검토하라: ① 주석으로만 주장되는 bound/dominance 값 ② solve minimize가 solve satisfy로 바뀐 MiniZinc 코드 ③ max_iter나 node_budget으로 탐색을 강제 종료하는 Python 코드. 이 패턴들은 논문에서 정의한 heuristic trap의 핵심 신호다.

Code Example

snippet

# Python + OR-Tools 방식 기본 프롬프트 구조 (논문 권장 방식)
system_prompt = """
You are an expert constraint programming assistant.
Write a Python function that solves the given combinatorial problem using OR-Tools CP-SAT.

def solve_instance(data_dict) -> dict:
    ...
    return solution

Requirements:
- Use ortools.sat.python.cp_model ONLY. No pure-Python fallback.
- Do NOT set internal time limits (max_time_in_seconds, etc.).
- Focus on correctness: encode variables, constraints, and objectives faithfully.
- If OR-Tools is unavailable, raise RuntimeError immediately.
"""

# 피해야 할 heuristic 프롬프트 패턴 (논문에서 위험하다고 밝힌 것들)
bad_heuristic_prompt = """
# 아래 패턴들은 heuristic trap을 유발함:
# 1. tight bound 주장 (검증 없이)
# excess = S * P - D
# zub = max(0, min(S * P, d[i] + excess))  # 'claimed dominance bound' - 위험!

# 2. 완전탐색을 로컬서치로 교체
# max_iter = 100000  # 탐색을 강제로 잘라냄 - UNSAT 유발
# for _ in range(max_iter): ...

# 3. MiniZinc에서 최적화를 만족으로 바꾸기
# solve satisfy;  # 원래 solve minimize였으면 절대 금지!
"""

# 권장: 검증 파이프라인 추가
def validate_solver_code(code: str) -> list[str]:
    """heuristic trap 신호 탐지"""
    warnings = []
    if 'max_time_in_seconds' in code:
        warnings.append('[Mode A] 내부 시간 제한 감지 - 외부 프레임워크에 위임하세요')
    if 'max_iter' in code or 'node_budget' in code:
        warnings.append('[Mode D] 반복 횟수 제한 감지 - 완전탐색이 불완전해질 수 있음')
    if 'solve satisfy' in code and 'minimize' not in code:
        warnings.append('[Mode E] COP가 SAT로 변환됐는지 확인 필요')
    if code.count('\n') > 150:  # 너무 긴 MiniZinc 모델
        warnings.append('[Mode F] 모델이 매우 큼 - 과도한 보조 제약 추가됐을 가능성')
    return warnings

Terminology

Combinatorial Optimization경우의 수가 폭발적으로 많은 문제에서 최적 해를 찾는 것. 예: 배달 경로 최적화, 시간표 배정. 컴퓨터가 모든 경우를 다 시도하면 우주 나이보다 오래 걸릴 수 있음.

CP-SATGoogle OR-Tools에 포함된 제약 만족/최적화 전용 solver. 변수와 제약을 선언하면 알아서 해를 탐색해주는 엔진. 직접 탐색 알고리즘을 짜는 것보다 훨씬 안정적.

OR-ToolsGoogle이 만든 오픈소스 최적화 라이브러리. CP-SAT solver를 Python API로 쉽게 사용할 수 있게 해줌.

MiniZinc제약 프로그래밍용 선언형 모델링 언어. SQL처럼 '무엇을 원하는지'만 적으면 solver가 알아서 해를 찾아줌. 단, LLM이 Python보다 덜 익숙함.

Fidelitysolver가 답을 '냈을 때' 그 답이 실제로 맞는 비율. '답을 많이 내는 것'과 '맞는 답을 내는 것'을 구분하는 지표.

Heuristic TrapLLM에게 '더 빠르게 풀어라'고 요청했을 때, LLM이 검증되지 않은 추측성 최적화를 추가해서 오히려 정답률이 떨어지는 현상.

NP-hard풀기 어려운 문제의 난이도 분류. 입력 크기가 조금만 커져도 계산량이 기하급수적으로 늘어나는 문제들. 외판원 문제, 스케줄링 등이 여기에 해당.

Fidelity (Conditional)정답을 '제출했을 때'만 조건부로 계산한 정확도. 타임아웃으로 포기한 경우는 제외하고, 제출한 답 중 맞은 비율만 따짐.

Related Resources

Original Abstract (Expand)

Large Language Models (LLMs) struggle to solve complex combinatorial problems through direct reasoning, so recent neuro-symbolic systems increasingly use them to synthesize executable solvers. A central design question is how the LLM should represent the solver, and whether it should also attempt to optimize search. We introduce CP-SynC-XL, a benchmark of 100 combinatorial problems (4,577 instances), and evaluate three solver-construction paradigms: native algorithmic search (Python), constraint modeling through a Python solver API (Python + OR-Tools), and declarative constraint modeling (MiniZinc + OR-Tools). We find a consistent representational divergence: Python + OR-Tools attains the highest correctness across LLMs, while MiniZinc + OR-Tools has lower absolute coverage despite using the same OR-Tools back-end. Native Python is the most likely to return a schema-valid solution that fails verification, whereas solver-backed paths preserve higher conditional fidelity. On the heuristic axis, prompting for search optimization yields only small median speed-ups (1.03-1.12x) and a strongly bimodal effect: many instances slow down, and correctness drops sharply on a long tail of problems. A paired code-level audit traces these regressions to a recurring heuristic trap. Under an efficiency-oriented prompt, the LLM may replace complete search with local approximations (Python), inject unverified bounds (Python + OR-Tools), or add redundant declarative machinery that overwhelms or over-constrains the model (MiniZinc + OR-Tools). These findings support a conservative design principle for LLM-generated combinatorial solvers: use the LLM primarily to formalize variables, constraints, and objectives for verified solvers, and separately check any LLM-authored search optimization before use.