FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents | AI Paper Digest

TL;DR Highlight

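Instead of following a fixed pipeline, FlexSQL is a Text-to-SQL agent that can explore and execute against the database at any point during reasoning. On the Spider2.0 benchmark it outperforms systems built on gpt-o3 and DeepSeek-R1 while using a smaller model.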

Who Should Read

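Backend and data engineers who are building or improving natural-language-to-SQL pipelines over large data warehouses such as Snowflake or BigQuery. Also useful for AI engineers building LLM-based database query automation tools.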

Core Mechanics

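Existing Text-to-SQL systems retrieve the schema once and generate SQL in a fixed order. FlexSQL can instead explore and execute against the database at any point during reasoning, using six tools: GetSchema, GetTableCol, GetColValues, FindRows, SQLExecutor, and PythonExecutor.
To handle ambiguity in natural-language queries, it generates K plans in parallel with diversity-enforced sampling (sampling that deliberately forces different interpretations) and selects the final answer by majority voting over execution results.
Bilingual generation implements each plan in whichever of SQL or Python fits better. Cases that only Python solves reach up to 27.6% on Spider2-SQLite.
Repair is two-tiered. Beyond code-level fixes (up to 3 attempts), if the cause is a wrong table choice or a flaw in the plan itself, the agent backtracks to the Plan Generation stage, re-explores the schema, and rewrites the entire plan.
Final answers produced in Python are converted to SQL by a Python-to-SQL transpiler: gpt-oss-20b handles 87% and gpt-oss-120b is applied to the remainder, for an overall conversion success rate of 96.77%.
Packaging FlexSQL's exploration paradigm as a skill for Claude Code raised Spider2-Snow Pass@1 from 58.3% to 66.0%, an absolute gain of 7.7 percentage points, which suggests the approach also applies to general-purpose coding agents.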
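
The two-tier repair loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `make_plan`, `write_code`, `run`, and `fix_code` are hypothetical callables standing in for LLM and executor calls, and the `PlanError` exception and the `max_replans` limit are assumptions. Only the cap of three code-level fixes comes from the text.

```python
from typing import Callable, Optional


class PlanError(Exception):
    """Hypothetical signal that the plan itself (e.g. the table choice) is wrong."""


def solve_with_repair(
    make_plan: Callable[[list], str],
    write_code: Callable[[str], str],
    run: Callable[[str], object],
    fix_code: Callable[[str, Exception], str],
    max_code_fixes: int = 3,  # from the text: up to 3 code-level fixes
    max_replans: int = 2,     # assumption: not stated in the source
) -> Optional[object]:
    failed_plans: list = []
    for _ in range(max_replans + 1):
        # Tier 2: write a fresh plan, given the plans that already failed.
        plan = make_plan(failed_plans)
        code = write_code(plan)
        for attempt in range(max_code_fixes + 1):
            try:
                return run(code)
            except PlanError:
                break  # code edits cannot help; go back to planning
            except Exception as err:
                if attempt == max_code_fixes:
                    break  # code-level budget exhausted
                code = fix_code(code, err)  # Tier 1: patch the code
        failed_plans.append(plan)
    return None  # every plan failed within the budget
```

Passing the failed plans back into `make_plan` is one way to let the planner avoid repeating a table choice that already proved wrong; how the real system conditions its re-planning is not specified here.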

Evidence

On Spider2-Snow, gpt-oss-120b with K=16 reaches 65.44% Majority@K, surpassing both DSR-SQL + DeepSeek-R1 (63.80%) and ReFoRCE + gpt-o3 (62.89%).
On Spider2-SQLite, gpt-oss-20b reaches 50.37% Pass@1, exceeding both gpt-oss-120b-based baselines, ReFoRCE (45.19%) and DSR-SQL (48.15%). A smaller model beats systems built on a larger one.
Removing the Python interpreter causes the largest drop: Majority@8 falls from 64.44% to 52.59% (-11.85 pp) with 120b, and from 54.07% to 42.22% (-11.85 pp) with 20b.
On table-level schema linking F1, FlexSQL + oss-120b Best of 8 scores 95.26%, well ahead of ReFoRCE + o4+o3+o4 mini (80.03%) and DSR-SQL + DeepSeek-V3 (82.65%), with a precision of 95.46%.

How to Apply

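Instead of giving the LLM all database metadata in context at once, provide only the list of schema names initially and design the tool-use prompt so the model calls GetSchema → GetTableCol → GetColValues whenever it needs them. This allows accurate schema linking on large databases without context overflow.
For ambiguous queries, do not try a single interpretation. Generate K plans such as 'Plan A: use table X only; Plan B: join tables X and Y; Plan C: include additional tables', execute each one, and apply majority voting over groups of identical results.
For queries that need complex aggregation or iterative logic, add a two-step approach to the pipeline: implement the query in Python (pandas, numpy) first instead of generating SQL directly, then transpile to SQL once the result is verified. This is especially effective for recursive, state-dependent computations and multi-stage aggregations.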
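
The execution-based voting step can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in for a real warehouse, and the `sales` table, the candidate queries, and the `canonical` equality rule (row order ignored, values compared by `repr`) are all assumptions made for this example; the paper's own result comparison may differ.

```python
import sqlite3


def canonical(rows):
    """Order-insensitive fingerprint of a result set (assumed equality rule)."""
    return tuple(sorted(tuple(map(repr, row)) for row in rows))


def vote(conn, candidates):
    """Run each candidate SQL and return one query from the largest result group."""
    groups = {}
    for sql in candidates:
        try:
            key = canonical(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # candidates that fail to execute cast no vote
        groups.setdefault(key, []).append(sql)
    if not groups:
        return None
    # Ties go to the group that appeared first.
    return max(groups.values(), key=len)[0]


# Made-up data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("east", 10), ("west", 5), ("east", 7)]
)
candidates = [
    "SELECT SUM(amount) FROM sales WHERE region = 'east'",              # plan A
    "SELECT SUM(amount) FROM sales",                                    # plan B
    "SELECT SUM(s.amount) FROM sales AS s WHERE s.region IN ('east')",  # plan C
    "SELECT amount FROM no_such_table",                                 # fails
]
best = vote(conn, candidates)
print(best)
```

Plans A and C are written differently but return the same result, so they form the majority group and plan A is returned. Grouping by result rather than by query text is what lets differently phrased but equivalent queries reinforce each other.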

Code Example


# Example prompt pattern for a FlexSQL-style Text-to-SQL agent

SYSTEM_PROMPT = """
You are a text-to-SQL agent. You have access to these tools:
- GetSchema(schema_name): List all tables in a schema
- GetTableCol(table_name): Get columns and sample values
- GetColValues(column, table): Get distinct values in a column
- FindRows(term, column, table): Keyword search in column
- SQLExecutor(sql): Execute SQL and return results
- PythonExecutor(code): Execute Python with DB access

Rules:
1. Start with only schema names. Explore incrementally.
2. ALWAYS inspect actual column values before writing filters.
3. Generate K=3 diverse plans covering different table choices.
4. For each plan, choose SQL or Python based on complexity.
5. If execution fails with plan-level error, backtrack and re-explore.
"""

USER_QUERY = "Find patents filed in Q1 2014 in materials science"

# Example exploration order for the agent:
# Step 1: GetSchema('PATENT_DB') → list the tables
# Step 2: GetTableCol('TECH_CLASS') → inspect the column structure
# Step 3: GetColValues('field_code', 'TECH_CLASS') → check actual values (MS-01 to MS-09)
# Step 4: SQLExecutor('SELECT COUNT(*) FROM FILING_INFO WHERE filing_date BETWEEN ...')
# Step 5: Generate 3 plans (one per citation-table scope)
# Step 6: Implement each plan in SQL or Python
# Step 7: Select the final answer by majority voting

# Example diversity-enforced sampling prompt
DIVERSITY_PROMPT = """
Generate plan {k} that is DIFFERENT from these existing plans:
{existing_plans}

Explore alternative:
- Different table combinations
- Different join paths
- Different interpretations of ambiguous terms
"""

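The exploration tools named in the prompt could be backed by functions along these lines. This is a sketch over SQLite only: the snake_case function names, their signatures, and their return shapes are assumptions that mirror the tool list, not the FlexSQL code. SQLite has no named schemas, so `get_schema` simply lists the tables of the connected database.

```python
import sqlite3


def _ident(name):
    """Quote an identifier so table and column names cannot inject SQL."""
    return '"' + name.replace('"', '""') + '"'


def get_schema(conn):
    """GetSchema analog: list table names in the connected database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [r[0] for r in rows]


def get_table_col(conn, table, n_samples=3):
    """GetTableCol analog: column names, declared types, and a few sample rows."""
    cols = conn.execute(f"PRAGMA table_info({_ident(table)})").fetchall()
    samples = conn.execute(
        f"SELECT * FROM {_ident(table)} LIMIT ?", (n_samples,)
    ).fetchall()
    return {"columns": [(c[1], c[2]) for c in cols], "samples": samples}


def get_col_values(conn, column, table, limit=20):
    """GetColValues analog: distinct values of one column, capped at `limit`."""
    sql = (
        f"SELECT DISTINCT {_ident(column)} FROM {_ident(table)} "
        "ORDER BY 1 LIMIT ?"
    )
    return [r[0] for r in conn.execute(sql, (limit,))]


def find_rows(conn, term, column, table, limit=5):
    """FindRows analog: substring search inside one column."""
    sql = f"SELECT * FROM {_ident(table)} WHERE {_ident(column)} LIKE ? LIMIT ?"
    return conn.execute(sql, (f"%{term}%", limit)).fetchall()
```

Capping every tool's output with a `LIMIT` keeps each response small, which is the point of exploring incrementally instead of loading all metadata up front.
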
Terminology

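Text-to-SQL: Technology that automatically converts a natural-language question into a SQL query. Ask 'show me the top 10 products by revenue in Q1 2024' and it writes the SELECT statement for you.

Schema Linking: The process of mapping a natural-language question to the tables and columns it needs. When the question mentions 'materials science', this is the task of finding which database column it corresponds to.

Majority Voting: Generating several answers to the same question and selecting the most frequent result as the answer. The same principle as choosing a winner by majority in an election.

Diversity-enforced Sampling: A sampling technique that explicitly instructs the LLM to 'produce an idea different from the ones already given', to prevent the mode collapse in which it keeps generating similar answers.

Bilingual Generation: Here, generating code in whichever of two languages, SQL or Python, suits the problem better. Iterative or conditional logic that is hard in SQL is solved in Python first and converted to SQL afterwards.

Plan Backtracking: A mechanism that returns to the planning stage and redesigns from scratch when a fundamental error that code edits cannot fix (such as a wrong table choice) is found. Similar to a GPS recalculating the route.

Pass@K: The fraction of questions for which at least one of K generated answers is correct. With K=8, it is the chance of being right at least once in 8 attempts.

Majority@K: The accuracy obtained when the most frequent result among K answers is selected as the final answer. A more practical evaluation metric than Pass@K, since it requires no knowledge of the correct answer at selection time.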
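
The difference between the two metrics can be made concrete with a short sketch. It follows the glossary definitions literally (per question, K sampled answers compared against one gold answer) and uses made-up answers; the benchmark's actual scoring script may compare results differently.

```python
from collections import Counter


def pass_at_k(samples, gold):
    """1.0 if any of the K sampled answers matches the gold answer."""
    return float(any(s == gold for s in samples))


def majority_at_k(samples, gold):
    """1.0 if the most frequent sampled answer matches the gold answer."""
    if not samples:
        return 0.0
    top, _count = Counter(samples).most_common(1)[0]
    return float(top == gold)


def benchmark(runs):
    """Average both metrics over a list of (samples, gold) pairs."""
    n = len(runs)
    return (
        sum(pass_at_k(s, g) for s, g in runs) / n,
        sum(majority_at_k(s, g) for s, g in runs) / n,
    )


# Made-up answers for three questions, K=4 each.
runs = [
    (["17", "17", "22", "17"], "17"),  # correct answer is also the majority
    (["5", "9", "9", "7"], "5"),       # correct answer appears but is outvoted
    (["3", "3", "4", "4"], "8"),       # correct answer never appears
]
pass_score, majority_score = benchmark(runs)
print(pass_score, majority_score)
```

The second question shows why Pass@K is an upper bound on Majority@K: the correct answer was sampled, so it counts toward Pass@K, but voting picks the wrong one.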

Original Abstract

Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems follow a fixed pipeline where schema elements are retrieved once upfront and the database is only revisited for post-hoc repair, limiting recovery from early mistakes. We present FlexSQL, a text-to-SQL agent whose core design principle is flexible database interaction: the agent can explore schema structure, inspect data values, and run verification queries at any point during reasoning. FlexSQL generates diverse execution plans to cover multiple query interpretations, implements each plan in either SQL or Python depending on the task, and uses a two-tiered repair mechanism that can backtrack from code-level errors to plan-level revisions. On Spider2-Snow, using gpt-oss-120b, FlexSQL achieves a 65.4% score, outperforming strong open-source baselines that use stronger, larger models such as gpt-o3 and DeepSeek-R1. When integrated into a general-purpose coding agent (as skills in Claude Code), our approach yields over 10% relative improvement on Spider2-Snow. Further analysis shows that flexible exploration and flexible execution jointly contribute to the effectiveness of our approach, highlighting flexibility as a key design principle. Our code is available at: https://github.com/StringNLPLAB/FlexSQL