TAHOE: Text-to-SQL with Automated Hint Optimization from Experience | AI Paper Digest

TL;DR Highlight

A system that accumulates hints learned from an LLM's SQL generation failures in a reusable Hint Bank, substantially improving accuracy on Snowflake-dialect SQL without retraining the model.

Who Should Read

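Backend and data engineers who want to ship a Text-to-SQL feature (querying databases in natural language) to production, especially developers who want to reduce LLM errors in a specific SQL dialect such as Snowflake.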

Core Mechanics

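SQL generation failures are classified as syntax errors or semantic errors. Hints learned automatically from compiler feedback (syntax) and from execution-result feedback (semantic) are accumulated in the Hint Bank.
The Hint Bank is managed like an external file rather than as model parameters, so a Hint Bank learned with GPT-5.5 can be plugged into Doubao-2.0-lite or GPT-5 unchanged: the design is model-agnostic.
The Strategy Layer is the core idea: several conflicting strategies are stored side by side under one natural-language trigger (e.g. 'log10 transformation'), and each strategy's success rate, harm rate, and support (number of supporting cases) are tracked so the best strategy can be chosen at inference time.
Inference runs in two stages: Logic Planning first decides which strategies to use, then SQL Synthesis generates the final dialect-correct SQL.
Without Strategy Attribution, strategies that were learned but are useless in practice interfere with inference. Turning Attribution on raises pass rate from 69.03% to 79.42%, a further +10.39pp.
Syntax Hints transfer well regardless of the query, whereas the benefit of Semantic Hints depends on how well the development set covers the real workload: on the held-out set, syntax hints add +8.93pp while semantic hints add only +1.90pp.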
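
The Strategy Layer's statistics can be illustrated with a small ranking routine. This is a minimal sketch, not Tahoe's actual selection logic: in the system described here the LLM weighs the statistics during Logic Planning, whereas the `rank_strategies` helper, the `success_rate - harm_rate` score, and the `max_inert` threshold below are hypothetical choices made for illustration. The `min_support=5` cutoff mirrors the "low support (<5 examples)" wording in the planning prompt.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    action: str
    success_rate: float
    harm_rate: float
    inert_rate: float
    support: int


def rank_strategies(strategies, min_support=5, max_inert=0.5):
    """Drop weakly supported or mostly inert strategies, then rank the rest."""
    eligible = [
        s for s in strategies
        if s.support >= min_support and s.inert_rate <= max_inert
    ]
    # Hypothetical score: reward observed successes, penalize observed harm.
    return sorted(eligible, key=lambda s: s.success_rate - s.harm_rate, reverse=True)


# Three competing strategies under the same trigger (illustrative numbers).
candidates = [
    Strategy("LOG(10, col + 1)", 0.85, 0.05, 0.10, 20),
    Strategy("LOG(10, NULLIF(col, 0))", 0.40, 0.35, 0.25, 12),
    Strategy("LN(col) / LN(10)", 0.90, 0.00, 0.10, 2),  # too few cases to trust
]
best = rank_strategies(candidates)[0]
print(best.action)
```

The third strategy has the highest success rate but is filtered out because only two cases support it, which is the kind of distinction the support count exists to make.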

Evidence

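With GPT-5.5, pass rate rises from 61.95% to 79.42% (+17.47pp), pass@4 from 72.57% to 87.61% (+15.04pp), and the Snowflake syntax pass rate from 96.24% to 100%.
Average compiler-feedback revision rounds drop from 2.79 to 0.12 per sampled candidate, roughly a 23-fold reduction: without hints each query needed several rounds of repair, while with hints almost every query passes on the first attempt.
On the weaker Doubao-2.0-lite, reusing the Hint Bank learned with GPT-5.5 as-is raises pass rate from 29.42% to 49.12% (+19.70pp) and pass@4 from 46.02% to 64.60% (+18.58pp), at zero retraining cost.
The SQLGenie-style RAG baseline actually falls below Vanilla on the held-out set, by 0.86pp in pass rate and 4.44pp in syntax pass rate: example-based RAG turns into noise when no similar example exists.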
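
To make the two headline metrics concrete, the sketch below computes them on toy data. It assumes that pass rate is averaged over every sampled candidate and that pass@k counts a question as solved when any of its first k candidates is correct; the first assumption is inferred from the "per sampled candidate" phrasing in the abstract, not stated as the paper's formal definition. The function names and the data are hypothetical.

```python
def pass_rate(results):
    """Fraction of all sampled candidates that are correct."""
    flat = [ok for candidates in results for ok in candidates]
    return sum(flat) / len(flat)


def pass_at_k(results, k):
    """Fraction of questions with at least one correct candidate among the first k."""
    return sum(any(candidates[:k]) for candidates in results) / len(results)


# Toy data: 3 questions x 4 sampled candidates (True = execution result matched).
results = [
    [True, True, False, True],
    [False, False, False, True],
    [False, False, False, False],
]
print(f"pass rate = {pass_rate(results):.2%}")
print(f"pass@4    = {pass_at_k(results, 4):.2%}")
```

Under these definitions pass@4 is always at least as high as pass rate, which matches the ordering of the reported figures.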

How to Apply

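If you run Text-to-SQL on Snowflake or another specific dialect, collect compile-failure logs, extract dialect rules from them in the manner of the Syntax Learning Agent (e.g. identifier quoting, case-sensitivity rules), and store them as Syntax Hints. These hints can be injected into every query unconditionally and remain reusable when you switch models.
If similar natural-language phrasings need different SQL logic for different users (e.g. 'top product' meaning LIMIT 1 versus all rows tied for first place), adopt a Strategy Layer that attaches multiple Strategies to one trigger and tracks their success and failure cases, so that the variants coexist without conflict.
If you must swap models often without SFT or fine-tuning, externalize the Hint Bank as JSON or a database and inject it into the new model unchanged. Hints learned with GPT-5.5 are reported to yield double-digit pp gains when plugged into GPT-5 or Doubao.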
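
Externalizing the Hint Bank can be as simple as a JSON file that any model's prompt builder reads. The sketch below is one possible layout, not the paper's storage format: the top-level keys `syntax_hints` and `semantic_hints`, the file name, and the helper functions are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical on-disk layout for a Hint Bank.
hint_bank = {
    "syntax_hints": [
        {"rule": "Quote every identifier exactly as it is stored."},
    ],
    "semantic_hints": [
        {
            "trigger": "log10 transformation of counts",
            "strategies": [{"preferred_action": "Apply log10(column + 1)"}],
        },
    ],
}


def save_hint_bank(bank, path):
    Path(path).write_text(json.dumps(bank, indent=2, ensure_ascii=False), encoding="utf-8")


def load_hint_bank(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))


def render_syntax_hints(bank):
    """Syntax hints are query-independent, so every rule goes into every prompt."""
    return "\n".join(f"- {h['rule']}" for h in bank["syntax_hints"])


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "hint_bank.json"
    save_hint_bank(hint_bank, path)
    reloaded = load_hint_bank(path)  # what a different backbone model would load

syntax_block = render_syntax_hints(reloaded)
print(syntax_block)
```

Because nothing in the file refers to a particular model, switching backbones only changes which client receives the rendered prompt.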

Code Example

# Example Syntax Hint structure (Snowflake identifier quoting rule)
syntax_hint = {
    "rule": "Quote every database, schema, table, column, CTE, and alias exactly as it is stored; quote each element of a fully-qualified path separately.",
    "example": {
        "schema": "sales.orders(orderId, order_date)",
        "question": "Count all orders.",
        "correct_sql": 'SELECT COUNT(*) AS "total" FROM "SALES"."ORDERS";'
    }
}

# Example Semantic Hint structure (log10 transformation)
semantic_hint = {
    "trigger": "log10 transformation of counts",
    "scope": "General",
    "strategies": [
        {
            "rationale": "When counts can be zero, add 1 before log to avoid -infinity",
            "preferred_action": "Apply log10(column + 1)",
            "preferred_sql": 'LOG(10, "{COLUMN}" + 1)',
            "wrong_action": "Replace zeros with NULL before log (silently drops rows)",
            "wrong_sql": 'LOG(10, NULLIF("{COLUMN}", 0))',
            "recency": "2026-06-01T00:00:00Z",
            "eval_stats": {
                "success_rate": 0.85,
                "harm_rate": 0.05,
                "inert_rate": 0.10,
                "support": 20
            }
        }
    ]
}

# Logic Planning prompt structure at inference time
logic_planning_prompt = """
You are a SQL Logic Planner for Snowflake.

User Question: {question}
Database Schema: {schema}

Retrieved Semantic Hints:
{semantic_hints_with_stats}

Syntax Rules (always apply):
{syntax_hints}

Instructions:
1. Review each strategy's success_rate and harm_rate.
2. Prefer strategies with high success_rate (>0.7) and low harm_rate (<0.1).
3. Ignore strategies with high inert_rate or low support (<5 examples).
4. Produce a Logic Plan describing which strategies to apply and why.

Logic Plan:
"""

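The planning prompt covers only the first of the two inference stages. A minimal sketch of chaining Logic Planning into SQL Synthesis follows; the shortened templates, the `generate_sql` function, and the `fake_llm` stand-in are hypothetical, and a real deployment would replace `fake_llm` with an actual model client.

```python
from typing import Callable

# Shortened, hypothetical templates for the two stages.
PLAN_TEMPLATE = "Question: {question}\nSchema: {schema}\nHints:\n{hints}\nLogic Plan:"
SQL_TEMPLATE = "Logic Plan:\n{plan}\nSyntax Rules:\n{syntax}\nSnowflake SQL:"


def generate_sql(question: str, schema: str, hints: str, syntax: str,
                 llm: Callable[[str], str]) -> str:
    """Run Logic Planning first, then feed the plan into SQL Synthesis."""
    plan = llm(PLAN_TEMPLATE.format(question=question, schema=schema, hints=hints))
    return llm(SQL_TEMPLATE.format(plan=plan, syntax=syntax))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if prompt.endswith("Logic Plan:"):
        return "Count rows of ORDERS."
    return 'SELECT COUNT(*) AS "total" FROM "SALES"."ORDERS";'


sql = generate_sql(
    question="Count all orders.",
    schema="sales.orders(orderId, order_date)",
    hints="(none retrieved)",
    syntax="Quote every identifier exactly as it is stored.",
    llm=fake_llm,
)
print(sql)
```

Passing the model as a plain callable keeps the pipeline independent of any one provider, in line with the model-agnostic design described earlier.
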
Terminology

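Text-to-SQL: Technology that automatically converts a natural-language question into a SQL query. You say 'show me last month's top 10 products by revenue' and the LLM writes the SELECT statement for you.

Hint Bank: An external knowledge store of rules and strategies the LLM learned from past failures. Like a developer's troubleshooting wiki, it records a problem once solved so it can be consulted directly next time.

Syntax Hint: A hint that captures a grammar rule of a SQL dialect, e.g. the single rule 'In Snowflake, wrap every column name in double quotes.'

Semantic Hint: A hint that captures business logic or user intent, e.g. the domain knowledge 'In GA4 data, identify visitors with USER_PSEUDO_ID, not USER_ID.'

Strategy Layer: A structure that stores several conflicting solutions for the same question pattern side by side and ranks them by statistics. It is like recording which fix succeeded more often when colleagues each proposed a different solution to the same problem.

SFT: Supervised Fine-Tuning, a training method that updates model parameters on example answers. It works well, but must be repeated whenever the schema or dialect changes, which makes it costly.

pass@k: An evaluation metric that counts a question as solved if at least one of k attempts is correct. pass@4 means one correct answer among four generations is enough.

Strategy Attribution: The process of measuring after the fact whether each strategy actually helped (success), hurt (harm), or had no effect (inert), and labeling it accordingly.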
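
The success, harm, and inert labels can be derived by comparing each evaluation case with and without the strategy applied. The sketch below assumes exactly that paired comparison; the source does not spell out how Tahoe assigns the labels, so the `attribute` and `eval_stats` helpers and the toy cases are illustrative only.

```python
from collections import Counter


def attribute(baseline_ok: bool, with_strategy_ok: bool) -> str:
    """Label one evaluation case by comparing runs without and with the strategy."""
    if with_strategy_ok and not baseline_ok:
        return "success"  # the strategy fixed a failing query
    if baseline_ok and not with_strategy_ok:
        return "harm"     # the strategy broke a passing query
    return "inert"        # outcome unchanged either way


def eval_stats(cases):
    """Aggregate per-case labels into the eval_stats fields of a strategy."""
    labels = Counter(attribute(b, w) for b, w in cases)
    n = len(cases)
    return {
        "success_rate": labels["success"] / n,
        "harm_rate": labels["harm"] / n,
        "inert_rate": labels["inert"] / n,
        "support": n,
    }


# Toy cases: (passed without strategy, passed with strategy).
cases = [(False, True), (False, True), (True, False), (True, True)]
stats = eval_stats(cases)
print(stats)
```

With this labeling the three rates always sum to 1, as they do in the `eval_stats` example in the Code Example section.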

Related Resources

https://spider2-sql.github.io/

Original Abstract

Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains difficult. Real deployments must handle strict SQL dialects, massive schemas, and evolving user preferences, while supervised fine-tuning is costly and rigid and agentic test-time scaling is expensive. We present Tahoe, a system that treats prompt optimization as a dynamic data management problem. Tahoe uses an error-driven hint learning pipeline across Development and Deployment to consolidate debugging traces into a structured Hint Bank. Compiler feedback is distilled into reusable Syntax Hints for dialect-specific rules, while execution and user feedback are converted into Semantic Hints for schema- and user-specific logic. Tahoe further introduces a Strategy Layer that models conflicting user intents as competing strategies under shared natural-language triggers, with recency signals and post-learning attribution statistics that summarize empirical success, harm, inertness, and support. At inference time, Tahoe retrieves relevant hints and guides the LLM through Logic Planning followed by SQL Synthesis. We implement and evaluate the development-phase workflow, leaving deployment-time human-feedback updates for future work. On Spider 2.0-Snow, Tahoe substantially improves Text-to-SQL without updating model parameters. On 113 supervised Spider 2.0-Snow-0212 examples using GPT-5.5, Tahoe raises pass rate from 61.95 percent to 79.42 percent and pass-at-4 from 72.57 percent to 87.61 percent, achieves 100 percent Snowflake syntax pass rate, and reduces average compiler-feedback critic rounds from 2.79 to 0.12 per sampled candidate. The same Hint Bank also transfers to weaker backbones, including a 19.7 percentage-point pass-rate gain on Doubao-2.0-lite.