Querywise Prompt Routing: Per-Query Prompt Selection for LLMs

Querywise Prompt Routing for Large Language Models

Jan 1, 2026 • Pankaj Singh

TL;DR Highlight

The best prompt differs from query to query: a routing technique that automatically picks the best-fitting prompt for each query.

Who Should Read

Backend and AI developers who maintain multiple prompt versions or run prompt A/B tests in LLM-based services, especially those who want to improve zero-shot prompt performance on reasoning tasks.

Core Mechanics

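Instead of applying one prompt to all traffic, train a router that picks a suitable prompt for each individual query.
The router is trained from prior prompt-response logs; scoring at inference requires neither gold answers nor additional LLM calls.
A preference model is trained offline, so prompts are scored at inference time without calling the LLM.
The router selects the highest-scoring of N candidate prompts (best-of-N), so enlarging the candidate pool may improve performance.
The method works in the natural-language prompt space and is not tied to a specific LLM, so it applies as-is to chat-oriented LLMs such as ChatGPT and Llama.
The learned reward generalizes to prompts and queries unseen during training, so the prompt pool can keep growing.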
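
The best-of-N selection step described above can be sketched with a pluggable scorer. This is a minimal illustration, not the paper's implementation: `route`, `Scorer`, and `toy_scorer` are hypothetical names, and `toy_scorer` is a hand-written stand-in for a learned preference model.

```python
from typing import Callable, Sequence

# Hypothetical type: any function mapping (query, prompt) to a score.
Scorer = Callable[[str, str], float]


def route(query: str, prompts: Sequence[str], scorer: Scorer) -> str:
    """Return the candidate prompt with the highest score for this query."""
    if not prompts:
        raise ValueError("need at least one candidate prompt")
    return max(prompts, key=lambda p: scorer(query, p))


def toy_scorer(query: str, prompt: str) -> float:
    """Stand-in for a learned preference model (assumption for illustration).

    Step-by-step prompts score higher as the query contains more operators;
    every other prompt gets a constant score.
    """
    n_ops = sum(query.count(op) for op in "+-*/")
    wants_steps = "step" in prompt.lower()
    return float(n_ops) if wants_steps else 1.5


prompts = ["Solve step by step.", "Reply with the number only."]
easy = route("What is 2 + 2?", prompts, toy_scorer)
hard = route("What is (12 / 4) + 3 * 7?", prompts, toy_scorer)
print(easy)
print(hard)
```

Because the scorer is just a function argument, swapping the toy heuristic for a trained model does not change the routing code, and no LLM call happens during selection.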

Evidence

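On standard arithmetic reasoning benchmarks, accuracy improves consistently over query-agnostic (distribution-level) prompting.
The method outperforms confidence-based selectors, and the gains hold across multiple LLM scales.
Ablation experiments confirm that the learned reward generalizes to both unseen prompts and unseen queries.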
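
To make the comparison against distribution-level prompting concrete, the following sketch computes the accuracy of the best single fixed prompt and the accuracy of a per-query router on held-out logs. The log rows and the router's choices are invented toy values for illustration only; they are not results from the paper, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical held-out logs: (query_id, prompt_id, correct) with correct in {0, 1}.
logs = [
    ("q1", "A", 1), ("q1", "B", 0),
    ("q2", "A", 0), ("q2", "B", 1),
    ("q3", "A", 1), ("q3", "B", 1),
    ("q4", "A", 1), ("q4", "B", 0),
]


def fixed_prompt_accuracy(logs):
    """Accuracy of each prompt when it is used for every query."""
    hits, total = defaultdict(int), defaultdict(int)
    for _, prompt_id, correct in logs:
        hits[prompt_id] += correct
        total[prompt_id] += 1
    return {p: hits[p] / total[p] for p in total}


def routed_accuracy(logs, choose):
    """Accuracy when choose(query_id) picks one prompt per query."""
    outcome = {(q, p): c for q, p, c in logs}
    queries = sorted({q for q, _, _ in logs})
    return sum(outcome[(q, choose(q))] for q in queries) / len(queries)


per_prompt = fixed_prompt_accuracy(logs)
best_fixed = max(per_prompt.values())

# Hypothetical router decisions; a real router would score each (query, prompt) pair.
router_choice = {"q1": "A", "q2": "B", "q3": "A", "q4": "A"}
routed = routed_accuracy(logs, router_choice.get)

print(per_prompt, best_fixed, routed)
```

In this toy data the best fixed prompt is wrong on one query that the other prompt answers correctly, which is exactly the gap a per-query router can close.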

How to Apply

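If you already have prompt logs (query + prompt + response), train a preference model offline and add a router layer that automatically selects the best-of-N prompt whenever a new query arrives.
If you run prompt A/B tests, use each prompt's historical logs as training data to initialize the router; afterwards, new prompt versions only need to be added to the candidate pool.
The same pipeline can in principle be applied to tasks other than reasoning as long as prompt-response logs are available, although the reported experiments cover reasoning benchmarks. This is useful when you need prompt-level gains without fine-tuning the model.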
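
As one possible first step for the workflow above, existing logs can be turned into pairwise preference data by comparing prompts that were tried on the same query. This is a sketch under assumptions: the `score` field and how it is obtained are hypothetical (the source does not specify the log schema), and `build_preference_pairs` is an illustrative helper name.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical log rows: query, prompt, and a quality score for the response.
logs = [
    {"query": "What is 3 + 5 * 2?", "prompt": "Calculate step by step.", "score": 1.0},
    {"query": "What is 3 + 5 * 2?", "prompt": "Answer immediately.", "score": 0.0},
    {"query": "What is 7 - 4?", "prompt": "Calculate step by step.", "score": 1.0},
    {"query": "What is 7 - 4?", "prompt": "Answer immediately.", "score": 1.0},
]


def build_preference_pairs(logs):
    """Return (query, preferred_prompt, rejected_prompt) triples.

    Only prompts logged for the same query are compared; pairs with equal
    scores carry no preference signal and are skipped.
    """
    by_query = defaultdict(list)
    for row in logs:
        by_query[row["query"]].append(row)

    pairs = []
    for query, rows in by_query.items():
        for a, b in combinations(rows, 2):
            if a["score"] == b["score"]:
                continue
            win, lose = (a, b) if a["score"] > b["score"] else (b, a)
            pairs.append((query, win["prompt"], lose["prompt"]))
    return pairs


pairs = build_preference_pairs(logs)
print(pairs)
```

Queries where every prompt performed equally well contribute nothing, so logs from A/B tests that expose several prompts to similar queries tend to yield the most training pairs.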

Code Example


# Conceptual implementation sketch of per-query prompt routing (simplified illustration)

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
import numpy as np

# 1. Prepare (query, prompt, score) records from past logs
# score: preference label based on response quality (e.g., 0 or 1)
logs = [
    {"query": "What is 3 + 5 * 2?", "prompt": "Calculate step by step.", "score": 1},
    {"query": "What is 3 + 5 * 2?", "prompt": "Answer immediately.", "score": 0},
    # ... more logs
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Embed each (query + prompt) pair and train the preference model
X = []
y = []
for log in logs:
    pair_text = log["query"] + " [SEP] " + log["prompt"]
    X.append(encoder.encode(pair_text))
    y.append(log["score"])

reward_model = LogisticRegression()
reward_model.fit(np.array(X), y)

# 3. Select the best-of-N prompt for a new query
def route_prompt(query: str, candidate_prompts: list[str]) -> str:
    scores = []
    for prompt in candidate_prompts:
        pair_text = query + " [SEP] " + prompt
        emb = encoder.encode(pair_text).reshape(1, -1)
        score = reward_model.predict_proba(emb)[0][1]  # positive class prob
        scores.append(score)
    best_idx = int(np.argmax(scores))
    return candidate_prompts[best_idx]

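# Usage example
candidates = [
    "Work through the problem step by step and give the final answer.",
    "Evaluate the expression as written and reply with the number only.",
    "First check the order of operations, then calculate.",
]
query = "What is the result of (12 / 4) + 3 * 7?"
best_prompt = route_prompt(query, candidates)
print(f"Selected prompt: {best_prompt}")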
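
The snippet above fits a pointwise classifier on 0/1 labels. A pairwise alternative, closer to the usual meaning of a preference model, fits a Bradley-Terry style objective on (preferred, rejected) prompt pairs. The sketch below is an assumption-laden illustration and not the paper's method: the hand-written `features` function stands in for learned embeddings, and the two training pairs are toy data.

```python
import math


def features(query: str, prompt: str) -> list[float]:
    """Hypothetical hand-written features; a real system would use embeddings."""
    n_ops = sum(query.count(op) for op in "+-*/")
    steps = 1.0 if "step" in prompt.lower() else 0.0
    return [steps, steps * n_ops, 1.0 - steps]


def score(w, query, prompt):
    """Linear reward: dot product of weights and pair features."""
    return sum(wi * xi for wi, xi in zip(w, features(query, prompt)))


def train(pairs, epochs=200, lr=0.1):
    """Gradient ascent on log sigmoid(score(preferred) - score(rejected))."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for query, win, lose in pairs:
            diff = [a - b for a, b in zip(features(query, win), features(query, lose))]
            margin = sum(wi * di for wi, di in zip(w, diff))
            grad_scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))  # 1 - sigmoid(margin)
            w = [wi + lr * grad_scale * di for wi, di in zip(w, diff)]
    return w


STEP = "Solve step by step."
DIRECT = "Reply with the number only."
HARD = "What is (12 / 4) + 3 * 7?"
EASY = "What is 2 + 2?"

# Toy preference pairs: (query, preferred prompt, rejected prompt).
pairs = [(HARD, STEP, DIRECT), (EASY, DIRECT, STEP)]
w = train(pairs)

print(score(w, HARD, STEP) > score(w, HARD, DIRECT))
print(score(w, EASY, DIRECT) > score(w, EASY, STEP))
```

Training on score differences means only the relative ordering of prompts for the same query matters, which fits logs where absolute quality scores are noisy but comparisons are reliable.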

Terminology

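Prompt Routing: A technique that automatically picks, from several prompts, the one best suited to the current question. It resembles a call center connecting each inquiry to the right agent by inquiry type.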

Preference Model: A model trained to judge which of two responses (or prompts) is better. It turns a "this one is better" preference into a numeric score.

Best-of-N: Scoring all N candidates and selecting the highest-scoring one. It works like answering several exam sheets and submitting the best one.

Zero-shot: Having the model perform a task from instructions alone, without examples. It is like solving an unfamiliar problem directly, with no worked examples.

Distribution-level Prompting: Using the same prompt for every query. It ignores per-query characteristics and fixes the single prompt that works best on average.

Proxy Reward: A score that measures quality indirectly, without the actual correct answer. It is like gauging ability from a mock exam score instead of the real exam score.

Ablation: An experiment that removes or changes a specific component of a model to see how much it contributes to performance.

Original Abstract

This paper treats prompt choice as a per-query decision problem for large language models, learning an offline proxy reward that can score query-prompt pairs without additional model calls or access to gold answers at inference time. Using prior prompt-response logs as demonstrations, the method trains a preference model over prompts and then selects a best-of-N instruction per query to boost arithmetic reasoning accuracy under strict zero-shot conditions. The pipeline reduces interaction cost by shifting evaluation and optimization offline, while preserving the natural-language prompt space so the approach remains model-agnostic and immediately deployable across chat-oriented LLMs. Experiments on standard reasoning benchmarks show consistent gains over distribution-level, query-agnostic prompting and over confidence-based selectors, with improvements holding across multiple LLM scales. Ablations confirm that the learned reward generalizes to unseen prompts and queries, enabling robust prompt routing at inference without additional gradient updates or tool-specific supervision.