DeepRetrieval: Hacking Real Search Engines and Retrievers with Large Language Models via Reinforcement Learning

Feb 28, 2025 • Pengcheng Jiang, Jiacheng Lin, Lang Cao +5

TL;DR Highlight

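A small 3B model trained with reinforcement learning (RL) to optimize its own queries outperforms GPT-4o and Claude-3.5-Sonnet on retrieval tasks, and more than doubles the previous state-of-the-art recall on literature search.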

Who Should Read

Backend and ML developers who want to improve retrieval quality in a RAG pipeline, and engineers building medical literature search or SQL automation systems.

Core Mechanics

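Optimizes queries with reinforcement learning (RL) alone, without labeled training data: retrieval metrics such as Recall and NDCG serve directly as the reward, and the model learns by trial and error.
A 3B-parameter model (based on Qwen2.5-3B-Instruct) scores higher than GPT-4o and Claude-3.5-Sonnet on literature search and SQL generation.
RL is more effective than SFT (supervised fine-tuning): it beats LEADS, an SFT model trained on GPT-4o distillation plus human annotation, by 2.6x on recall.
Including the <think> reasoning step improves performance substantially: when trained without reasoning, the model falls into a local minimum where it keeps expanding the query indefinitely.
BM25 + DeepRetrieval matches dense retrieval and beats it on some datasets, while running over 34x faster.
RL also beats SFT on SQL generation: without ground-truth SQL, it overtakes GPT-4o on the Spider benchmark by execution accuracy.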
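
To make "retrieval metrics as the reward" concrete, here is a minimal sketch of Recall@K and binary-relevance NDCG@K. The function names and the assumption that document IDs are unique strings are illustrative only and are not taken from the DeepRetrieval code.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG.

    Assumes each document ID appears at most once in `retrieved`.
    """
    # Each relevant hit contributes 1 / log2(position + 1), with 1-based positions.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    # The ideal ranking places every relevant document first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["d3", "d1", "d7", "d2"]   # hypothetical retriever output
gold = {"d1", "d2", "d9"}            # hypothetical relevant set
print(recall_at_k(ranking, gold, k=4), ndcg_at_k(ranking, gold, k=4))
```

Either value can be passed to a reward function after each rollout, so the only supervision needed is the set of relevant documents, not a reference query.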

Evidence

PubMed literature search, Recall@3K: DeepRetrieval 65.07% vs. previous SOTA (LEADS) 24.68%, a 2.6x improvement.
ClinicalTrials.gov trial search, Recall@3K: DeepRetrieval 63.18% vs. previous SOTA 32.11%, roughly a 2x improvement.
SQL generation (Spider): DeepRetrieval3B-Coder reaches 74.85% vs. 73.50% for GPT-4o and 66.05% for Claude-3.5-Sonnet, surpassing both with a 3B model.
Speed, BM25 + DeepRetrieval vs. dense retrieval: on 5.42M documents and 13,332 queries, BM25 takes 352 s vs. 12,232 s for dense retrieval, over 34x faster.
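
The multipliers quoted above follow directly from the reported numbers. A quick arithmetic check, using only figures from this section (the variable names are illustrative):

```python
def ratio(new: float, old: float) -> float:
    """How many times larger `new` is than `old`."""
    return new / old


pubmed_gain = ratio(65.07, 24.68)    # PubMed Recall@3K vs. LEADS
trial_gain = ratio(63.18, 32.11)     # ClinicalTrials.gov Recall@3K vs. previous SOTA
speedup = ratio(12232, 352)          # dense retrieval seconds / BM25 seconds
sql_margin = 74.85 - 73.50           # Spider accuracy margin over GPT-4o, in points

print(f"PubMed: {pubmed_gain:.2f}x")          # PubMed: 2.64x
print(f"Trials: {trial_gain:.2f}x")           # Trials: 1.97x
print(f"Speedup: {speedup:.2f}x")             # Speedup: 34.75x
print(f"SQL margin: {sql_margin:.2f} points")  # SQL margin: 1.35 points
```

The retrieval gains are large multiples, while the SQL margin over GPT-4o is about 1.35 percentage points.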

How to Apply

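When a RAG pipeline needs query optimization: attach RL training with PPO or GRPO to a small model such as Qwen2.5-3B-Instruct and use the NDCG or Recall of the actual retriever (BM25 or dense) as the reward signal. This can raise retrieval quality without labels.
When building a medical literature search system or a PubMed API integration: take a PICO-format query (P, I, C, O) as input and write the prompt so that the model produces an augmented query using Boolean operators (AND/OR) in the <think>...</think><answer>...</answer> format.
When considering a switch from SFT to RL in a Text-to-SQL pipeline: designing the reward from execution accuracy alone, without ground-truth SQL, lets the model explore more diverse SQL strategies than SFT and may improve performance.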
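
For the Text-to-SQL case, an execution-based reward could look roughly like the sketch below. It assumes the expected result rows for each question are available and compares them with what the generated SQL returns; the function name, the reward values (1.0, 0.0, -1.0), and the toy `singer` table are all hypothetical and not taken from the paper.

```python
import sqlite3
from collections import Counter
from typing import List, Tuple


def execution_reward(
    conn: sqlite3.Connection,
    predicted_sql: str,
    expected_rows: List[Tuple],
) -> float:
    """Reward a generated query by its execution result, not by its text.

    Hypothetical values: 1.0 for a matching result set, 0.0 for a valid
    query with the wrong result, -1.0 for a query that fails to run.
    In practice, run generated SQL against a read-only copy of the database.
    """
    try:
        rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return -1.0
    # Compare as multisets so that row order does not matter.
    return 1.0 if Counter(rows) == Counter(expected_rows) else 0.0


# Toy database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 31), ("Bo", 24), ("Cy", 45)],
)
expected = [("Ann",), ("Cy",)]

print(execution_reward(conn, "SELECT name FROM singer WHERE age > 30", expected))
print(execution_reward(conn, "SELECT name FROM singer", expected))
print(execution_reward(conn, "SELEC name FROM singer", expected))
```

Because only the result set is compared, two differently written queries that return the same rows earn the same reward, which is what leaves room for the model to explore alternative SQL strategies.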

Code Example

# Example DeepRetrieval-style query generation prompt (PubMed literature search)
SYSTEM_PROMPT = """
A conversation between User and Assistant. The Assistant is a clinical specialist
conducting a medical literature review. The task is to create query terms for PubMed.

The Assistant shows thinking in <think></think> tags and returns the final answer
in <answer></answer> tags as JSON.

The query must use Boolean operators (AND, OR) and parentheses for grouping.
"""

USER_TEMPLATE = """
The research is defined by the following PICO:
P: {population}
I: {intervention}
C: {comparison}
O: {outcome}

Please generate an optimized PubMed search query.
"""

# Example model output (after DeepRetrieval training)
# <think>
# The PICO describes desmopressin use in perioperative settings to minimize blood transfusion.
# Key terms: DDAVP (synonym for desmopressin), perioperative, blood transfusion, RCT
# </think>
# <answer>
# {
#   "query": "((DDAVP OR desmopressin) AND (perioperative OR surgery) AND (blood transfusion OR allogeneic transfusion) AND (randomized controlled trial))"
# }
# </answer>

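# Example RL reward function (recall-based)
# Maps Recall@K (a fraction in [0, 1]) to a tiered reward; recall below 0.05 is penalized.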
def compute_retrieval_reward(recall_at_k: float) -> float:
    if recall_at_k >= 0.7:
        return 5.0
    elif recall_at_k >= 0.5:
        return 4.0
    elif recall_at_k >= 0.4:
        return 3.0
    elif recall_at_k >= 0.3:
        return 1.0
    elif recall_at_k >= 0.1:
        return 0.5
    elif recall_at_k >= 0.05:
        return 0.1
    else:
        return -3.5
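
The prompt, output format, and reward above can be wired together roughly as follows. This is a sketch under assumptions: `extract_query`, `score_completion`, and the `format_penalty` default of -4.0 are hypothetical names and values, and the retriever call and reward function are passed in as callables rather than taken from any real API.

```python
import json
import re
from typing import Callable, Optional

# Expect a <think> block followed by an <answer> block that holds JSON.
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def extract_query(completion: str) -> Optional[str]:
    """Return the "query" string from a well-formed completion, else None."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    query = payload.get("query") if isinstance(payload, dict) else None
    return query if isinstance(query, str) and query.strip() else None


def score_completion(
    completion: str,
    recall_fn: Callable[[str], float],   # runs the query, returns Recall@K
    reward_fn: Callable[[float], float],  # maps recall to a reward
    format_penalty: float = -4.0,         # hypothetical value
) -> float:
    """Penalize malformed output; otherwise reward the retrieval result."""
    query = extract_query(completion)
    if query is None:
        return format_penalty
    return reward_fn(recall_fn(query))


sample = (
    "<think>DDAVP is a synonym for desmopressin.</think>\n"
    '<answer>{"query": "(DDAVP OR desmopressin) AND surgery"}</answer>'
)
print(extract_query(sample))
```

Checking the format before touching the search engine avoids spending a retrieval call on completions that cannot be parsed.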

Terminology

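PPO: Proximal Policy Optimization, a reinforcement learning algorithm. It maximizes reward while limiting the size of each update so that the model does not change too abruptly. Think of it as climbing a mountain in small, safe steps.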

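GRPO: Group Relative Policy Optimization, a simplified variant of PPO. Popularized by its use in DeepSeek-R1, it learns from relative rewards within a group of samples, without a separate value function (critic).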

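NDCG: Normalized Discounted Cumulative Gain, a metric for the quality of search results. The higher relevant documents are ranked, the higher the score. Unlike plain accuracy, it also accounts for ordering.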

BM25: A traditional retrieval algorithm based on keyword frequency and document length. It is fast and strong without deep learning, so it is still widely used as a baseline.

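SFT: Supervised Fine-Tuning, which trains a model to imitate reference examples. More good examples help, but they are expensive to collect, and the model struggles to go beyond the limits of its examples.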

Recall@K: The percentage of all relevant documents that appear in the top K results. With K=3000, it measures how many of the relevant documents fall within the top 3,000 results.

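Dense Retrieval: A method that converts documents and queries into vectors (arrays of numbers) and retrieves by semantic similarity. It finds matches even when the keywords differ, as long as the meaning is similar. It is the counterpart to sparse methods such as BM25.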

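PICO: A standard framework for formulating medical research questions, consisting of P (Patient/Problem), I (Intervention), C (Comparison), and O (Outcome). It is widely used in medical search such as PubMed.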
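
As an illustration of the "relative rewards within a group" idea in the GRPO entry, the sketch below normalizes the rewards of several completions sampled for the same prompt. The function name is hypothetical, and using the population standard deviation with a small epsilon is one common choice; actual implementations may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    Completions that beat the group average get a positive advantage and
    the rest get a negative one, so no learned critic is required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four queries generated from one prompt.
group = [5.0, 1.0, -3.5, 1.0]
print(group_relative_advantages(group))
```

When all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal for the step.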

Original Abstract

Information retrieval systems are crucial for enabling effective access to large document collections. Recent approaches have leveraged Large Language Models (LLMs) to enhance retrieval performance through query augmentation, but often rely on expensive supervised learning or distillation techniques that require significant computational resources and hand-labeled data. We introduce DeepRetrieval, a reinforcement learning (RL) approach that trains LLMs for query generation through trial and error without supervised data (reference query). Using retrieval metrics as rewards, our system generates queries that maximize retrieval performance. DeepRetrieval outperforms leading methods on literature search with 65.07% (vs. previous SOTA 24.68%) recall for publication search and 63.18% (vs. previous SOTA 32.11%) recall for trial search using real-world search engines. DeepRetrieval also dominates in evidence-seeking retrieval, classic information retrieval and SQL database search. With only 3B parameters, it outperforms industry-leading models like GPT-4o and Claude-3.5-Sonnet on 11/13 datasets. These results demonstrate that our RL approach offers a more efficient and effective paradigm for information retrieval. Our data and code are available at: https://github.com/pat-jj/DeepRetrieval.