Decide Then Retrieve: A Training-Free Framework with Uncertainty-Guided Triggering and Dual-Path Retrieval
TL;DR Highlight
A framework that reduces RAG noise by first judging whether retrieval is needed based on LLM uncertainty (instead of always retrieving), then searching via two parallel paths — the original query and a pseudo-document.
Who Should Read
Backend/AI developers struggling with degraded answer quality due to unnecessary retrieval noise in RAG pipelines — especially teams dealing with poor retrieval quality on short or ambiguous queries.
Core Mechanics
- Standard RAG unconditionally triggers retrieval for every query, so noisy documents can corrupt answers the LLM already knows; DTR skips retrieval entirely when the uncertainty (mean per-token negative log-likelihood) of the LLM's directly generated answer falls below a threshold
- To address poor retrieval quality on short, sparse queries, DTR searches via two parallel paths — the original query and an LLM-generated pseudo-answer (pseudo-context) — then combines results, achieving higher ground-truth document recall than a single-path approach
- Candidate documents from both paths are re-ranked using a geometric formula (cos(θ1+θ2)) that assigns higher scores to documents similar to both the query and the pseudo-answer, selecting the final top-k
- The training-free architecture allows plug-in integration with any LLM (e.g., Qwen2.5-7B, 72B) and works equally well with different retrievers such as bge and e5
- An LLM Judge approach (7B model) almost never triggers retrieval (0.1%), making it equivalent to no-retrieval, while a 72B Judge triggers it too often and introduces noise — uncertainty-based triggering is more stable
- Experiments demonstrate that forcing retrieval on low-uncertainty queries (where the model is confident) actually degrades accuracy
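The re-ranking score in the bullets above relies on the angle-addition identity cos(θ1+θ2) = cos θ1·cos θ2 − sin θ1·sin θ2, which can be computed from the two cosine similarities alone. A minimal sketch of that scoring rule with illustrative similarity values (the numbers are not from the paper):

```python
import math

def ais_score(s1: float, s2: float) -> float:
    """cos(theta1 + theta2) from the query-document (s1) and pseudo-answer-document (s2) cosines."""
    return s1 * s2 - math.sqrt(max(0.0, 1 - s1 ** 2)) * math.sqrt(max(0.0, 1 - s2 ** 2))

# A document close to both the query and the pseudo-answer outscores
# one that is very close to the query but unrelated to the pseudo-answer.
balanced = ais_score(0.9, 0.8)
lopsided = ais_score(0.95, 0.1)
print(balanced > lopsided)  # True
```

This is why the joint score penalizes documents that match only one of the two paths: the sine terms grow as either similarity drops toward zero.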
Evidence
- With Qwen2.5-7B, Standard RAG averaged EM/F1 of 35.81/45.81 vs. DTR's 37.87/48.08 (average across 5 QA benchmarks)
- With Qwen2.5-72B, Standard RAG averaged 38.83/50.73 vs. DTR's 40.46/52.14
- HotpotQA retrieval accuracy (Recall@3): Standard RAG 61.9% → DTR 62.7% (bge), and 59.3% → 62.6% (e5). In contrast, HyDE, Q2D, and CoT all scored below Standard RAG at 49.9–55.2%
- Experiments confirm that skipping retrieval for only the ~20–30% of queries with the lowest uncertainty preserves most of the maximum achievable accuracy
How to Apply
- When generating an answer with the LLM, extract per-token log probabilities and compute u = -(1/T) * log P(a|q), i.e., the mean negative log probability over the T answer tokens; if u is below the threshold τ (e.g., τ = 0.005), return that answer directly without retrieval, and tune τ based on the accuracy vs. retrieval cost trade-off
- For queries that require retrieval, run two parallel searches: top-n retrieval using the original query, and top-n retrieval using a pseudo-document generated by prompting the LLM with 'Write a passage to answer this question'; then re-rank the 2n candidates using s(d)=cos(θ1+θ2) to select the final top-k
- Add the UGT layer in front of your existing RAG system and replace the retrieval module with DPR-AIS — applicable without fine-tuning, and especially effective for customer support or internal Q&A systems with many short keyword queries or ambiguous questions
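The threshold in the first step above is the main knob to tune. One simple way to set it is a sweep over a small labeled dev set, skipping retrieval for queries whose uncertainty falls below each candidate value. A hedged sketch: the record format, helper name, and candidate grid are illustrative, not from the paper:

```python
def pick_threshold(records, candidates):
    """records: (uncertainty, correct_without_retrieval, correct_with_retrieval) per query.
    Returns the candidate threshold maximizing dev accuracy when queries with
    u < threshold skip retrieval and all others retrieve."""
    def accuracy(t):
        return sum(nr if u < t else wr for u, nr, wr in records) / len(records)
    return max(candidates, key=accuracy)

# Illustrative dev records: low-uncertainty queries tend to be right without retrieval.
dev = [
    (0.001, 1, 0),  # confident and correct; retrieval would have hurt
    (0.002, 1, 1),
    (0.003, 1, 0),
    (0.020, 0, 1),  # uncertain; retrieval fixes the answer
    (0.050, 0, 1),
]
best = pick_threshold(dev, [0.0, 0.0025, 0.005, 0.01, 0.1])
print(best)  # 0.005
```

On real data the two correctness columns come from running the pipeline once with retrieval forced on and once forced off over the dev queries.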
Code Example
import numpy as np

def compute_uncertainty(log_probs: list[float]) -> float:
    """Compute uncertainty as the mean negative log probability per token."""
    T = len(log_probs)
    return -sum(log_probs) / T

def should_retrieve(query: str, llm, threshold: float = 0.005) -> tuple[bool, str]:
    """UGT: Determine whether to trigger retrieval based on uncertainty."""
    result = llm.generate_with_logprobs(query + "\nAnswer the question using a single word or phrase.")
    uncertainty = compute_uncertainty(result.log_probs)
    return uncertainty > threshold, result.text

def dual_path_retrieval(query: str, pseudo_context: str, retriever, n: int = 5, k: int = 3) -> list:
    """DPR-AIS: Search via two paths (query + pseudo-document) and re-score."""
    q_emb = retriever.encode(query)  # embeddings assumed L2-normalized
    p_emb = retriever.encode(pseudo_context)
    docs_q = retriever.search(q_emb, top_k=n)  # query path
    docs_p = retriever.search(p_emb, top_k=n)  # pseudo-document path
    candidates = list(set(docs_q + docs_p))    # union of up to 2n candidates
    # AIS: cos(θ1 + θ2) = s1*s2 - sqrt(1-s1²)*sqrt(1-s2²)
    scored = []
    for doc in candidates:
        d_emb = retriever.encode(doc)
        s1 = np.dot(q_emb, d_emb)  # query-document similarity
        s2 = np.dot(p_emb, d_emb)  # pseudo-document-document similarity
        joint_score = s1 * s2 - np.sqrt(max(0, 1 - s1**2)) * np.sqrt(max(0, 1 - s2**2))
        scored.append((doc, joint_score))
    return [doc for doc, _ in sorted(scored, key=lambda x: -x[1])[:k]]

# Usage example
def dtr_answer(query: str, llm, retriever) -> str:
    needs_retrieval, initial_answer = should_retrieve(query, llm, threshold=0.005)
    if not needs_retrieval:
        return initial_answer  # return directly if the model is confident
    # Generate a pseudo-document, then run dual-path retrieval
    pseudo = llm.generate(query + "\nWrite a passage to answer this question.")
    docs = dual_path_retrieval(query, pseudo, retriever)
    context = "\n".join(docs)
    return llm.generate(f"{query}\n{context}\nAnswer using a single word or phrase.")
Original Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but existing approaches indiscriminately trigger retrieval and rely on single-path evidence construction, often introducing noise and limiting performance gains. In this work, we propose Decide Then Retrieve (DTR), a training-free framework that adaptively determines when retrieval is necessary and how external information should be selected. DTR leverages generation uncertainty to guide retrieval triggering and introduces a dual-path retrieval mechanism with adaptive information selection to better handle sparse and ambiguous queries. Extensive experiments across five open-domain QA benchmarks, multiple model scales, and different retrievers demonstrate that DTR consistently improves EM and F1 over standard RAG and strong retrieval-enhanced baselines, while reducing unnecessary retrievals. The code and data used in this paper are available at https://github.com/ChenWangHKU/DTR.