Decide Then Retrieve: A Training-Free Framework with Uncertainty-Guided Triggering and Dual-Path Retrieval
TL;DR Highlight
A framework that reduces RAG noise by first judging whether retrieval is needed based on LLM uncertainty (instead of always retrieving), then searching via two parallel paths — the original query and a pseudo-document.
Who Should Read
Backend/AI developers struggling with degraded answer quality due to unnecessary retrieval noise in RAG pipelines — especially teams dealing with poor retrieval quality on short or ambiguous queries.
Core Mechanics
- Standard RAG unconditionally triggers retrieval for every query, so noisy documents can corrupt answers the LLM already knows; DTR skips retrieval entirely when the uncertainty (mean per-token negative log-likelihood) of the LLM's directly generated answer falls below a threshold
- To address poor retrieval quality on short, sparse queries, DTR searches via two parallel paths — the original query and an LLM-generated pseudo-answer (pseudo-context) — then combines results, achieving higher ground-truth document recall than a single-path approach
- Candidate documents from both paths are re-ranked using a geometric formula (cos(θ1+θ2)) that assigns higher scores to documents similar to both the query and the pseudo-answer, selecting the final top-k
- The training-free architecture allows plug-in integration with any LLM (e.g., Qwen2.5-7B, 72B) and works equally well with different retrievers such as bge and e5
- An LLM Judge approach (7B model) almost never triggers retrieval (0.1%), making it equivalent to no-retrieval, while a 72B Judge triggers it too often and introduces noise — uncertainty-based triggering is more stable
- Experiments demonstrate that forcing retrieval on low-uncertainty queries (where the model is confident) actually degrades accuracy
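The re-ranking score in the bullets above relies on the angle-addition identity cos(θ1+θ2) = cos θ1·cos θ2 − sin θ1·sin θ2, which can be computed from the two cosine similarities alone. A minimal sketch of that scoring rule with illustrative similarity values (the numbers are not from the paper):

```python
import math

def ais_score(s1: float, s2: float) -> float:
    """cos(theta1 + theta2) from the query-document (s1) and pseudo-answer-document (s2) cosines."""
    return s1 * s2 - math.sqrt(max(0.0, 1 - s1 ** 2)) * math.sqrt(max(0.0, 1 - s2 ** 2))

# A document close to both the query and the pseudo-answer outscores
# one that is very close to the query but unrelated to the pseudo-answer.
balanced = ais_score(0.9, 0.8)
lopsided = ais_score(0.95, 0.1)
print(balanced > lopsided)  # True
```

This is why the joint score penalizes documents that match only one of the two paths: the sine terms grow as either similarity drops toward zero.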
Evidence
- With Qwen2.5-7B, Standard RAG averaged EM/F1 of 35.81/45.81 vs. DTR's 37.87/48.08 (average across 5 QA benchmarks)
- With Qwen2.5-72B, Standard RAG averaged 38.83/50.73 vs. DTR's 40.46/52.14
- HotpotQA retrieval accuracy (Recall@3): Standard RAG 61.9% → DTR 62.7% (bge), and 59.3% → 62.6% (e5). In contrast, HyDE, Q2D, and CoT all scored below Standard RAG at 49.9–55.2%
- Experiments confirm that skipping retrieval for only the ~20–30% of queries with the lowest uncertainty preserves most of the maximum achievable accuracy
How to Apply
- When generating an answer with the LLM, extract per-token log probabilities and compute u = -(1/T) * log P(a|q), i.e., the mean negative log probability over the T answer tokens; if u is below the threshold τ (e.g., τ = 0.005), return that answer directly without retrieval, and tune τ based on the accuracy vs. retrieval cost trade-off
- For queries that require retrieval, run two parallel searches: top-n retrieval using the original query, and top-n retrieval using a pseudo-document generated by prompting the LLM with 'Write a passage to answer this question'; then re-rank the 2n candidates using s(d)=cos(θ1+θ2) to select the final top-k
- Add the UGT layer in front of your existing RAG system and replace the retrieval module with DPR-AIS — applicable without fine-tuning, and especially effective for customer support or internal Q&A systems with many short keyword queries or ambiguous questions
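The threshold in the first step above is the main knob to tune. One simple way to set it is a sweep over a small labeled dev set, skipping retrieval for queries whose uncertainty falls below each candidate value. A hedged sketch: the record format, helper name, and candidate grid are illustrative, not from the paper:

```python
def pick_threshold(records, candidates):
    """records: (uncertainty, correct_without_retrieval, correct_with_retrieval) per query.
    Returns the candidate threshold maximizing dev accuracy when queries with
    u < threshold skip retrieval and all others retrieve."""
    def accuracy(t):
        return sum(nr if u < t else wr for u, nr, wr in records) / len(records)
    return max(candidates, key=accuracy)

# Illustrative dev records: low-uncertainty queries tend to be right without retrieval.
dev = [
    (0.001, 1, 0),  # confident and correct; retrieval would have hurt
    (0.002, 1, 1),
    (0.003, 1, 0),
    (0.020, 0, 1),  # uncertain; retrieval fixes the answer
    (0.050, 0, 1),
]
best = pick_threshold(dev, [0.0, 0.0025, 0.005, 0.01, 0.1])
print(best)  # 0.005
```

On real data the two correctness columns come from running the pipeline once with retrieval forced on and once forced off over the dev queries.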
Code Example
import numpy as np

def compute_uncertainty(log_probs: list[float]) -> float:
    """Compute uncertainty as the mean negative log probability per token."""
    T = len(log_probs)
    return -sum(log_probs) / T

def should_retrieve(query: str, llm, threshold: float = 0.005) -> tuple[bool, str]:
    """UGT: Determine whether to trigger retrieval based on uncertainty."""
    result = llm.generate_with_logprobs(query + "\nAnswer the question using a single word or phrase.")
    uncertainty = compute_uncertainty(result.log_probs)
    return uncertainty > threshold, result.text

def dual_path_retrieval(query: str, pseudo_context: str, retriever, n: int = 5, k: int = 3) -> list:
    """DPR-AIS: Search via two paths (query + pseudo-document) and re-score."""
    q_emb = retriever.encode(query)  # embeddings assumed L2-normalized
    p_emb = retriever.encode(pseudo_context)
    docs_q = retriever.search(q_emb, top_k=n)  # query path
    docs_p = retriever.search(p_emb, top_k=n)  # pseudo-document path
    candidates = list(set(docs_q + docs_p))    # union of up to 2n candidates
    # AIS: cos(θ1 + θ2) = s1*s2 - sqrt(1-s1²)*sqrt(1-s2²)
    scored = []
    for doc in candidates:
        d_emb = retriever.encode(doc)
        s1 = np.dot(q_emb, d_emb)  # query-document similarity
        s2 = np.dot(p_emb, d_emb)  # pseudo-document-document similarity
        joint_score = s1 * s2 - np.sqrt(max(0, 1 - s1**2)) * np.sqrt(max(0, 1 - s2**2))
        scored.append((doc, joint_score))
    return [doc for doc, _ in sorted(scored, key=lambda x: -x[1])[:k]]

# Usage example
def dtr_answer(query: str, llm, retriever) -> str:
    needs_retrieval, initial_answer = should_retrieve(query, llm, threshold=0.005)
    if not needs_retrieval:
        return initial_answer  # return directly if the model is confident
    # Generate a pseudo-document, then run dual-path retrieval
    pseudo = llm.generate(query + "\nWrite a passage to answer this question.")
    docs = dual_path_retrieval(query, pseudo, retriever)
    context = "\n".join(docs)
    return llm.generate(f"{query}\n{context}\nAnswer using a single word or phrase.")
Original Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but existing approaches indiscriminately trigger retrieval and rely on single-path evidence construction, often introducing noise and limiting performance gains. In this work, we propose Decide Then Retrieve (DTR), a training-free framework that adaptively determines when retrieval is necessary and how external information should be selected. DTR leverages generation uncertainty to guide retrieval triggering and introduces a dual-path retrieval mechanism with adaptive information selection to better handle sparse and ambiguous queries. Extensive experiments across five open-domain QA benchmarks, multiple model scales, and different retrievers demonstrate that DTR consistently improves EM and F1 over standard RAG and strong retrieval-enhanced baselines, while reducing unnecessary retrievals. The code and data used in this paper are available at https://github.com/ChenWangHKU/DTR.