Rethinking with Retrieval: Faithful Large Language Model Inference
TL;DR Highlight
A post-processing technique that searches external knowledge at each CoT reasoning step and selects the answer most faithful to facts.
Who Should Read
Developers tackling LLM hallucination, where a model confidently states wrong facts, and AI service builders who want to improve GPT-class reasoning accuracy without fine-tuning.
Core Mechanics
- Samples multiple reasoning paths using CoT (Chain-of-Thought), then uses each reasoning step as a query to search external knowledge (Wikipedia, Wikidata, etc.)
- Compares retrieved external knowledge with each reasoning path using an NLI (Natural Language Inference) model to score 'factual faithfulness', selecting the highest-scoring prediction
- Key insight: use the decomposed individual reasoning steps as search queries rather than the full reasoning path; querying with the original question alone is far less effective (commonsense: 73.36% vs. 77.73%)
- Works as post-processing on GPT-3 without fine-tuning or additional training — applicable to any LLM
- Identified two main GPT-3 error types: wrong supporting facts (e.g., incorrectly memorizing Lil Jon's biggest hit as 'Get Low') and incorrect reasoning from correct facts
- When applied to smaller OPT models (1.3B-30B), RR also yields consistently higher accuracy and factual faithfulness than CoT alone
Evidence
- Commonsense reasoning (StrategyQA): Self-consistency 73.36% → RR 77.73% (+4.37%p)
- Temporal reasoning (TempQuestions): Self-consistency 37.28% → RR 39.05% (+1.77%p)
- Tabular reasoning (INFOTABS): Self-consistency 84.00% → RR 84.83% (+0.83%p)
- Explanation faithfulness: CoT 38.73% → RR Variant II 54.54% (+15.81%p)
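The reported gains are plain accuracy differences in percentage points; a quick arithmetic check of the deltas above (task names and numbers taken from the list, nothing else assumed):

```python
# Self-consistency vs. RR accuracies (%) as reported above; deltas are simple differences.
results = {
    "StrategyQA (commonsense)": (73.36, 77.73),
    "TempQuestions (temporal)": (37.28, 39.05),
    "INFOTABS (tabular)": (84.00, 84.83),
}
for task, (sc, rr) in results.items():
    print(f"{task}: {sc} -> {rr} ({rr - sc:+.2f}%p)")
```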
How to Apply
- In RAG pipelines, instead of using the user question directly as a search query, first decompose reasoning steps with CoT and use each step as an individual search query; this surfaces more relevant documents
- When an LLM generates multiple answer candidates (temperature > 0 sampling), add a verification layer that validates each candidate's reasoning with external KB + NLI model and auto-selects the most factually faithful answer
- In high-stakes domains like finance, law, or medicine, where hallucinations are unacceptable, use BM25 to search trusted internal documents and add NLI-based fact verification scoring on GPT-4 responses
Code Example
from sentence_transformers import SentenceTransformer, util
from pyserini.search.lucene import LuceneSearcher
from transformers import pipeline

# 1. Multiple reasoning paths sampled with CoT prompting (temperature=0.7);
#    hard-coded here for illustration
reasoning_paths = [
    "Aristotle died in 2000. The first laptop was invented in 1980. So the answer is yes.",
    "Aristotle died in 322BC. The first laptop was invented in 2000. So the answer is no.",
    "Aristotle died in 322BC. The first laptop was invented in 1980. So the answer is no.",
]

# 2. BM25 search using each reasoning step (sentence) as a query
searcher = LuceneSearcher.from_prebuilt_index('wikipedia-dpr')

def retrieve_for_sentence(sentence, top_k=10):
    hits = searcher.search(sentence, k=top_k)
    return [hit.raw for hit in hits]

# 3. Select the most similar retrieved paragraph using MPNet embeddings
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

def get_most_similar_para(sentence, paragraphs):
    sent_emb = model.encode(sentence, convert_to_tensor=True)
    para_embs = model.encode(paragraphs, convert_to_tensor=True)
    scores = util.cos_sim(sent_emb, para_embs)[0]
    best_idx = scores.argmax().item()
    return paragraphs[best_idx], scores[best_idx].item()

# 4. Compute a faithfulness score with an NLI model:
#    entailment counts positively, contradiction negatively, neutral as zero
nli = pipeline('text-classification', model='cross-encoder/nli-deberta-v3-base')

def faithfulness_score(sentence, premise):
    # Pass premise/hypothesis as a text pair so the cross-encoder
    # receives two properly separated segments
    result = nli({'text': premise, 'text_pair': sentence})
    if isinstance(result, list):
        result = result[0]
    label = result['label'].lower()
    if label == 'entailment':
        return result['score']
    if label == 'contradiction':
        return -result['score']
    return 0.0

# 5. Score each path and select the prediction from the most faithful one
path_scores = {}
for path in reasoning_paths:
    sentences = path.split('. ')
    prediction = sentences[-1]           # 'So the answer is ...'
    score = 0.0
    for sent in sentences[:-1]:          # score only the supporting facts
        paras = retrieve_for_sentence(sent)
        best_para, sim = get_most_similar_para(sent, paras)
        score += faithfulness_score(sent, best_para)
    path_scores[path] = (score, prediction)
best_path = max(path_scores, key=lambda p: path_scores[p][0])
print(f"Final prediction: {path_scores[best_path][1]}")
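The selection loop above can be factored into a small reusable wrapper for plugging into a RAG pipeline as a verification layer. This is a sketch under the assumption that a per-step scoring function (such as the BM25 + NLI combination above) is supplied by the caller; `select_faithful_answer` and the toy fact set are illustrative names, not from the paper:

```python
from typing import Callable, List, Tuple

def select_faithful_answer(candidates: List[str],
                           step_score: Callable[[str], float]) -> Tuple[str, float]:
    """Return the candidate whose supporting steps score highest in aggregate."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        steps = cand.split(". ")[:-1]  # drop the final 'So the answer is ...' sentence
        total = sum(step_score(s) for s in steps)
        if total > best_score:
            best, best_score = cand, total
    return best, best_score

# Toy usage: a stand-in scorer over a tiny fact set (a real scorer would be
# BM25 retrieval plus an NLI model, as in the steps above).
kb = {"Aristotle died in 322BC", "The first laptop was invented in 1980"}
step_score = lambda step: 1.0 if step in kb else -1.0
answer, total = select_faithful_answer(
    ["Aristotle died in 2000. The first laptop was invented in 1980. So the answer is yes.",
     "Aristotle died in 322BC. The first laptop was invented in 1980. So the answer is no."],
    step_score,
)
print(answer, total)
```

Because the scorer is injected, the same wrapper works whether the knowledge source is Wikipedia, Wikidata, or a trusted internal document index.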
Original Abstract
Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs.