Rethinking with Retrieval: Faithful Large Language Model Inference
TL;DR Highlight
A post-processing technique that searches external knowledge at each CoT reasoning step and selects the answer most faithful to facts.
Who Should Read
Developers tackling LLM hallucination, where a model confidently states wrong facts, and AI service builders who want to improve GPT-class reasoning accuracy without fine-tuning.
Core Mechanics
- Samples multiple reasoning paths using CoT (Chain-of-Thought), then uses each reasoning step as a query to search external knowledge (Wikipedia, Wikidata, etc.)
- Compares retrieved external knowledge with each reasoning path using an NLI (Natural Language Inference) model to score 'factual faithfulness', selecting the highest-scoring prediction
- Key insight: use the decomposed individual reasoning steps as search queries rather than the full reasoning path; querying with the original question alone is far less effective (commonsense: 73.36% vs. 77.73%)
- Works as post-processing on GPT-3 without fine-tuning or additional training — applicable to any LLM
- Identified two main GPT-3 error types: wrong supporting facts (e.g., incorrectly memorizing Lil Jon's biggest hit as 'Get Low') and incorrect reasoning from correct facts
- When applied to smaller OPT models (1.3B-30B), RR also yields consistently higher accuracy and factual faithfulness than CoT alone
Evidence
- Commonsense reasoning (StrategyQA): Self-consistency 73.36% → RR 77.73% (+4.37%p)
- Temporal reasoning (TempQuestions): Self-consistency 37.28% → RR 39.05% (+1.77%p)
- Tabular reasoning (INFOTABS): Self-consistency 84.00% → RR 84.83% (+0.83%p)
- Explanation faithfulness: CoT 38.73% → RR Variant II 54.54% (+15.81%p)
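The reported gains are plain accuracy differences in percentage points; a quick arithmetic check of the deltas above (task names and numbers taken from the list, nothing else assumed):

```python
# Self-consistency vs. RR accuracies (%) as reported above; deltas are simple differences.
results = {
    "StrategyQA (commonsense)": (73.36, 77.73),
    "TempQuestions (temporal)": (37.28, 39.05),
    "INFOTABS (tabular)": (84.00, 84.83),
}
for task, (sc, rr) in results.items():
    print(f"{task}: {sc} -> {rr} ({rr - sc:+.2f}%p)")
```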
How to Apply
- In RAG pipelines, instead of using the user question directly as a search query, first decompose reasoning steps with CoT and use each step as an individual search query; this surfaces more relevant documents
- When an LLM generates multiple answer candidates (temperature > 0 sampling), add a verification layer that validates each candidate's reasoning with external KB + NLI model and auto-selects the most factually faithful answer
- In high-stakes domains like finance, law, or medicine, where hallucinations are unacceptable, use BM25 to search trusted internal documents and add NLI-based fact verification scoring on GPT-4 responses
Code Example
from sentence_transformers import SentenceTransformer, util
from pyserini.search.lucene import LuceneSearcher
from transformers import pipeline

# 1. Multiple reasoning paths sampled with CoT prompting (temperature=0.7);
#    hard-coded here for illustration
reasoning_paths = [
    "Aristotle died in 2000. The first laptop was invented in 1980. So the answer is yes.",
    "Aristotle died in 322BC. The first laptop was invented in 2000. So the answer is no.",
    "Aristotle died in 322BC. The first laptop was invented in 1980. So the answer is no.",
]

# 2. BM25 search using each reasoning step (sentence) as a query
searcher = LuceneSearcher.from_prebuilt_index('wikipedia-dpr')

def retrieve_for_sentence(sentence, top_k=10):
    hits = searcher.search(sentence, k=top_k)
    return [hit.raw for hit in hits]

# 3. Select the most similar retrieved paragraph using MPNet embeddings
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

def get_most_similar_para(sentence, paragraphs):
    sent_emb = model.encode(sentence, convert_to_tensor=True)
    para_embs = model.encode(paragraphs, convert_to_tensor=True)
    scores = util.cos_sim(sent_emb, para_embs)[0]
    best_idx = scores.argmax().item()
    return paragraphs[best_idx], scores[best_idx].item()

# 4. Compute a faithfulness score with an NLI model:
#    entailment counts positively, contradiction negatively, neutral as zero
nli = pipeline('text-classification', model='cross-encoder/nli-deberta-v3-base')

def faithfulness_score(sentence, premise):
    # Pass premise/hypothesis as a text pair so the cross-encoder
    # receives two properly separated segments
    result = nli({'text': premise, 'text_pair': sentence})
    if isinstance(result, list):
        result = result[0]
    label = result['label'].lower()
    if label == 'entailment':
        return result['score']
    if label == 'contradiction':
        return -result['score']
    return 0.0

# 5. Score each path and select the prediction from the most faithful one
path_scores = {}
for path in reasoning_paths:
    sentences = path.split('. ')
    prediction = sentences[-1]           # 'So the answer is ...'
    score = 0.0
    for sent in sentences[:-1]:          # score only the supporting facts
        paras = retrieve_for_sentence(sent)
        best_para, sim = get_most_similar_para(sent, paras)
        score += faithfulness_score(sent, best_para)
    path_scores[path] = (score, prediction)
best_path = max(path_scores, key=lambda p: path_scores[p][0])
print(f"Final prediction: {path_scores[best_path][1]}")
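The selection loop above can be factored into a small reusable wrapper for plugging into a RAG pipeline as a verification layer. This is a sketch under the assumption that a per-step scoring function (such as the BM25 + NLI combination above) is supplied by the caller; `select_faithful_answer` and the toy fact set are illustrative names, not from the paper:

```python
from typing import Callable, List, Tuple

def select_faithful_answer(candidates: List[str],
                           step_score: Callable[[str], float]) -> Tuple[str, float]:
    """Return the candidate whose supporting steps score highest in aggregate."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        steps = cand.split(". ")[:-1]  # drop the final 'So the answer is ...' sentence
        total = sum(step_score(s) for s in steps)
        if total > best_score:
            best, best_score = cand, total
    return best, best_score

# Toy usage: a stand-in scorer over a tiny fact set (a real scorer would be
# BM25 retrieval plus an NLI model, as in the steps above).
kb = {"Aristotle died in 322BC", "The first laptop was invented in 1980"}
step_score = lambda step: 1.0 if step in kb else -1.0
answer, total = select_faithful_answer(
    ["Aristotle died in 2000. The first laptop was invented in 1980. So the answer is yes.",
     "Aristotle died in 322BC. The first laptop was invented in 1980. So the answer is no."],
    step_score,
)
print(answer, total)
```

Because the scorer is injected, the same wrapper works whether the knowledge source is Wikipedia, Wikidata, or a trusted internal document index.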
Original Abstract
Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs.