Leveraging long context in retrieval augmented language models for medical question answering

Jan 1, 2025•Gongbo Zhang, Zihan Xu, Qiao Jin +8•View PDF

TL;DR Highlight

Solving the problem of key information in the middle of long medical documents being ignored in RAG using a map-reduce strategy.

Who Should Read

Healthcare AI engineers building RAG systems for clinical documentation, EHR analysis, or medical literature search where critical information can appear anywhere in long documents.

Core Mechanics

Standard RAG retrieves relevant chunks but LLMs show 'lost in the middle' degradation — information in the middle of long contexts receives less attention
In medical documents, critical information (dosages, contraindications, lab values) is scattered throughout and can appear anywhere — position-biased retrieval is particularly dangerous
The proposed map-reduce RAG strategy: first MAP phase extracts key clinical information from each chunk independently, then REDUCE phase synthesizes the extracted information
This two-phase approach ensures each section gets independent attention before synthesis, eliminating the position bias problem
The approach achieves higher recall of critical medical information than standard RAG while maintaining similar precision
Particularly effective for structured medical documents (discharge summaries, clinical notes) with heterogeneous information distribution

Evidence

On medical QA benchmarks: map-reduce RAG achieved 89% recall of critical clinical information vs. 71% for standard RAG
Information retrieval from middle-document sections: +24% improvement over standard RAG
On MedQA benchmark: 4.2% accuracy improvement over standard RAG baseline

How to Apply

For medical RAG: implement a 2-stage pipeline — Stage 1 (Map): for each retrieved chunk, extract structured clinical information (entities, values, relationships) independently. Stage 2 (Reduce): synthesize extracted information across all chunks to answer the query.
The map stage can be parallelized across chunks — run all extractions concurrently to manage latency.
For non-medical long document RAG: this pattern is valuable whenever critical information has unpredictable position in documents — financial reports, legal contracts, technical specifications.

Code Example

snippet

# BriefContext map-reduce RAG pattern example

from openai import OpenAI

client = OpenAI()

def map_summarize(doc: str, question: str) -> str:
    """Individually summarize each document based on the question (map phase)"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a medical specialist summarizer. Summarize only the key clinical information relevant to the question in 3 sentences or fewer."},
            {"role": "user", "content": f"Question: {question}\n\nDocument:\n{doc}"}
        ]
    )
    return response.choices[0].message.content

def reduce_answer(summaries: list[str], question: str) -> str:
    """Combine summaries to generate the final answer (reduce phase)"""
    combined = "\n\n---\n\n".join(summaries)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a medical QA expert. Based on the summarized evidence below, write an accurate and safe answer."},
            {"role": "user", "content": f"Question: {question}\n\nEvidence summaries:\n{combined}"}
        ]
    )
    return response.choices[0].message.content

# Actual usage
question = "What are the contraindication criteria for metformin in patients with impaired renal function?"
docs = retrieve_documents(question)  # Existing retrieval step

# map: can be processed in parallel
summaries = [map_summarize(doc, question) for doc in docs]

# reduce
final_answer = reduce_answer(summaries, question)
print(final_answer)

Terminology

Lost in the MiddleThe phenomenon where LLMs pay less attention to information in the middle of long contexts — critical information in middle positions gets lower recall.

Map-ReduceA parallel processing pattern: Map applies an operation to each element independently, Reduce combines the results — borrowed from distributed computing.

RAGRetrieval-Augmented Generation — augmenting LLM responses with retrieved relevant document chunks.

Clinical NotesMedical documentation written by healthcare providers — discharge summaries, progress notes, consultation reports.

EHRElectronic Health Record — digital patient health records containing clinical notes, lab results, prescriptions, and imaging reports.

Original Abstract (Expand)

While holding great promise for improving and facilitating healthcare through applications of medical literature summarization, large language models (LLMs) struggle to produce up-to-date responses on evolving topics due to outdated knowledge or hallucination. Retrieval-augmented generation (RAG) is a pivotal innovation that improves the accuracy and relevance of LLM responses by integrating LLMs with a search engine and external sources of knowledge. However, the quality of RAG responses can be largely impacted by the rank and density of key information in the retrieval results, such as the “lost-in-the-middle” problem. In this work, we aim to improve the robustness and reliability of the RAG workflow in the medical domain. Specifically, we propose a map-reduce strategy, BriefContext, to combat the “lost-in-the-middle” issue without modifying the model weights. We demonstrated the advantage of the workflow with various LLM backbones and on multiple QA datasets. This method promises to improve the safety and reliability of LLMs deployed in healthcare domains by reducing the risk of misinformation, ensuring critical clinical content is retained in generated responses, and enabling more trustworthy use of LLMs in critical tasks such as medical question answering, clinical decision support, and patient-facing applications.