Synthesizing scientific literature with retrieval-augmented language models
TL;DR Highlight
A RAG-based scientific literature synthesis model that searches 45 million open-access papers and attaches citation sources.
Who Should Read
Researchers, academics, and R&D teams who need to synthesize large bodies of scientific literature quickly with verifiable citations.
Core Mechanics
- Indexes 45 million open-access papers and enables semantic search across the corpus
- Generates synthesized summaries grounded in retrieved papers with inline citations
- Outperforms general-purpose LLMs on scientific QA tasks where citations are required
- Retrieval pipeline uses dense embeddings + sparse BM25 hybrid search for high recall
- Citation accuracy (correctly attributing claims to the right papers) significantly higher than baseline RAG
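One common way to combine dense and BM25 rankings, as described above, is reciprocal rank fusion (RRF). The sketch below is illustrative, not the paper's actual pipeline; the document IDs and the constant k=60 are conventional placeholder choices:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score each doc by summing 1/(k + rank)
    over every ranked list it appears in, then sort by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from a dense retriever and a BM25 index.
dense_hits = ["paper_A", "paper_B", "paper_C"]
bm25_hits = ["paper_C", "paper_A", "paper_D"]
fused = rrf_fuse([dense_hits, bm25_hits])
```

Documents ranked highly by both retrievers (here, paper_A) float to the top, which is why RRF is a popular fusion rule when dense and sparse scores are not directly comparable.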
Evidence
- Evaluated on ScholarQABench (2,967 expert-written queries across computer science, physics, neuroscience and biomedicine); OpenScholar-8B outperforms GPT-4o by 6.1% in correctness on multi-paper synthesis, and its citation accuracy is on par with human experts, whereas GPT-4o hallucinates citations 78–90% of the time
- The 45M-paper corpus draws on major open-access sources (e.g. arXiv, PubMed and the Semantic Scholar open-access index)
- In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT-4o responses over expert-written ones 51% and 70% of the time, respectively, versus 32% for plain GPT-4o
How to Apply
- Use this system (or similar RAG pipelines) when you need literature-backed answers rather than LLM hallucinations about research.
- For your own RAG pipeline over scientific corpora, implement hybrid retrieval (dense + BM25) to improve recall on rare terms.
- Always surface citation links to users — grounding claims in actual papers dramatically improves trust and verifiability.
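A lightweight way to enforce the citation discipline described above is an automated check that every sentence in a generated answer carries at least one inline citation pointing at a real source. This is a minimal sketch with naive sentence splitting; the function name and the example answer are invented for illustration:

```python
import re

def check_citations(answer, num_sources):
    """Return (issue, sentence) pairs for sentences that have no inline
    [n] citation, or that cite an index outside 1..num_sources."""
    problems = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = [int(m) for m in re.findall(r"\[(\d+)\]", sent)]
        if not cited:
            problems.append(("uncited", sent))
        elif any(c < 1 or c > num_sources for c in cited):
            problems.append(("bad_ref", sent))
    return problems

answer = "RAG improves grounding [1]. Hybrid retrieval boosts recall."
issues = check_citations(answer, num_sources=2)
```

Here the second sentence would be flagged as uncited; in a pipeline, flagged sentences can be sent back to the model for revision or dropped before the answer is shown.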
Code Example
# OpenScholar self-feedback loop conceptual prompt example
system_prompt = """
You are a scientific literature synthesis assistant.
Given retrieved passages with citation keys, write a factual answer.
After drafting, review each claim and verify it is directly supported
by at least one cited passage. Remove or correct any unsupported claims.
"""
user_prompt = """
Query: {user_question}
Retrieved passages:
[1] {passage_1} (Source: {paper_1_title}, {paper_1_year})
[2] {passage_2} (Source: {paper_2_title}, {paper_2_year})
...
Step 1: Draft a synthesis answer with inline citations [1], [2], ...
Step 2: Self-check — does every sentence have a supporting citation?
If not, revise or remove that sentence.
Step 3: Output the final answer.
"""Terminology
Original Abstract
Scientific progress depends on the ability of researchers to synthesize the growing body of literature. Can large language models (LLMs) assist scientists in this task? Here we introduce OpenScholar, a specialized retrieval-augmented language model (LM) that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience and biomedicine. Despite being a smaller open model, OpenScholar-8B outperforms GPT-4o by 6.1% and PaperQA2 by 5.5% in correctness on a challenging multi-paper synthesis task from the new ScholarQABench. Although GPT-4o hallucinates citations 78–90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar’s data store, retriever and self-feedback inference loop improve off-the-shelf LMs: for instance, OpenScholar-GPT-4o improves the correctness of GPT-4o by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT-4o responses over expert-written ones 51% and 70% of the time, respectively, compared with 32% for GPT-4o. We open-source all artefacts, including our code, models, data store, datasets and a public demo.