Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews
TL;DR Highlight
Delegating include/exclude screening decisions in systematic reviews to an LLM can compress more than 83 hours of human work into under a day for about $157.
Who Should Read
Researchers conducting systematic literature reviews or meta-analyses who want to automate the paper screening phase without sacrificing accuracy.
Core Mechanics
- Systematic review paper screening (deciding which papers to include/exclude against eligibility criteria) is highly time-consuming: single-human abstract screening of 10,000 citations is estimated at more than 83 hours
- GPT-4-class LLMs can perform this screening with accuracy comparable to human reviewers when the prompt states the inclusion/exclusion criteria explicitly
- The paper's optimized prompts (GPT4-0125-preview) achieved weighted sensitivity of 97.7% and specificity of 85.2% in abstract screening across 10 systematic reviews (48,425 citations tested)
- Full-text screening performed similarly: weighted sensitivity 96.5%, specificity 91.2% across 12,690 freely available articles
- Prompt design matters: naive zero-shot prompts reached only ~49% sensitivity; Claude-3.5 and GPT4 variants performed comparably, while Gemini Pro and GPT3.5 underperformed
- The approach works best for clear, objective eligibility criteria — complex subjective criteria still require human judgment
Evidence
- Cost for 10,000 citations: under 1 day and $157.02 with the LLM vs. more than 83 person-hours and $1,666.67 for single-human abstract screening
- Abstract screening across 10 SRs: weighted sensitivity 97.7% (range 86.7%–100%), weighted specificity 85.2% (range 68.3%–95.9%)
- Full-text screening: weighted sensitivity 96.5% (range 89.7%–100%), weighted specificity 91.2% (range 80.7%–100%); zero-shot prompts dropped to ~49% sensitivity in both phases
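Sensitivity, specificity, and false-exclusion rate all fall out of a simple confusion-matrix calculation over the screening decisions. The sketch below uses made-up counts purely for illustration; none of these numbers are from the paper:

```python
# Illustrative screening-metric arithmetic (counts are invented, not the paper's data).

def screening_metrics(tp, fn, tn, fp):
    """tp/fn: truly includable papers kept/missed; tn/fp: truly excludable papers dropped/kept."""
    sensitivity = tp / (tp + fn)           # share of relevant papers correctly included
    specificity = tn / (tn + fp)           # share of irrelevant papers correctly excluded
    false_exclusion_rate = fn / (tp + fn)  # complement of sensitivity
    return sensitivity, specificity, false_exclusion_rate

# e.g. 95 relevant papers kept, 5 missed; 850 irrelevant dropped, 50 wrongly kept
sens, spec, fer = screening_metrics(tp=95, fn=5, tn=850, fp=50)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}, false exclusion={fer:.1%}")
```

Note that in screening, sensitivity is the metric to protect: a false exclusion removes a relevant paper from the review permanently, while a false inclusion only costs extra full-text reading.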
How to Apply
- For systematic review screening: provide the LLM with your exact PICO criteria (Population, Intervention, Comparison, Outcome) as a structured prompt, then screen each abstract with include/exclude/uncertain labels.
- Use 'uncertain' as a third class and send those for human review — typically 15-20% of papers, dramatically reducing human workload while ensuring coverage.
- Run a calibration set of ~50 manually screened papers through the LLM first to validate the agreement rate before trusting it on your full dataset.
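The calibration step above amounts to a small agreement check. The helper below is a minimal sketch of that check; the label names and the choice to route "UNCERTAIN" outputs to humans are illustrative assumptions, not the paper's protocol:

```python
# Hedged sketch: validate LLM decisions against a manually screened calibration set.

def calibration_agreement(human_labels, llm_labels):
    """Return (agreement rate on LLM-decided papers, fraction deferred to humans).
    'UNCERTAIN' LLM outputs go to human review, so they are excluded from agreement."""
    assert len(human_labels) == len(llm_labels)
    decided = [(h, l) for h, l in zip(human_labels, llm_labels) if l != "UNCERTAIN"]
    agree = sum(1 for h, l in decided if h == l)
    agreement_rate = agree / len(decided) if decided else 0.0
    deferral_rate = (len(llm_labels) - len(decided)) / len(llm_labels)
    return agreement_rate, deferral_rate

# Toy 5-paper calibration run
human = ["INCLUDE", "EXCLUDE", "EXCLUDE", "INCLUDE", "EXCLUDE"]
llm   = ["INCLUDE", "EXCLUDE", "UNCERTAIN", "INCLUDE", "INCLUDE"]
rate, deferral = calibration_agreement(human, llm)
print(f"agreement on decided: {rate:.0%}, deferred to humans: {deferral:.0%}")
```

If the calibration agreement rate is low, revise the criteria wording in the prompt before scaling up rather than accepting the disagreements.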
Code Example
# Criteria-based screening prompt template in systematic review style.
# Literal braces in the JSON spec are doubled ({{ }}) so str.format()
# does not treat them as placeholders.
SYSTEM_PROMPT = """
You are an expert research screener. Your task is to determine whether a given article meets the eligibility criteria for inclusion in a systematic review.

Eligibility Criteria:

INCLUSION:
- {inclusion_criterion_1}
- {inclusion_criterion_2}
- {inclusion_criterion_3}

EXCLUSION:
- {exclusion_criterion_1}
- {exclusion_criterion_2}

Instructions:
1. Read the abstract carefully.
2. Evaluate each criterion one by one.
3. Output your decision as JSON: {{"decision": "INCLUDE" or "EXCLUDE", "reason": "brief explanation", "confidence": "high/medium/low"}}
"""

USER_PROMPT = """
Please screen the following abstract:

Title: {article_title}
Abstract: {abstract_text}
"""

# Usage example (openai>=1.0 client interface)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def screen_abstract(title, abstract, inclusion_criteria, exclusion_criteria):
    system = SYSTEM_PROMPT.format(
        inclusion_criterion_1=inclusion_criteria[0],
        inclusion_criterion_2=inclusion_criteria[1],
        inclusion_criterion_3=inclusion_criteria[2] if len(inclusion_criteria) > 2 else "N/A",
        exclusion_criterion_1=exclusion_criteria[0],
        exclusion_criterion_2=exclusion_criteria[1] if len(exclusion_criteria) > 1 else "N/A",
    )
    user = USER_PROMPT.format(article_title=title, abstract_text=abstract)
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0,  # 0 recommended for reproducibility
    )
    return response.choices[0].message.content
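Because the prompt asks the model to answer in JSON, downstream code needs a tolerant parser. This sketch, including the choice to route malformed or low-confidence replies to human review as "UNCERTAIN", is an assumption layered on top of the template, not part of the paper's method:

```python
# Hedged sketch: parse the model's JSON decision; anything malformed or
# low-confidence is routed to human review as "UNCERTAIN".
import json

def parse_decision(raw: str) -> dict:
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"decision": "UNCERTAIN", "reason": "unparseable model output"}
    decision = str(parsed.get("decision", "")).upper()
    if decision not in {"INCLUDE", "EXCLUDE"} or parsed.get("confidence") == "low":
        decision = "UNCERTAIN"
    return {"decision": decision, "reason": parsed.get("reason", "")}

print(parse_decision('{"decision": "EXCLUDE", "reason": "wrong population", "confidence": "high"}'))
# → {'decision': 'EXCLUDE', 'reason': 'wrong population'}
```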
Original Abstract
BACKGROUND Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.
OBJECTIVE To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.
DESIGN Diagnostic test accuracy.
SETTING 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI).
PARTICIPANTS None.
MEASUREMENTS Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).
RESULTS Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.
LIMITATIONS Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles.
CONCLUSION A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.
PRIMARY FUNDING SOURCE None.