Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews
TL;DR Highlight
Delegating article include/exclude decisions to an LLM can turn an 83-hour screening job into a task finished within a day for $157.
Who Should Read
Developers or researchers in medicine and related fields considering automating large-scale literature review, especially those looking to build a criteria-based classification pipeline on top of LLMs.
Core Mechanics
- A zero-shot prompt reaches only 49% sensitivity, missing half of the eligible articles; prompt design is the crux
- With the optimized prompt, GPT4-0125-preview achieves 97.7% sensitivity and 85.2% specificity on abstract screening
- Full-text screening stays strong as well: 96.5% sensitivity, 91.2% specificity
- Claude-3.5 and the GPT4 variants perform similarly; Gemini Pro and GPT-3.5 underperform
- Screening 10,000 citations: a single human screener needs over 83 hours and $1,666, vs. under a day and $157 for the LLM
- Designed as a generic template structure applicable across 10 different systematic reviews (SRs)
Evidence
- GPT4-0125-preview with the optimized prompt: weighted sensitivity 97.7%, specificity 85.2% on abstract screening (across 10 SRs)
- Full-text screening: weighted sensitivity 96.5%, specificity 91.2%
- Optimized prompt vs. zero-shot: sensitivity 49.0% → 97.7% (roughly doubled)
- Cost comparison: human $1,666.67 vs. LLM $157.02 (about 90% cheaper); time drops from 83+ hours to under a day
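The cost comparison above reduces to simple arithmetic; a minimal sketch using the figures reported in the abstract:

```python
# Screening costs for 10,000 citations, as reported in the study
HUMAN_COST_USD = 1666.67  # single human abstract screen, >83 hours
LLM_COST_USD = 157.02     # GPT4-0125-preview pipeline, under 1 day

# Relative cost reduction of the LLM approach vs. a human screener
savings_pct = (HUMAN_COST_USD - LLM_COST_USD) / HUMAN_COST_USD * 100
print(f"Cost reduction: {savings_pct:.1f}%")  # → Cost reduction: 90.6%
```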
How to Apply
- For criteria-based classification tasks (e.g., customer-feedback triage, application filtering), explicitly enumerating the eligibility criteria in the prompt instead of going zero-shot can raise sensitivity substantially
- When building a bulk document-screening pipeline, prefer the GPT-4 family or Claude-3.5 first; do not use GPT-3.5 or Gemini Pro without quality validation
- Providing the inclusion/exclusion criteria as a structured list and asking the model to output its rationale for each criterion improves reproducibility and auditability
Code Example
# Criteria-based screening prompt template in the style of a systematic review.
# Note: the literal braces in the JSON example below are doubled ({{ }}) so that
# str.format() does not mistake them for placeholders.
SYSTEM_PROMPT = """
You are an expert research screener. Your task is to determine whether a given article meets the eligibility criteria for inclusion in a systematic review.

Eligibility Criteria:

INCLUSION:
{inclusion_criteria}

EXCLUSION:
{exclusion_criteria}

Instructions:
1. Read the abstract carefully.
2. Evaluate each criterion one by one.
3. Output your decision as JSON: {{"decision": "INCLUDE" or "EXCLUDE", "reason": "brief explanation", "confidence": "high/medium/low"}}
"""

USER_PROMPT = """
Please screen the following abstract:

Title: {article_title}
Abstract: {abstract_text}
"""

# Usage example
from openai import OpenAI

client = OpenAI()

def screen_abstract(title, abstract, inclusion_criteria, exclusion_criteria):
    # Render each criteria list as a bulleted block, however many items it has
    system = SYSTEM_PROMPT.format(
        inclusion_criteria="\n".join(f"- {c}" for c in inclusion_criteria),
        exclusion_criteria="\n".join(f"- {c}" for c in exclusion_criteria),
    )
    user = USER_PROMPT.format(article_title=title, abstract_text=abstract)
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0,  # 0 recommended for reproducibility
    )
    return response.choices[0].message.content
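screen_abstract returns the model's raw text reply, so in practice you will want to parse and validate the JSON decision before acting on it. A minimal sketch: the field names and allowed values match the prompt template above, while the tolerant extraction (finding the outermost braces in case the model wraps its JSON in prose) is our own assumption about failure modes.

```python
import json

# Allowed values, mirroring the JSON schema requested in SYSTEM_PROMPT
VALID_DECISIONS = {"INCLUDE", "EXCLUDE"}
VALID_CONFIDENCE = {"high", "medium", "low"}

def parse_decision(raw: str) -> dict:
    """Parse the model's JSON reply, tolerating surrounding prose or code fences."""
    # Extract the outermost {...} span in case the model adds extra text
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object found in reply: {raw!r}")
    result = json.loads(raw[start : end + 1])
    if result.get("decision") not in VALID_DECISIONS:
        raise ValueError(f"unexpected decision: {result.get('decision')!r}")
    if result.get("confidence") not in VALID_CONFIDENCE:
        raise ValueError(f"unexpected confidence: {result.get('confidence')!r}")
    return result
```

Rejecting malformed replies loudly, rather than defaulting to EXCLUDE, matters here: in screening, silent false exclusions are exactly the errors the high-sensitivity prompt is designed to avoid.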
Original Abstract
BACKGROUND Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis. OBJECTIVE To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews. DESIGN Diagnostic test accuracy. SETTING 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI). PARTICIPANTS None. MEASUREMENTS Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity). RESULTS Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD. LIMITATIONS Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles. 
CONCLUSION A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences. PRIMARY FUNDING SOURCE None.