Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews
TL;DR Highlight
Delegating article include/exclude decisions to an LLM can turn an 83-hour screening job into a task finished within a day for $157.
Who Should Read
Developers or researchers in medicine and related fields considering automating large-scale literature review, especially those looking to build a criteria-based classification pipeline on top of LLMs.
Core Mechanics
- A zero-shot prompt reaches only 49% sensitivity, missing half of the eligible articles; prompt design is the crux
- With the optimized prompt, GPT4-0125-preview achieves 97.7% sensitivity and 85.2% specificity on abstract screening
- Full-text screening stays strong as well: 96.5% sensitivity, 91.2% specificity
- Claude-3.5 and the GPT4 variants perform similarly; Gemini Pro and GPT-3.5 underperform
- Screening 10,000 citations: a single human screener needs over 83 hours and $1,666, vs. under a day and $157 for the LLM
- Designed as a generic template structure applicable across 10 different systematic reviews (SRs)
Evidence
- GPT4-0125-preview with the optimized prompt: weighted sensitivity 97.7%, specificity 85.2% on abstract screening (across 10 SRs)
- Full-text screening: weighted sensitivity 96.5%, specificity 91.2%
- Optimized prompt vs. zero-shot: sensitivity 49.0% → 97.7% (roughly doubled)
- Cost comparison: human $1,666.67 vs. LLM $157.02 (about 90% cheaper); time drops from 83+ hours to under a day
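The cost comparison above reduces to simple arithmetic; a minimal sketch using the figures reported in the abstract:

```python
# Screening costs for 10,000 citations, as reported in the study
HUMAN_COST_USD = 1666.67  # single human abstract screen, >83 hours
LLM_COST_USD = 157.02     # GPT4-0125-preview pipeline, under 1 day

# Relative cost reduction of the LLM approach vs. a human screener
savings_pct = (HUMAN_COST_USD - LLM_COST_USD) / HUMAN_COST_USD * 100
print(f"Cost reduction: {savings_pct:.1f}%")  # → Cost reduction: 90.6%
```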
How to Apply
- For criteria-based classification tasks (e.g., customer-feedback triage, application filtering), explicitly enumerating the eligibility criteria in the prompt instead of going zero-shot can raise sensitivity substantially
- When building a bulk document-screening pipeline, prefer the GPT-4 family or Claude-3.5 first; do not use GPT-3.5 or Gemini Pro without quality validation
- Providing the inclusion/exclusion criteria as a structured list and asking the model to output its rationale for each criterion improves reproducibility and auditability
Code Example
# Criteria-based screening prompt template in the style of a systematic review.
# Note: the literal braces in the JSON example below are doubled ({{ }}) so that
# str.format() does not mistake them for placeholders.
SYSTEM_PROMPT = """
You are an expert research screener. Your task is to determine whether a given article meets the eligibility criteria for inclusion in a systematic review.

Eligibility Criteria:

INCLUSION:
{inclusion_criteria}

EXCLUSION:
{exclusion_criteria}

Instructions:
1. Read the abstract carefully.
2. Evaluate each criterion one by one.
3. Output your decision as JSON: {{"decision": "INCLUDE" or "EXCLUDE", "reason": "brief explanation", "confidence": "high/medium/low"}}
"""

USER_PROMPT = """
Please screen the following abstract:

Title: {article_title}
Abstract: {abstract_text}
"""

# Usage example
from openai import OpenAI

client = OpenAI()

def screen_abstract(title, abstract, inclusion_criteria, exclusion_criteria):
    # Render each criteria list as a bulleted block, however many items it has
    system = SYSTEM_PROMPT.format(
        inclusion_criteria="\n".join(f"- {c}" for c in inclusion_criteria),
        exclusion_criteria="\n".join(f"- {c}" for c in exclusion_criteria),
    )
    user = USER_PROMPT.format(article_title=title, abstract_text=abstract)
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0,  # 0 recommended for reproducibility
    )
    return response.choices[0].message.content
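screen_abstract returns the model's raw text reply, so in practice you will want to parse and validate the JSON decision before acting on it. A minimal sketch: the field names and allowed values match the prompt template above, while the tolerant extraction (finding the outermost braces in case the model wraps its JSON in prose) is our own assumption about failure modes.

```python
import json

# Allowed values, mirroring the JSON schema requested in SYSTEM_PROMPT
VALID_DECISIONS = {"INCLUDE", "EXCLUDE"}
VALID_CONFIDENCE = {"high", "medium", "low"}

def parse_decision(raw: str) -> dict:
    """Parse the model's JSON reply, tolerating surrounding prose or code fences."""
    # Extract the outermost {...} span in case the model adds extra text
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object found in reply: {raw!r}")
    result = json.loads(raw[start : end + 1])
    if result.get("decision") not in VALID_DECISIONS:
        raise ValueError(f"unexpected decision: {result.get('decision')!r}")
    if result.get("confidence") not in VALID_CONFIDENCE:
        raise ValueError(f"unexpected confidence: {result.get('confidence')!r}")
    return result
```

Rejecting malformed replies loudly, rather than defaulting to EXCLUDE, matters here: in screening, silent false exclusions are exactly the errors the high-sensitivity prompt is designed to avoid.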
Original Abstract
BACKGROUND Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis. OBJECTIVE To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews. DESIGN Diagnostic test accuracy. SETTING 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI). PARTICIPANTS None. MEASUREMENTS Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity). RESULTS Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening and weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10 000 citations differed substantially: Where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD. LIMITATIONS Further prompt optimizations may exist. Retrospective study. Convenience sample of SRs. Full-text screening evaluations were limited to free PubMed Central full-text articles. 
CONCLUSION A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences. PRIMARY FUNDING SOURCE None.