ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

Mar 24, 2026 • Muhammad Khalid, Manuel Oriol, Yilmaz Uygun

TL;DR Highlight

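Running GPT-4, Claude-3, and Groq-hosted Llama side by side to extract software requirements automatically reaches an F1 of 0.88 and cuts analysis time by 78% compared with manual analysis.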

Who Should Read

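PMs and backend developers who repeatedly analyze software requirements documents (RFPs, proposals, technical specifications). It is especially relevant for teams that want to reduce incorrect requirement extraction caused by LLM hallucination.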

Core Mechanics

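Structuring prompts around the four PEGS categories (Project/Environment/Goals/System) raises F1 from 0.71 to 0.88 compared with a generic "extract everything" prompt.
Running GPT-4, Claude-3, and Groq/Llama in parallel and merging their outputs by consensus (voting) is more accurate than a single model (GPT-4 alone F1 0.81 → multi-provider 0.88).
Requirements that multiple models agree on have an 8% false-positive rate, while requirements extracted by only one model have a 34% rate, so consensus acts as a hallucination filter.
Parallel mode cuts response latency from 4.2 s to 1.2 s, a 71% reduction relative to sequential processing.
Cost per requirement drops 47%, from $0.082 to $0.043, by routing simple classification to the cheaper Groq/Llama.
The Environment category has the lowest F1 at 0.79, mainly because implicit requirements buried in legal clauses are missed.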
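
The parallel fan-out and consensus steps listed above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the three `call_*` functions are hypothetical stubs standing in for real provider calls, and the `normalize` matching rule and the `min_votes=2` threshold are assumptions chosen for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs standing in for real provider calls.
# Each returns a list of extracted requirement strings.
def call_gpt4(text: str) -> list[str]:
    return ["Users must log in with SSO", "Reports export to PDF"]

def call_claude(text: str) -> list[str]:
    return ["Users must log in with SSO", "Reports export to PDF", "Data stays in the EU"]

def call_llama(text: str) -> list[str]:
    return ["Users must log in with SSO", "System supports offline mode"]

PROVIDERS = {"gpt-4": call_gpt4, "claude-3": call_claude, "groq/llama": call_llama}

def normalize(req: str) -> str:
    """Assumed matching rule: case-insensitive match after whitespace cleanup."""
    return " ".join(req.lower().split())

def extract_with_consensus(text: str, min_votes: int = 2) -> dict:
    # Fan out to all providers at once, so wall-clock time tracks the slowest call
    # instead of the sum of all calls.
    with ThreadPoolExecutor(max_workers=len(PROVIDERS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in PROVIDERS.items()}
        outputs = {name: future.result() for name, future in futures.items()}

    # One vote per provider per requirement (the set removes within-provider repeats).
    votes = Counter()
    for reqs in outputs.values():
        votes.update({normalize(r) for r in reqs})

    accepted = sorted(r for r, n in votes.items() if n >= min_votes)
    needs_review = sorted(r for r, n in votes.items() if n < min_votes)
    return {"accepted": accepted, "needs_review": needs_review}

result = extract_with_consensus("sample document text")
print(result["accepted"])
print(result["needs_review"])
```

With these stubs, the two requirements reported by at least two providers are accepted and the two single-provider items are held for review. A real pipeline would likely replace exact matching with a similarity measure, since models rarely phrase the same requirement identically.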

Evidence

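PEGS-structured prompts reach F1 0.88 versus 0.71 for generic prompts, an absolute difference of +0.17 (ablation test under the same multi-provider configuration).
Across 1,050 requirements, analysis time drops 78% compared with manual analysis (manual 4.9 min per requirement → automated 1.1 min).
Multi-provider consensus has an 8% false-positive rate versus 34% for a single provider (based on 200 samples).
PEGS category coverage rises from 61.3% with generic prompts to 92.0% with PEGS prompts, a gain of 30.7 percentage points.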
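
The percentages quoted in this summary can be rechecked from the before/after figures with a few lines of arithmetic. The sketch below also shows how an F1 score is computed; the precision and recall passed to `f1_score` are made-up illustrative values, since the summary does not report them.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

# Before/after figures as reported in the summary.
time_cut = relative_reduction(4.9, 1.1)      # minutes per requirement, manual vs automated
latency_cut = relative_reduction(4.2, 1.2)   # seconds, sequential vs parallel
cost_cut = relative_reduction(0.082, 0.043)  # USD per requirement
coverage_gain = 92.0 - 61.3                  # percentage points
f1_gain = 0.88 - 0.71                        # absolute F1 difference

print(f"time {time_cut:.1%}, latency {latency_cut:.1%}, cost {cost_cut:.1%}")
print(f"coverage +{coverage_gain:.1f} pp, F1 +{f1_gain:.2f}")

# Illustrative values only, not taken from the paper.
example_f1 = f1_score(precision=0.5, recall=1.0)
```

The time and latency figures round to the reported 78% and 71%. The cost reduction comes out at about 47.6%, which the summary states as 47%.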

How to Apply

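Do not write a single requirements-extraction prompt. Split it into four prompts for Project, Environment, Goals, and System, and send each one separately. Instead of "extract all requirements", ask for something like "extract the Project-related stakeholders, budget, and schedule constraints".
For important requirements, send the same input to both GPT-4 and Claude-3, compare the two results, and flag any item that only one model extracted as "needs review". If cost is a concern, run the first pass on Groq/Llama and reprocess only the low-confidence items with a stronger model.
Attach the source page and PEGS category as metadata to every requirement extracted from a document. This makes it easier to link requirements to test cases and design documents later (traceability).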
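
One possible shape for the metadata and review flag described above is sketched here. The `RequirementRecord` fields, the ID scheme, and the "fewer than two models" review rule are assumptions made for illustration; the source does not prescribe a schema.

```python
from dataclasses import dataclass, field

PEGS_CATEGORIES = ("Project", "Environment", "Goals", "System")

@dataclass
class RequirementRecord:
    """Hypothetical record that keeps each requirement traceable to its source."""
    req_id: str
    text: str
    pegs_category: str
    source_document: str
    source_page: int
    found_by: list[str] = field(default_factory=list)      # models that extracted it
    linked_tests: list[str] = field(default_factory=list)  # downstream test case names

    def __post_init__(self) -> None:
        if self.pegs_category not in PEGS_CATEGORIES:
            raise ValueError(f"unknown PEGS category: {self.pegs_category}")

    @property
    def needs_review(self) -> bool:
        # Assumed rule: flag anything that fewer than two models agreed on.
        return len(self.found_by) < 2

def escalation_queue(records: list[RequirementRecord]) -> list[str]:
    """IDs of records to resend to a stronger model or a human reviewer."""
    return [r.req_id for r in records if r.needs_review]

records = [
    RequirementRecord("REQ-001", "Users must log in with SSO", "System",
                      "rfp.pdf", 12, ["gpt-4", "claude-3"]),
    RequirementRecord("REQ-002", "Data must stay in the EU", "Environment",
                      "rfp.pdf", 31, ["groq/llama"]),
]
records[0].linked_tests.append("test_sso_login")
print(escalation_queue(records))
```

Keeping `found_by` on the record means the review flag is derived from the data instead of being stored separately, so it stays correct if a later pass adds another model's agreement.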

Code Example

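# Example of structured prompts per PEGS category (Python)
# Assumes the openai package (v1 or later) and OPENAI_API_KEY set in the environment.
import json

import openai

PEGS_PROMPTS = {
    "Project": "Extract requirements about project stakeholders, budget/schedule constraints, and organizational context from the following document.",
    "Environment": "Extract requirements about external system interfaces, regulatory obligations, and the operating environment from the following document.",
    "Goals": "Extract requirements about business goals, success criteria, and user expectations from the following document.",
    "System": "Extract requirements about functional specifications, non-functional requirements, and quality attributes from the following document."
}

JSON_FORMAT = 'Return a JSON array: [{"requirement": "...", "priority": "High/Medium/Low"}]'

def extract_requirements_pegs(document_text: str, model: str = "gpt-4") -> dict:
    """Send one prompt per PEGS category and parse each reply as a list of dicts."""
    results = {}
    for category, prompt in PEGS_PROMPTS.items():
        full_prompt = f"{prompt}\n\nDocument:\n{document_text}\n\n{JSON_FORMAT}"
        response = openai.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": full_prompt}]
        )
        content = response.choices[0].message.content or ""
        try:
            results[category] = json.loads(content)
        except json.JSONDecodeError:
            results[category] = []  # the model did not return valid JSON
    return results

def consensus_merge(results_gpt4: list, results_claude: list, threshold=0.5) -> list:
    """Compare two models' results and flag items that only one model extracted."""
    merged, matched = [], set()
    for req in results_gpt4:
        # Crude text match on the first 30 characters (in practice, use cosine similarity)
        key = req["requirement"][:30]
        hits = [i for i, r in enumerate(results_claude) if key in r["requirement"]]
        matched.update(hits)
        confidence = 1.0 if hits else 0.3
        merged.append({**req, "confidence": confidence, "needs_review": confidence < threshold})
    # Keep items that only Claude extracted, so they are flagged instead of dropped
    for i, req in enumerate(results_claude):
        if i not in matched:
            merged.append({**req, "confidence": 0.3, "needs_review": 0.3 < threshold})
    return merged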

Terminology

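PEGS: A framework that organizes software requirements into four areas: Project (project context), Environment (external environment), Goals (objectives), and System (system specification). It is similar to planning a house in the order "site conditions / neighborhood regulations / desired lifestyle / room layout".

hallucination: The phenomenon where an LLM generates content that does not exist as if it were real. Here, that means inventing requirements that are not in the document.

consensus mechanism: A way of producing a final answer by collecting the outputs of several models and combining them by majority vote or weighted average. Like a jury, only what several members agree on is adopted.

F1 Score: The harmonic mean of precision (the share of items marked correct that really are correct) and recall (the share of actual requirements that were found). The closer to 1, the better.

ablation study: An experiment that removes one component from a system to verify how much it matters. It answers "how much worse does it get without this feature?".

traceability: The ability to track which document a requirement came from and which tests or code it later connects to. In bug tracking, it lets you ask "which requirement did this bug originate from?".

cosine similarity: A method that measures how similar two texts are by the angle between their vectors. For typical non-negative text vectors, 0 means nothing in common and 1 means the same direction. Used to filter out duplicate requirements.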
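
As a rough illustration of the cosine similarity entry above, the sketch below compares two requirement strings using plain word counts. The bag-of-words tokenization and the 0.8 duplicate threshold are assumptions for the example; a production system would more likely use TF-IDF or embedding vectors.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a deliberately simple text vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: str, b: str) -> float:
    va, vb = bag_of_words(a), bag_of_words(b)
    # Dot product over the words both texts share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

def is_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Assumed rule: treat two requirements as duplicates above the threshold."""
    return cosine_similarity(a, b) >= threshold

same = is_duplicate("The system shall export reports as PDF",
                    "System shall export PDF reports")
different = is_duplicate("The system shall export reports as PDF",
                         "Users log in with single sign-on")
print(same, different)
```

Because word counts are never negative, the score here stays between 0 and 1. Word order is ignored, which is why the two differently phrased export requirements still score as near-duplicates.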

Related Resources

https://re-engineer-app-khalid.replit.app

Original Abstract

Requirements engineering is a vital, yet labor-intensive, stage in the software development process. This article introduces ReqFusion: an AI-enhanced system that automates the extraction, classification, and analysis of software requirements utilizing multiple Large Language Model (LLM) providers. The architecture of ReqFusion integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various documentation formats (PDF, DOCX, and PPTX) in academic, industrial, and tender proposal contexts. The system uses a domain-independent extraction method and generates requirements following the Project, Environment, Goal, and System (PEGS) approach introduced by Bertrand Meyer. The main idea is that, because the PEGS format is detailed, LLMs have more information and cues about the requirements, producing better results than a simple generic request. An ablation study confirms this hypothesis: PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting under the same multi-provider configuration. The evaluation used 18 real-world documents to generate 226 requirements through automated classification, with 54.9% functional and 45.1% nonfunctional across academic, business, and technical domains. An extended evaluation on five projects with 1,050 requirements demonstrated significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods. The multi-provider architecture enhances reliability through model consensus and fallback mechanisms, while the PEGS-based approach ensures comprehensive coverage of all requirement categories.