ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains
TL;DR Highlight
Running GPT-4, Claude-3, and Groq/Llama in parallel and merging their outputs by consensus extracts software requirements automatically at F1 0.88 and cuts analysis time by 78% versus manual review.
Who Should Read
PMs or backend developers who repeatedly analyze software requirements documents (RFPs, proposals, technical specifications). Especially teams looking to reduce incorrect requirement extraction caused by LLM hallucination.
Core Mechanics
- Structuring prompts into 4 PEGS categories (Project/Environment/Goals/System) improves F1 from 0.71 → 0.88 compared to generic 'extract everything' prompts
- Running GPT-4, Claude-3, and Groq/Llama in parallel and merging results via consensus (voting) yields higher accuracy than a single model (GPT-4 alone F1 0.81 → multi-model 0.88)
- Requirements agreed upon by multiple models have an 8% false positive rate, while requirements extracted by only one model have a 34% false positive rate — consensus acts as a hallucination filter
- Parallel processing reduces response latency by 71%, from 4.2s → 1.2s (compared to sequential processing)
- Cost is also reduced by 47%, from $0.082 → $0.043 per requirement (routing simple classification to Groq/Llama at lower cost)
- The Environment category has the lowest F1 at 0.79, primarily because implicit requirements buried in legal clauses are not captured
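The parallel fan-out and consensus voting described above can be sketched with Python's standard `concurrent.futures`. Note that `parallel_extract`, `vote`, and the provider callables are illustrative names assumed for this sketch, not the ReqFusion API; in a real setup each callable would wrap an SDK call to GPT-4, Claude-3, or Groq/Llama.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_extract(document_text: str, provider_calls: dict) -> dict:
    """Fan the same document out to several providers at once.

    provider_calls maps a provider name (e.g. "gpt-4", "claude-3",
    "groq/llama") to a callable that takes the document text and
    returns a list of extracted requirement strings.
    """
    with ThreadPoolExecutor(max_workers=len(provider_calls)) as pool:
        futures = {name: pool.submit(call, document_text)
                   for name, call in provider_calls.items()}
        return {name: fut.result() for name, fut in futures.items()}

def vote(results_by_provider: dict, min_votes: int = 2) -> list:
    """Keep only requirements extracted by at least min_votes providers.

    This is the consensus filter: single-model extractions (the 34%
    false-positive bucket) fail the vote and are dropped or flagged.
    """
    counts = {}
    for reqs in results_by_provider.values():
        for req in set(reqs):
            counts[req] = counts.get(req, 0) + 1
    return [req for req, n in counts.items() if n >= min_votes]
```

Because the three calls run concurrently, end-to-end latency is bounded by the slowest provider rather than the sum of all three, which is where the 4.2s → 1.2s improvement comes from.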
Evidence
- PEGS structured prompt F1 0.88 vs generic prompt F1 0.71, absolute difference +0.17 (ablation test under identical multi-provider settings)
- 78% reduction in analysis time compared to manual analysis, based on 1,050 requirements (manual 4.9 min/requirement → automated 1.1 min)
- Multi-provider consensus false positive rate 8% vs single provider 34% (based on 200 samples)
- PEGS category coverage improved from 61.3% (generic) → 92.0% (PEGS prompt), a 30.7 percentage point improvement
How to Apply
- Instead of using a single requirements extraction prompt, split it into 4 separate prompts for Project/Environment/Goals/System and send each independently. Rather than "extract all requirements," use prompts like "extract Project-related stakeholders, budget, and schedule constraints."
- For critical requirements, send the same input to both GPT-4 and Claude-3, compare the two results, and flag items extracted by only one model as "needs review."
- If cost is a concern, run the first pass with Groq/Llama and reprocess only low-confidence items with a higher-end model.
- Attach the source page and PEGS category as metadata to each extracted requirement; this makes it easier to link requirements to test cases and design documents later (traceability).
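The cost-tiered routing above (cheap first pass, escalate only low-confidence items) can be sketched as a small pure function. The name `tiered_extract` and the callable signatures are assumptions for illustration, not part of ReqFusion; the two callables stand in for a Groq/Llama call and a GPT-4 or Claude-3 call, each returning `{"requirement": ..., "confidence": ...}` dicts.

```python
def tiered_extract(document_text: str, cheap_call, strong_call,
                   threshold: float = 0.7) -> list:
    """First pass with the cheap model; re-run only low-confidence items.

    cheap_call and strong_call each take a text input and return a list
    of {"requirement": str, "confidence": float} dicts.
    """
    first_pass = cheap_call(document_text)
    # Keep high-confidence extractions from the cheap model as-is
    kept = [r for r in first_pass if r["confidence"] >= threshold]
    doubtful = [r for r in first_pass if r["confidence"] < threshold]
    if doubtful:
        # Re-extract only the uncertain requirements with the stronger
        # model, so the expensive call sees a much smaller input
        snippet = "\n".join(r["requirement"] for r in doubtful)
        kept.extend(strong_call(snippet))
    return kept
```

Because the strong model only reprocesses the doubtful subset, per-requirement cost drops roughly in proportion to the cheap model's confidence rate, which is consistent with the reported $0.082 → $0.043 figure.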
Code Example
# Example of PEGS category-based structured prompts (Python)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PEGS_PROMPTS = {
    "Project": "Extract requirements related to project stakeholders, budget/schedule constraints, and organizational context from the following document.",
    "Environment": "Extract requirements related to external system interfaces, regulatory requirements, and operational environment from the following document.",
    "Goals": "Extract requirements related to business objectives, success criteria, and user expectations from the following document.",
    "System": "Extract requirements related to functional specifications, non-functional requirements, and quality attributes from the following document.",
}

def extract_requirements_pegs(document_text: str, model: str = "gpt-4") -> dict:
    """Send one focused prompt per PEGS category and collect the raw replies."""
    results = {}
    for category, prompt in PEGS_PROMPTS.items():
        full_prompt = (
            f"{prompt}\n\nDocument content:\n{document_text}\n\n"
            'Return as a JSON array: [{"requirement": "...", "priority": "High/Medium/Low"}]'
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": full_prompt}],
        )
        results[category] = response.choices[0].message.content
    return results
def consensus_merge(results_gpt4: list, results_claude: list, threshold: float = 0.5) -> list:
    """Compare results from two models and flag items with low consensus."""
    merged = []
    for req in results_gpt4:
        # Simple prefix-containment check (in practice, use cosine similarity)
        found_in_claude = any(
            req["requirement"][:30] in r["requirement"] for r in results_claude
        )
        req["confidence"] = 1.0 if found_in_claude else 0.3
        req["needs_review"] = req["confidence"] < threshold
        merged.append(req)
    return merged
Terminology
Related Resources
Original Abstract
Requirements engineering is a vital, yet labor-intensive, stage in the software development process. This article introduces ReqFusion: an AI-enhanced system that automates the extraction, classification, and analysis of software requirements utilizing multiple Large Language Model (LLM) providers. The architecture of ReqFusion integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various documentation formats (PDF, DOCX, and PPTX) in academic, industrial, and tender proposal contexts. The system uses a domain-independent extraction method and generates requirements following the Project, Environment, Goal, and System (PEGS) approach introduced by Bertrand Meyer. The main idea is that, because the PEGS format is detailed, LLMs have more information and cues about the requirements, producing better results than a simple generic request. An ablation study confirms this hypothesis: PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting under the same multi-provider configuration. The evaluation used 18 real-world documents to generate 226 requirements through automated classification, with 54.9% functional and 45.1% nonfunctional across academic, business, and technical domains. An extended evaluation on five projects with 1,050 requirements demonstrated significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods. The multi-provider architecture enhances reliability through model consensus and fallback mechanisms, while the PEGS-based approach ensures comprehensive coverage of all requirement categories.