ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains
TL;DR Highlight
Running GPT-4, Claude-3, and Groq/Llama in parallel and merging their outputs by consensus extracts software requirements automatically at F1 0.88 and cuts analysis time by 78% versus manual review.
Who Should Read
PMs or backend developers who repeatedly analyze software requirements documents (RFPs, proposals, technical specifications). Especially teams looking to reduce incorrect requirement extraction caused by LLM hallucination.
Core Mechanics
- Structuring prompts into 4 PEGS categories (Project/Environment/Goals/System) improves F1 from 0.71 → 0.88 compared to generic 'extract everything' prompts
- Running GPT-4, Claude-3, and Groq/Llama in parallel and merging results via consensus (voting) yields higher accuracy than a single model (GPT-4 alone F1 0.81 → multi-model 0.88)
- Requirements agreed upon by multiple models have an 8% false positive rate, while requirements extracted by only one model have a 34% false positive rate — consensus acts as a hallucination filter
- Parallel processing reduces response latency by 71%, from 4.2s → 1.2s (compared to sequential processing)
- Cost is also reduced by 47%, from $0.082 → $0.043 per requirement (routing simple classification to Groq/Llama at lower cost)
- The Environment category has the lowest F1 at 0.79, primarily because implicit requirements buried in legal clauses are not captured
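The parallel fan-out and consensus voting described above can be sketched with Python's standard `concurrent.futures`. Note that `parallel_extract`, `vote`, and the provider callables are illustrative names assumed for this sketch, not the ReqFusion API; in a real setup each callable would wrap an SDK call to GPT-4, Claude-3, or Groq/Llama.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_extract(document_text: str, provider_calls: dict) -> dict:
    """Fan the same document out to several providers at once.

    provider_calls maps a provider name (e.g. "gpt-4", "claude-3",
    "groq/llama") to a callable that takes the document text and
    returns a list of extracted requirement strings.
    """
    with ThreadPoolExecutor(max_workers=len(provider_calls)) as pool:
        futures = {name: pool.submit(call, document_text)
                   for name, call in provider_calls.items()}
        return {name: fut.result() for name, fut in futures.items()}

def vote(results_by_provider: dict, min_votes: int = 2) -> list:
    """Keep only requirements extracted by at least min_votes providers.

    This is the consensus filter: single-model extractions (the 34%
    false-positive bucket) fail the vote and are dropped or flagged.
    """
    counts = {}
    for reqs in results_by_provider.values():
        for req in set(reqs):
            counts[req] = counts.get(req, 0) + 1
    return [req for req, n in counts.items() if n >= min_votes]
```

Because the three calls run concurrently, end-to-end latency is bounded by the slowest provider rather than the sum of all three, which is where the 4.2s → 1.2s improvement comes from.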
Evidence
- PEGS structured prompt F1 0.88 vs generic prompt F1 0.71, absolute difference +0.17 (ablation test under identical multi-provider settings)
- 78% reduction in analysis time compared to manual analysis, based on 1,050 requirements (manual 4.9 min/requirement → automated 1.1 min)
- Multi-provider consensus false positive rate 8% vs single provider 34% (based on 200 samples)
- PEGS category coverage improved from 61.3% (generic) → 92.0% (PEGS prompt), a 30.7 percentage point improvement
How to Apply
- Instead of using a single requirements extraction prompt, split it into 4 separate prompts for Project/Environment/Goals/System and send each independently. Rather than "extract all requirements," use prompts like "extract Project-related stakeholders, budget, and schedule constraints."
- For critical requirements, send the same input to both GPT-4 and Claude-3, compare the two results, and flag items extracted by only one model as "needs review."
- If cost is a concern, run the first pass with Groq/Llama and reprocess only low-confidence items with a higher-end model.
- Attach the source page and PEGS category as metadata to each extracted requirement; this makes it easier to link requirements to test cases and design documents later (traceability).
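The cost-tiered routing above (cheap first pass, escalate only low-confidence items) can be sketched as a small pure function. The name `tiered_extract` and the callable signatures are assumptions for illustration, not part of ReqFusion; the two callables stand in for a Groq/Llama call and a GPT-4 or Claude-3 call, each returning `{"requirement": ..., "confidence": ...}` dicts.

```python
def tiered_extract(document_text: str, cheap_call, strong_call,
                   threshold: float = 0.7) -> list:
    """First pass with the cheap model; re-run only low-confidence items.

    cheap_call and strong_call each take a text input and return a list
    of {"requirement": str, "confidence": float} dicts.
    """
    first_pass = cheap_call(document_text)
    # Keep high-confidence extractions from the cheap model as-is
    kept = [r for r in first_pass if r["confidence"] >= threshold]
    doubtful = [r for r in first_pass if r["confidence"] < threshold]
    if doubtful:
        # Re-extract only the uncertain requirements with the stronger
        # model, so the expensive call sees a much smaller input
        snippet = "\n".join(r["requirement"] for r in doubtful)
        kept.extend(strong_call(snippet))
    return kept
```

Because the strong model only reprocesses the doubtful subset, per-requirement cost drops roughly in proportion to the cheap model's confidence rate, which is consistent with the reported $0.082 → $0.043 figure.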
Code Example
# Example of PEGS category-based structured prompts (Python)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PEGS_PROMPTS = {
    "Project": "Extract requirements related to project stakeholders, budget/schedule constraints, and organizational context from the following document.",
    "Environment": "Extract requirements related to external system interfaces, regulatory requirements, and operational environment from the following document.",
    "Goals": "Extract requirements related to business objectives, success criteria, and user expectations from the following document.",
    "System": "Extract requirements related to functional specifications, non-functional requirements, and quality attributes from the following document.",
}

def extract_requirements_pegs(document_text: str, model: str = "gpt-4") -> dict:
    """Send one focused prompt per PEGS category and collect the raw replies."""
    results = {}
    for category, prompt in PEGS_PROMPTS.items():
        full_prompt = (
            f"{prompt}\n\nDocument content:\n{document_text}\n\n"
            'Return as a JSON array: [{"requirement": "...", "priority": "High/Medium/Low"}]'
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": full_prompt}],
        )
        results[category] = response.choices[0].message.content
    return results
def consensus_merge(results_gpt4: list, results_claude: list, threshold: float = 0.5) -> list:
    """Compare results from two models and flag items with low consensus."""
    merged = []
    for req in results_gpt4:
        # Simple prefix-containment check (in practice, use cosine similarity)
        found_in_claude = any(
            req["requirement"][:30] in r["requirement"] for r in results_claude
        )
        req["confidence"] = 1.0 if found_in_claude else 0.3
        req["needs_review"] = req["confidence"] < threshold
        merged.append(req)
    return merged
Terminology
Related Resources
Original Abstract
Requirements engineering is a vital, yet labor-intensive, stage in the software development process. This article introduces ReqFusion: an AI-enhanced system that automates the extraction, classification, and analysis of software requirements utilizing multiple Large Language Model (LLM) providers. The architecture of ReqFusion integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various documentation formats (PDF, DOCX, and PPTX) in academic, industrial, and tender proposal contexts. The system uses a domain-independent extraction method and generates requirements following the Project, Environment, Goal, and System (PEGS) approach introduced by Bertrand Meyer. The main idea is that, because the PEGS format is detailed, LLMs have more information and cues about the requirements, producing better results than a simple generic request. An ablation study confirms this hypothesis: PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting under the same multi-provider configuration. The evaluation used 18 real-world documents to generate 226 requirements through automated classification, with 54.9% functional and 45.1% nonfunctional across academic, business, and technical domains. An extended evaluation on five projects with 1,050 requirements demonstrated significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods. The multi-provider architecture enhances reliability through model consensus and fallback mechanisms, while the PEGS-based approach ensures comprehensive coverage of all requirement categories.