Enhancing diagnostic capability with multi-agents conversational large language models
TL;DR Highlight
Multiple AI doctor agents debating like a multidisciplinary team (MDT) meeting diagnose rare diseases more accurately than a single GPT-4.
Who Should Read
Healthcare AI researchers and clinical informatics teams exploring multi-agent systems for complex medical decision support.
Core Mechanics
- Multi-agent framework where specialized AI agents (generalist, specialist, devil's advocate) debate differential diagnoses
- Structured debate protocol: each agent proposes, critiques, and defends diagnoses across multiple rounds
- Outperforms single GPT-4 on rare disease diagnosis benchmarks, especially for complex multi-system conditions
- Devil's advocate agent reduces premature diagnostic convergence (anchoring bias)
- Final diagnosis is determined by structured consensus, not simple majority vote
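The propose–critique–defend structure above can be captured in a small data model. This is an illustrative sketch only; the class and field names are assumptions for exposition, not the paper's actual schema, and the ranking rule here (prefer unchallenged proposals) is a deliberately simple stand-in for the supervisor's integration step.

```python
# Illustrative data structures for one debate round: each agent files a
# proposal, other agents may attach critiques, and unchallenged proposals
# are preferred. All names/fields here are hypothetical, not from the paper.
from dataclasses import dataclass, field


@dataclass
class Proposal:
    agent: str        # e.g. "generalist", "neurologist", "devil's advocate"
    diagnosis: str
    reasoning: str


@dataclass
class DebateRound:
    proposals: list[Proposal] = field(default_factory=list)
    critiques: dict[str, str] = field(default_factory=dict)  # agent -> critique

    def leading_diagnosis(self) -> str:
        # Toy ranking: take the first proposal nobody has critiqued.
        # In the paper's protocol a supervisor agent integrates reasoning
        # instead of applying a mechanical rule like this.
        challenged = set(self.critiques)
        for p in self.proposals:
            if p.agent not in challenged:
                return p.diagnosis
        return self.proposals[0].diagnosis
```

A devil's advocate critique thus changes which hypothesis leads, which is the mechanism behind the anchoring-bias reduction claimed above.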
Evidence
- Evaluated on 302 rare disease cases (see the original abstract below)
- Top-1 and Top-3 diagnostic accuracy compared against single-shot GPT-4 and chain-of-thought prompting
- Multi-agent debate improved Top-1 accuracy by 8–12% on hard cases
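For readers reproducing the Top-1/Top-3 comparison on their own cases, the metric itself is straightforward: a case counts as a hit at k if the gold diagnosis appears in the model's top k ranked differentials. The helper and the example diagnoses below are illustrative, not data from the paper.

```python
# Hypothetical helper: Top-k diagnostic accuracy over ranked differential
# lists. The case data below is made up for illustration.
def top_k_accuracy(predictions: list[list[str]], truths: list[str], k: int) -> float:
    """predictions[i] is a ranked differential list for case i."""
    hits = sum(1 for ranked, truth in zip(predictions, truths) if truth in ranked[:k])
    return hits / len(truths)


ranked_lists = [
    ["polymyositis", "hypothyroid myopathy", "muscular dystrophy"],
    ["lupus", "dermatomyositis", "fibromyalgia"],
]
gold = ["polymyositis", "dermatomyositis"]
print(top_k_accuracy(ranked_lists, gold, k=1))  # 0.5: only case 1 correct at rank 1
print(top_k_accuracy(ranked_lists, gold, k=3))  # 1.0: both gold labels in the top 3
```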
How to Apply
- For high-stakes decisions with multiple plausible options, use a multi-agent debate structure rather than a single LLM query.
- Include an explicit 'devil's advocate' agent role to challenge the leading hypothesis and surface overlooked alternatives.
- Structure the debate with a fixed number of rounds and a consensus protocol to avoid endless loops.
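The three recommendations above can be combined into one loop: a round cap, a devil's advocate turn each round, and an early stop once the agents converge. This is a minimal sketch, not the paper's implementation; `ask(role, prompt)` is a hypothetical stand-in for an LLM call (e.g. the `doctor_agent` function in the code example below), and unanimity is used here as a deliberately simple consensus test.

```python
# Sketch of a fixed-round debate with an early-stop consensus check.
# `ask(role, prompt)` is a hypothetical LLM-call stand-in; swap in a real client.
from collections import Counter
from typing import Callable


def debate(case: str, roles: list[str], ask: Callable[[str, str], str],
           max_rounds: int = 3) -> str:
    transcript = f"Case: {case}"
    for _ in range(max_rounds):
        proposals = {role: ask(role, transcript) for role in roles}
        # Devil's advocate sees all proposals; its critique is appended so
        # later rounds must address it.
        critique = ask("devil's advocate",
                       transcript + "\nChallenge the leading hypothesis:\n"
                       + "\n".join(proposals.values()))
        transcript += "\n" + "\n".join(f"{r}: {p}" for r, p in proposals.items())
        transcript += f"\ndevil's advocate: {critique}"
        # Early stop: all roles converged on the same diagnosis.
        diagnosis, votes = Counter(proposals.values()).most_common(1)[0]
        if votes == len(roles):
            return diagnosis
    # No unanimity within the round cap: a supervisor integrates the transcript.
    return ask("supervisor", transcript + "\nProvide the final diagnosis.")
```

The round cap guarantees termination even when agents never agree, and the supervisor fallback mirrors the structured-consensus step described in Core Mechanics.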
Code Example
# Simple MAC pattern example using the OpenAI API
import openai

client = openai.OpenAI()

def doctor_agent(case: str, specialty: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"You are an experienced {specialty} physician. Analyze the case and suggest a diagnosis with reasoning."},
            {"role": "user", "content": case},
        ],
    )
    return response.choices[0].message.content

def supervisor_agent(case: str, doctor_opinions: list[str]) -> str:
    opinions_text = "\n\n".join(f"Doctor {i + 1}: {op}" for i, op in enumerate(doctor_opinions))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a senior supervising physician. Review all doctor opinions and provide a final integrated diagnosis."},
            {"role": "user", "content": f"Case:\n{case}\n\nDoctor Opinions:\n{opinions_text}\n\nProvide the final diagnosis."},
        ],
    )
    return response.choices[0].message.content

# Usage example
case = "28-year-old patient with progressive muscle weakness, fatigue, and elevated CK levels..."
specialties = ["neurologist", "rheumatologist", "internal medicine specialist", "geneticist"]

# Run four doctor agents, one per specialty
opinions = [doctor_agent(case, s) for s in specialties]

# Supervisor provides the final diagnosis
final_diagnosis = supervisor_agent(case, opinions)
print(final_diagnosis)
Original Abstract
Large Language Models (LLMs) show promise in healthcare tasks but face challenges in complex medical scenarios. We developed a Multi-Agent Conversation (MAC) framework for disease diagnosis, inspired by clinical Multi-Disciplinary Team discussions. Using 302 rare disease cases, we evaluated GPT-3.5, GPT-4, and MAC on medical knowledge and clinical reasoning. MAC outperformed single models in both primary and follow-up consultations, achieving higher accuracy in diagnoses and suggested tests. Optimal performance was achieved with four doctor agents and a supervisor agent, using GPT-4 as the base model. MAC demonstrated high consistency across repeated runs. Further comparative analysis showed MAC also outperformed other methods including Chain of Thoughts (CoT), Self-Refine, and Self-Consistency with higher performance and more output tokens. This framework significantly enhanced LLMs’ diagnostic capabilities, effectively bridging theoretical knowledge and practical clinical application. Our findings highlight the potential of multi-agent LLMs in healthcare and suggest further research into their clinical implementation.