Enhancing diagnostic capability with multi-agents conversational large language models
TL;DR Highlight
Multiple AI doctor agents debating like a multidisciplinary team (MDT) meeting diagnose rare diseases more accurately than a single GPT-4.
Who Should Read
Healthcare AI researchers and clinical informatics teams exploring multi-agent systems for complex medical decision support.
Core Mechanics
- Multi-agent framework where specialized AI agents (generalist, specialist, devil's advocate) debate differential diagnoses
- Structured debate protocol: each agent proposes, critiques, and defends diagnoses across multiple rounds
- Outperforms single GPT-4 on rare disease diagnosis benchmarks, especially for complex multi-system conditions
- Devil's advocate agent reduces premature diagnostic convergence (anchoring bias)
- Final diagnosis is determined by structured consensus, not simple majority vote
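The propose–critique–defend structure above can be captured in a small data model. This is an illustrative sketch only; the class and field names are assumptions for exposition, not the paper's actual schema, and the ranking rule here (prefer unchallenged proposals) is a deliberately simple stand-in for the supervisor's integration step.

```python
# Illustrative data structures for one debate round: each agent files a
# proposal, other agents may attach critiques, and unchallenged proposals
# are preferred. All names/fields here are hypothetical, not from the paper.
from dataclasses import dataclass, field


@dataclass
class Proposal:
    agent: str        # e.g. "generalist", "neurologist", "devil's advocate"
    diagnosis: str
    reasoning: str


@dataclass
class DebateRound:
    proposals: list[Proposal] = field(default_factory=list)
    critiques: dict[str, str] = field(default_factory=dict)  # agent -> critique

    def leading_diagnosis(self) -> str:
        # Toy ranking: take the first proposal nobody has critiqued.
        # In the paper's protocol a supervisor agent integrates reasoning
        # instead of applying a mechanical rule like this.
        challenged = set(self.critiques)
        for p in self.proposals:
            if p.agent not in challenged:
                return p.diagnosis
        return self.proposals[0].diagnosis
```

A devil's advocate critique thus changes which hypothesis leads, which is the mechanism behind the anchoring-bias reduction claimed above.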
Evidence
- Evaluated on 302 rare disease cases (see the original abstract below)
- Top-1 and Top-3 diagnostic accuracy compared against single-shot GPT-4 and chain-of-thought prompting
- Multi-agent debate improved Top-1 accuracy by 8–12% on hard cases
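For readers reproducing the Top-1/Top-3 comparison on their own cases, the metric itself is straightforward: a case counts as a hit at k if the gold diagnosis appears in the model's top k ranked differentials. The helper and the example diagnoses below are illustrative, not data from the paper.

```python
# Hypothetical helper: Top-k diagnostic accuracy over ranked differential
# lists. The case data below is made up for illustration.
def top_k_accuracy(predictions: list[list[str]], truths: list[str], k: int) -> float:
    """predictions[i] is a ranked differential list for case i."""
    hits = sum(1 for ranked, truth in zip(predictions, truths) if truth in ranked[:k])
    return hits / len(truths)


ranked_lists = [
    ["polymyositis", "hypothyroid myopathy", "muscular dystrophy"],
    ["lupus", "dermatomyositis", "fibromyalgia"],
]
gold = ["polymyositis", "dermatomyositis"]
print(top_k_accuracy(ranked_lists, gold, k=1))  # 0.5: only case 1 correct at rank 1
print(top_k_accuracy(ranked_lists, gold, k=3))  # 1.0: both gold labels in the top 3
```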
How to Apply
- For high-stakes decisions with multiple plausible options, use a multi-agent debate structure rather than a single LLM query.
- Include an explicit 'devil's advocate' agent role to challenge the leading hypothesis and surface overlooked alternatives.
- Structure the debate with a fixed number of rounds and a consensus protocol to avoid endless loops.
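The three recommendations above can be combined into one loop: a round cap, a devil's advocate turn each round, and an early stop once the agents converge. This is a minimal sketch, not the paper's implementation; `ask(role, prompt)` is a hypothetical stand-in for an LLM call (e.g. the `doctor_agent` function in the code example below), and unanimity is used here as a deliberately simple consensus test.

```python
# Sketch of a fixed-round debate with an early-stop consensus check.
# `ask(role, prompt)` is a hypothetical LLM-call stand-in; swap in a real client.
from collections import Counter
from typing import Callable


def debate(case: str, roles: list[str], ask: Callable[[str, str], str],
           max_rounds: int = 3) -> str:
    transcript = f"Case: {case}"
    for _ in range(max_rounds):
        proposals = {role: ask(role, transcript) for role in roles}
        # Devil's advocate sees all proposals; its critique is appended so
        # later rounds must address it.
        critique = ask("devil's advocate",
                       transcript + "\nChallenge the leading hypothesis:\n"
                       + "\n".join(proposals.values()))
        transcript += "\n" + "\n".join(f"{r}: {p}" for r, p in proposals.items())
        transcript += f"\ndevil's advocate: {critique}"
        # Early stop: all roles converged on the same diagnosis.
        diagnosis, votes = Counter(proposals.values()).most_common(1)[0]
        if votes == len(roles):
            return diagnosis
    # No unanimity within the round cap: a supervisor integrates the transcript.
    return ask("supervisor", transcript + "\nProvide the final diagnosis.")
```

The round cap guarantees termination even when agents never agree, and the supervisor fallback mirrors the structured-consensus step described in Core Mechanics.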
Code Example
# Simple MAC pattern example using the OpenAI API
import openai

client = openai.OpenAI()

def doctor_agent(case: str, specialty: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"You are an experienced {specialty} physician. Analyze the case and suggest a diagnosis with reasoning."},
            {"role": "user", "content": case},
        ],
    )
    return response.choices[0].message.content

def supervisor_agent(case: str, doctor_opinions: list[str]) -> str:
    opinions_text = "\n\n".join(f"Doctor {i + 1}: {op}" for i, op in enumerate(doctor_opinions))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a senior supervising physician. Review all doctor opinions and provide a final integrated diagnosis."},
            {"role": "user", "content": f"Case:\n{case}\n\nDoctor Opinions:\n{opinions_text}\n\nProvide the final diagnosis."},
        ],
    )
    return response.choices[0].message.content

# Usage example
case = "28-year-old patient with progressive muscle weakness, fatigue, and elevated CK levels..."
specialties = ["neurologist", "rheumatologist", "internal medicine specialist", "geneticist"]

# Run four doctor agents, one per specialty
opinions = [doctor_agent(case, s) for s in specialties]

# Supervisor provides the final diagnosis
final_diagnosis = supervisor_agent(case, opinions)
print(final_diagnosis)
Original Abstract
Large Language Models (LLMs) show promise in healthcare tasks but face challenges in complex medical scenarios. We developed a Multi-Agent Conversation (MAC) framework for disease diagnosis, inspired by clinical Multi-Disciplinary Team discussions. Using 302 rare disease cases, we evaluated GPT-3.5, GPT-4, and MAC on medical knowledge and clinical reasoning. MAC outperformed single models in both primary and follow-up consultations, achieving higher accuracy in diagnoses and suggested tests. Optimal performance was achieved with four doctor agents and a supervisor agent, using GPT-4 as the base model. MAC demonstrated high consistency across repeated runs. Further comparative analysis showed MAC also outperformed other methods including Chain of Thoughts (CoT), Self-Refine, and Self-Consistency with higher performance and more output tokens. This framework significantly enhanced LLMs’ diagnostic capabilities, effectively bridging theoretical knowledge and practical clinical application. Our findings highlight the potential of multi-agent LLMs in healthcare and suggest further research into their clinical implementation.