Mixture-of-Agents Enhances Large Language Model Capabilities
TL;DR Highlight
An ensemble technique that stacks multiple open-source LLMs in a layered architecture, outperforming GPT-4o without any fine-tuning.
Who Should Read
AI engineers or backend developers who want to improve response quality by combining multiple LLMs — especially those managing production LLM pipelines who need to maximize the performance-to-cost ratio.
Core Mechanics
- LLMs have 'collaborativeness' — their quality improves when referencing other models' responses, even if those models are weaker
- MoA structure: multiple LLMs (Proposers) in each layer independently generate responses → next layer LLMs (Aggregators) synthesize → repeat N times
- Achieves 65.1% on AlpacaEval 2.0 with open-source models only — 7.6%p above GPT-4o (57.5%)
- Works with prompt interface alone without fine-tuning — any LLM can be plugged in
- Model diversity is key: combining different models consistently outperforms using the same model multiple times (n=6 multi-proposer 61.3% vs single-proposer 56.7%)
- WizardLM is best as Proposer but not suitable as Aggregator — model characteristics differ by role
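The layered Proposer → Aggregator flow described above can be sketched structurally with stub functions standing in for real LLM calls (all names here are illustrative, not from the paper; a working API-based version appears in the Code Example below):

```python
# Structural sketch of MoA: each layer's models see all outputs from the
# previous layer; a final aggregator synthesizes the last set of answers.
def make_stub(name):
    # Stand-in for an LLM call; `prior` is the previous layer's responses.
    def model(query, prior=None):
        note = f" [saw {len(prior)} prior answers]" if prior else ""
        return f"{name}: answer to '{query}'{note}"
    return model

proposers = [make_stub("A"), make_stub("B"), make_stub("C")]
aggregator = make_stub("AGG")

def moa(query, num_layers=2):
    prior = None
    for _ in range(num_layers - 1):
        # Every proposer in this layer references all prior-layer outputs
        prior = [p(query, prior) for p in proposers]
    return aggregator(query, prior)

print(moa("What is MoA?"))
```

With num_layers=2 the proposers run once and the aggregator sees all three answers; raising num_layers repeats the propose-and-reference step before aggregation.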
Evidence
- AlpacaEval 2.0 LC win rate: MoA 65.1%, MoA w/ GPT-4o 65.7% vs GPT-4o 57.5% (8.2%p difference)
- MT-Bench: MoA 9.25, MoA w/ GPT-4o 9.40 vs GPT-4 Turbo 9.31, GPT-4o 9.19
- Cost efficiency: MoA-Lite achieves AlpacaEval 59.3% at same cost as GPT-4o — 4% higher than GPT-4 Turbo at less than half the cost
- MATH dataset: Llama-3-70B-Instruct reasoning accuracy moves 0.456 → 0.584 → 0.578 across layers 1→3 — most of the gain arrives by layer 2, with a slight dip at layer 3
How to Apply
- Start directly with Together AI API: call Qwen1.5-110B, LLaMA-3-70B, WizardLM-8x22B, Mixtral-8x22B, dbrx as Proposers in parallel, then synthesize with Aggregate-and-Synthesize prompt
- For budget-conscious use, the MoA-Lite strategy: 2 layers + cheap Aggregator (Qwen1.5-72B) still achieves GPT-4o level or above
- Aggregator selection criteria: place models with strong synthesis ability (Qwen1.5-110B, GPT-4o) at the last layer, use models strong at providing diverse perspectives like WizardLM only as Proposers
Code Example
import asyncio
from openai import AsyncOpenAI
# Aggregate-and-Synthesize prompt (Paper Table 1)
AGGREGATE_PROMPT = """You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.
Responses from models:
{responses}"""
client = AsyncOpenAI(base_url="https://api.together.xyz/v1", api_key="YOUR_KEY")
PROPOSER_MODELS = [
    "Qwen/Qwen1.5-72B-Chat",
    "meta-llama/Llama-3-70b-chat-hf",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
]
AGGREGATOR_MODEL = "Qwen/Qwen1.5-110B-Chat"
async def get_response(model, messages):
    resp = await client.chat.completions.create(
        model=model, messages=messages, max_tokens=2048
    )
    return resp.choices[0].message.content

async def moa_pipeline(user_query: str, num_layers: int = 2):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(num_layers - 1):
        # Call Proposers in parallel
        responses = await asyncio.gather(
            *[get_response(m, messages) for m in PROPOSER_MODELS]
        )
        # Inject this layer's responses into the next layer via the
        # Aggregate-and-Synthesize prompt as the system message
        formatted = "\n".join(f"{i+1}. {r}" for i, r in enumerate(responses))
        aggregate_content = AGGREGATE_PROMPT.format(responses=formatted)
        messages = [
            {"role": "system", "content": aggregate_content},
            {"role": "user", "content": user_query},
        ]
    # Final layer: Aggregator synthesizes the last set of responses
    final = await get_response(AGGREGATOR_MODEL, messages)
    return final
# Run
result = asyncio.run(moa_pipeline("Explain the concept of mixture of experts."))
print(result)
Original Abstract
Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.