Mixture-of-Agents Enhances Large Language Model Capabilities
TL;DR Highlight
An ensemble technique that stacks multiple open-source LLMs in a layered architecture, outperforming GPT-4o without any fine-tuning.
Who Should Read
AI engineers or backend developers who want to improve response quality by combining multiple LLMs — especially those managing production LLM pipelines who need to maximize the performance-to-cost ratio.
Core Mechanics
- LLMs have 'collaborativeness' — their quality improves when referencing other models' responses, even if those models are weaker
- MoA structure: multiple LLMs (Proposers) in each layer independently generate responses → next layer LLMs (Aggregators) synthesize → repeat N times
- Achieves 65.1% on AlpacaEval 2.0 with open-source models only — 7.6%p above GPT-4o (57.5%)
- Works with prompt interface alone without fine-tuning — any LLM can be plugged in
- Model diversity is key: combining different models consistently outperforms using the same model multiple times (n=6 multi-proposer 61.3% vs single-proposer 56.7%)
- WizardLM is best as Proposer but not suitable as Aggregator — model characteristics differ by role
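The layered Proposer → Aggregator flow described above can be sketched structurally with stub functions standing in for real LLM calls (all names here are illustrative, not from the paper; a working API-based version appears in the Code Example below):

```python
# Structural sketch of MoA: each layer's models see all outputs from the
# previous layer; a final aggregator synthesizes the last set of answers.
def make_stub(name):
    # Stand-in for an LLM call; `prior` is the previous layer's responses.
    def model(query, prior=None):
        note = f" [saw {len(prior)} prior answers]" if prior else ""
        return f"{name}: answer to '{query}'{note}"
    return model

proposers = [make_stub("A"), make_stub("B"), make_stub("C")]
aggregator = make_stub("AGG")

def moa(query, num_layers=2):
    prior = None
    for _ in range(num_layers - 1):
        # Every proposer in this layer references all prior-layer outputs
        prior = [p(query, prior) for p in proposers]
    return aggregator(query, prior)

print(moa("What is MoA?"))
```

With num_layers=2 the proposers run once and the aggregator sees all three answers; raising num_layers repeats the propose-and-reference step before aggregation.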
Evidence
- AlpacaEval 2.0 LC win rate: MoA 65.1%, MoA w/ GPT-4o 65.7% vs GPT-4o 57.5% (8.2%p difference)
- MT-Bench: MoA 9.25, MoA w/ GPT-4o 9.40 vs GPT-4 Turbo 9.31, GPT-4o 9.19
- Cost efficiency: MoA-Lite achieves AlpacaEval 59.3% at same cost as GPT-4o — 4% higher than GPT-4 Turbo at less than half the cost
- MATH dataset: Llama-3-70B-Instruct reasoning accuracy moves 0.456 → 0.584 → 0.578 across layers 1→3 — most of the gain arrives by layer 2, with a slight dip at layer 3
How to Apply
- Start directly with Together AI API: call Qwen1.5-110B, LLaMA-3-70B, WizardLM-8x22B, Mixtral-8x22B, dbrx as Proposers in parallel, then synthesize with Aggregate-and-Synthesize prompt
- For budget-conscious use, the MoA-Lite strategy: 2 layers + cheap Aggregator (Qwen1.5-72B) still achieves GPT-4o level or above
- Aggregator selection criteria: place models with strong synthesis ability (Qwen1.5-110B, GPT-4o) at the last layer, use models strong at providing diverse perspectives like WizardLM only as Proposers
Code Example
import asyncio
from openai import AsyncOpenAI
# Aggregate-and-Synthesize prompt (Paper Table 1)
AGGREGATE_PROMPT = """You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.
Responses from models:
{responses}"""
client = AsyncOpenAI(base_url="https://api.together.xyz/v1", api_key="YOUR_KEY")
PROPOSER_MODELS = [
    "Qwen/Qwen1.5-72B-Chat",
    "meta-llama/Llama-3-70b-chat-hf",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
]
AGGREGATOR_MODEL = "Qwen/Qwen1.5-110B-Chat"
async def get_response(model, messages):
    resp = await client.chat.completions.create(
        model=model, messages=messages, max_tokens=2048
    )
    return resp.choices[0].message.content

async def moa_pipeline(user_query: str, num_layers: int = 2):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(num_layers - 1):
        # Call Proposers in parallel
        responses = await asyncio.gather(
            *[get_response(m, messages) for m in PROPOSER_MODELS]
        )
        # Inject this layer's responses into the next layer via the
        # Aggregate-and-Synthesize prompt as the system message
        formatted = "\n".join(f"{i+1}. {r}" for i, r in enumerate(responses))
        aggregate_content = AGGREGATE_PROMPT.format(responses=formatted)
        messages = [
            {"role": "system", "content": aggregate_content},
            {"role": "user", "content": user_query},
        ]
    # Final layer: Aggregator synthesizes the last set of responses
    final = await get_response(AGGREGATOR_MODEL, messages)
    return final
# Run
result = asyncio.run(moa_pipeline("Explain the concept of mixture of experts."))
print(result)
Original Abstract
Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.