Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization
TL;DR Highlight
An ant colony optimization (ACO) based routing framework, AMRO-S, that distributes queries across multiple LLM agents, cutting cost while delivering a 4.7x throughput improvement over static routing.
Who Should Read
Teams running multi-agent LLM systems at scale who need to optimize cost and throughput, and researchers working on query routing for heterogeneous LLM deployments.
Core Mechanics
- Applies ant colony optimization (ACO) to the problem of routing queries across a fleet of LLM agents with different capabilities and costs
- ACO-based routing adapts dynamically to agent load, performance, and cost to find optimal routing paths
- Achieves 4.7x throughput improvement compared to static routing strategies
- Cost reduction comes from routing simple queries to cheaper models and complex ones to capable models
- The routing policy continuously updates based on observed agent performance (pheromone-like feedback)
- Works with heterogeneous agent pools mixing different model families and sizes
Evidence
- 4.7x throughput improvement over baseline static routing on benchmark query workloads
- Significant cost reduction by routing ~60-70% of queries to smaller, cheaper models
- Quality maintained: overall task completion rate matches or exceeds always-best-model baseline
- Dynamic adaptation handles agent failures and load spikes gracefully
How to Apply
- Deploy as a routing layer in front of your multi-agent system: queries enter the router, which dispatches them to the appropriate agents
- Seed the ACO algorithm with initial routing weights based on your known agent capabilities
- Monitor the pheromone trail evolution to understand which agents are being favored and why
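A minimal sketch of the seeding and monitoring steps, assuming pheromone is kept as one value per agent and seeded proportionally to prior capability scores. The agent names, scores, and the `favored_agents` helper are illustrative, not part of the paper's API.

```python
# Hypothetical capability priors per agent (higher = more capable); values are illustrative
capability_prior = {'small-1b': 0.60, 'mid-8b': 0.80, 'large-70b': 0.95}

# Seed the pheromone trail from the prior instead of starting uniform
base = 1.0
tau = {agent: base * score for agent, score in capability_prior.items()}

def favored_agents(tau, top_k=2):
    # Rank agents by pheromone mass, a simple lens on which agents the router favors
    return sorted(tau, key=tau.get, reverse=True)[:top_k]

print(favored_agents(tau))  # prints ['large-70b', 'mid-8b']
```

Logging `favored_agents` over time shows how the trail drifts away from the seed as observed performance accumulates.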
Code Example
# AMRO-S Core Logic (Pseudocode)
# Undefined names (tau, eta, rho, Q, eps, task_types, query, sample_path,
# execute_agents, llm_judge, cost) stand in for the paper's components.

# 1. Classify query intent with the SFT-trained small router
router = SFTSmallLanguageModel('Llama-3.2-1B-Instruct')
w = router.get_task_weights(query)  # e.g. {'math': 0.8, 'code': 0.1, 'general': 0.1}

# 2. Fuse task-specific pheromone matrices
# tau[t]: path-preference (pheromone) matrix for task t
tau_fused = sum(w[t] * tau[t] for t in ['math', 'code', 'general'])

# 3. Compute ACO transition probabilities (balance exploitation vs. exploration)
def get_transition_prob(tau_fused, eta, i, allowed_nodes, alpha=1.0, beta=2.0, gamma=0.1):
    # tau_fused[i][j]: fused pheromone on edge (i, j); eta[j]: heuristic desirability of j
    scores = {j: (tau_fused[i][j] ** alpha) * (eta[j] ** beta) for j in allowed_nodes}
    total = sum(scores.values())
    probs = {j: s / total for j, s in scores.items()}
    # Mix in a uniform term so every node keeps at least gamma/|allowed_nodes| probability
    return {j: gamma * (1 / len(allowed_nodes)) + (1 - gamma) * probs[j]
            for j in allowed_nodes}

# 4. Sample a path over the agent graph and execute the agents along it
path = sample_path(tau_fused, eta, get_transition_prob)
output = execute_agents(path, query)

# 5. Quality-gated asynchronous update (runs off the serving path, adding no latency)
if llm_judge(query, path, output) == 1:  # only high-quality results reinforce the trail
    for t in task_types:
        for (i, j) in path:
            # Evaporate at rate rho, then deposit reward Q weighted by the intent
            # weight w[t] and discounted by path cost (eps guards against divide-by-zero)
            tau[t][i][j] = (1 - rho) * tau[t][i][j] + w[t] * Q / (cost(path) + eps)
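The pseudocode above is not directly executable, so here is a runnable instantiation of just the transition-probability step (step 3) with toy pheromone and heuristic values. All numbers are illustrative; only the formula structure follows the pseudocode.

```python
def transition_probs(tau_row, eta, allowed, alpha=1.0, beta=2.0, gamma=0.1):
    # tau_row[j]: pheromone from the current node to j; eta[j]: heuristic desirability
    scores = {j: (tau_row[j] ** alpha) * (eta[j] ** beta) for j in allowed}
    total = sum(scores.values())
    probs = {j: s / total for j, s in scores.items()}
    # Uniform mixing guarantees every node keeps at least gamma/|allowed| probability
    return {j: gamma / len(allowed) + (1 - gamma) * probs[j] for j in allowed}

tau_row = {'A': 2.0, 'B': 1.0, 'C': 0.5}   # toy fused pheromone values
eta     = {'A': 0.9, 'B': 0.5, 'C': 0.8}   # toy heuristic (e.g. inverse cost)
p = transition_probs(tau_row, eta, ['A', 'B', 'C'])
```

Note how the `gamma` term puts a floor under every node's probability, so low-pheromone agents are never starved of exploration traffic.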
Original Abstract
Large Language Model (LLM)-driven Multi-Agent Systems (MAS) have demonstrated strong capability in complex reasoning and tool use, and heterogeneous agent pools further broaden the quality–cost trade-off space. Despite these advances, real-world deployment is often constrained by high inference cost, latency, and limited transparency, which hinders scalable and efficient routing. Existing routing strategies typically rely on expensive LLM-based selectors or static policies, and offer limited controllability for semantic-aware routing under dynamic loads and mixed intents, often resulting in unstable performance and inefficient resource utilization. To address these limitations, we propose AMRO-S, an efficient and interpretable routing framework for Multi-Agent Systems (MAS). AMRO-S models MAS routing as a semantic-conditioned path selection problem, enhancing routing performance through three key mechanisms: First, it leverages a supervised fine-tuned (SFT) small language model for intent inference, providing a low-overhead semantic interface for each query; second, it decomposes routing memory into task-specific pheromone specialists, reducing cross-task interference and optimizing path selection under mixed workloads; finally, it employs a quality-gated asynchronous update mechanism to decouple inference from learning, optimizing routing without increasing latency. Extensive experiments on five public benchmarks and high-concurrency stress tests demonstrate that AMRO-S consistently improves the quality–cost trade-off over strong routing baselines, while providing traceable routing evidence through structured pheromone patterns.