Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization
TL;DR Highlight
An ant colony optimization (ACO) based routing framework, AMRO-S, that distributes queries across multiple LLM agents, cutting cost while delivering a 4.7x throughput improvement over static routing.
Who Should Read
Teams running multi-agent LLM systems at scale who need to optimize cost and throughput, and researchers working on query routing for heterogeneous LLM deployments.
Core Mechanics
- Applies ant colony optimization (ACO) to the problem of routing queries across a fleet of LLM agents with different capabilities and costs
- ACO-based routing adapts dynamically to agent load, performance, and cost to find optimal routing paths
- Achieves 4.7x throughput improvement compared to static routing strategies
- Cost reduction comes from routing simple queries to cheaper models and complex ones to capable models
- The routing policy continuously updates based on observed agent performance (pheromone-like feedback)
- Works with heterogeneous agent pools mixing different model families and sizes
Evidence
- 4.7x throughput improvement over baseline static routing on benchmark query workloads
- Significant cost reduction by routing ~60-70% of queries to smaller, cheaper models
- Quality maintained: overall task completion rate matches or exceeds always-best-model baseline
- Dynamic adaptation handles agent failures and load spikes gracefully
How to Apply
- Deploy as a routing layer in front of your multi-agent system: queries enter the router, which dispatches them to the appropriate agents
- Seed the ACO algorithm with initial routing weights based on your known agent capabilities
- Monitor the pheromone trail evolution to understand which agents are being favored and why
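A minimal sketch of the seeding and monitoring steps, assuming pheromone is kept as one value per agent and seeded proportionally to prior capability scores. The agent names, scores, and the `favored_agents` helper are illustrative, not part of the paper's API.

```python
# Hypothetical capability priors per agent (higher = more capable); values are illustrative
capability_prior = {'small-1b': 0.60, 'mid-8b': 0.80, 'large-70b': 0.95}

# Seed the pheromone trail from the prior instead of starting uniform
base = 1.0
tau = {agent: base * score for agent, score in capability_prior.items()}

def favored_agents(tau, top_k=2):
    # Rank agents by pheromone mass, a simple lens on which agents the router favors
    return sorted(tau, key=tau.get, reverse=True)[:top_k]

print(favored_agents(tau))  # prints ['large-70b', 'mid-8b']
```

Logging `favored_agents` over time shows how the trail drifts away from the seed as observed performance accumulates.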
Code Example
# AMRO-S Core Logic (Pseudocode)
# Undefined names (tau, eta, rho, Q, eps, task_types, query, sample_path,
# execute_agents, llm_judge, cost) stand in for the paper's components.

# 1. Classify query intent with the SFT-trained small router
router = SFTSmallLanguageModel('Llama-3.2-1B-Instruct')
w = router.get_task_weights(query)  # e.g. {'math': 0.8, 'code': 0.1, 'general': 0.1}

# 2. Fuse task-specific pheromone matrices
# tau[t]: path-preference (pheromone) matrix for task t
tau_fused = sum(w[t] * tau[t] for t in ['math', 'code', 'general'])

# 3. Compute ACO transition probabilities (balance exploitation vs. exploration)
def get_transition_prob(tau_fused, eta, i, allowed_nodes, alpha=1.0, beta=2.0, gamma=0.1):
    # tau_fused[i][j]: fused pheromone on edge (i, j); eta[j]: heuristic desirability of j
    scores = {j: (tau_fused[i][j] ** alpha) * (eta[j] ** beta) for j in allowed_nodes}
    total = sum(scores.values())
    probs = {j: s / total for j, s in scores.items()}
    # Mix in a uniform term so every node keeps at least gamma/|allowed_nodes| probability
    return {j: gamma * (1 / len(allowed_nodes)) + (1 - gamma) * probs[j]
            for j in allowed_nodes}

# 4. Sample a path over the agent graph and execute the agents along it
path = sample_path(tau_fused, eta, get_transition_prob)
output = execute_agents(path, query)

# 5. Quality-gated asynchronous update (runs off the serving path, adding no latency)
if llm_judge(query, path, output) == 1:  # only high-quality results reinforce the trail
    for t in task_types:
        for (i, j) in path:
            # Evaporate at rate rho, then deposit reward Q weighted by the intent
            # weight w[t] and discounted by path cost (eps guards against divide-by-zero)
            tau[t][i][j] = (1 - rho) * tau[t][i][j] + w[t] * Q / (cost(path) + eps)
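The pseudocode above is not directly executable, so here is a runnable instantiation of just the transition-probability step (step 3) with toy pheromone and heuristic values. All numbers are illustrative; only the formula structure follows the pseudocode.

```python
def transition_probs(tau_row, eta, allowed, alpha=1.0, beta=2.0, gamma=0.1):
    # tau_row[j]: pheromone from the current node to j; eta[j]: heuristic desirability
    scores = {j: (tau_row[j] ** alpha) * (eta[j] ** beta) for j in allowed}
    total = sum(scores.values())
    probs = {j: s / total for j, s in scores.items()}
    # Uniform mixing guarantees every node keeps at least gamma/|allowed| probability
    return {j: gamma / len(allowed) + (1 - gamma) * probs[j] for j in allowed}

tau_row = {'A': 2.0, 'B': 1.0, 'C': 0.5}   # toy fused pheromone values
eta     = {'A': 0.9, 'B': 0.5, 'C': 0.8}   # toy heuristic (e.g. inverse cost)
p = transition_probs(tau_row, eta, ['A', 'B', 'C'])
```

Note how the `gamma` term puts a floor under every node's probability, so low-pheromone agents are never starved of exploration traffic.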
Original Abstract
Large Language Model (LLM)-driven Multi-Agent Systems (MAS) have demonstrated strong capability in complex reasoning and tool use, and heterogeneous agent pools further broaden the quality–cost trade-off space. Despite these advances, real-world deployment is often constrained by high inference cost, latency, and limited transparency, which hinders scalable and efficient routing. Existing routing strategies typically rely on expensive LLM-based selectors or static policies, and offer limited controllability for semantic-aware routing under dynamic loads and mixed intents, often resulting in unstable performance and inefficient resource utilization. To address these limitations, we propose AMRO-S, an efficient and interpretable routing framework for Multi-Agent Systems (MAS). AMRO-S models MAS routing as a semantic-conditioned path selection problem, enhancing routing performance through three key mechanisms: First, it leverages a supervised fine-tuned (SFT) small language model for intent inference, providing a low-overhead semantic interface for each query; second, it decomposes routing memory into task-specific pheromone specialists, reducing cross-task interference and optimizing path selection under mixed workloads; finally, it employs a quality-gated asynchronous update mechanism to decouple inference from learning, optimizing routing without increasing latency. Extensive experiments on five public benchmarks and high-concurrency stress tests demonstrate that AMRO-S consistently improves the quality–cost trade-off over strong routing baselines, while providing traceable routing evidence through structured pheromone patterns.