Adaptive Vision-Language Model Routing for Computer Use Agents
TL;DR Highlight
A routing framework for GUI automation agents that auto-selects between 7B and 72B models based on estimated action difficulty, cutting inference costs by up to 78%.
Who Should Read
Teams deploying GUI automation agents in production who need to balance capability and cost, and researchers working on adaptive model selection for agents.
Core Mechanics
- Proposed a routing framework that dynamically selects between a cheap small model (7B) and an expensive large model (72B) based on estimated action difficulty
- Difficulty estimation is fast and lightweight — doesn't require running the large model to decide
- Achieves up to 78% cost reduction compared to always using the 72B model
- Performance is maintained or improved compared to always using the small model
- The routing decision is based on visual and textual complexity signals from the current GUI state
- Framework is modular and can work with any pair of models — not limited to specific architectures
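The routing rule described above can be sketched as a simple cost-accuracy policy: route each action to the cheapest model whose predicted accuracy meets a target reliability. The function names, per-call costs, and the accuracy model below are illustrative placeholders, not values from the paper:

```python
def choose_model(difficulty, acc_small_fn, cost_small, cost_large, target=0.95):
    # Route to the cheapest model whose predicted accuracy meets the target.
    # acc_small_fn maps an estimated difficulty in [0, 1] to the small model's
    # predicted accuracy; all numbers here are illustrative assumptions.
    if acc_small_fn(difficulty) >= target:
        return "7B", cost_small
    return "72B", cost_large
```

With a toy accuracy model like `lambda d: 1 - 0.5 * d`, easy actions (low difficulty) stay on the cheap model and hard ones escalate.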
Evidence
- 78% cost reduction vs always-72B baseline on GUI automation benchmarks
- Performance matches or exceeds always-7B baseline on task completion rate
- Difficulty estimation adds negligible latency overhead
- Routing decisions are interpretable — can audit which actions triggered large model use
How to Apply
- Deploy alongside existing GUI agent systems as an inference-time routing layer
- Tune the difficulty threshold to match your cost/performance requirements; a lower threshold routes more actions to the 72B model, yielding higher quality at higher cost
- Start by profiling your actual task distribution to calibrate difficulty signals before production deployment
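The profiling step above can be sketched as a threshold sweep over logged per-action difficulty scores, estimating how often each setting would escalate to the large model. The per-call costs here are placeholder values, not figures from the paper:

```python
import numpy as np

def profile_thresholds(difficulties, cost_small=1.0, cost_large=10.0):
    # Sweep the routing threshold over a profiled difficulty distribution
    # to see what fraction of calls escalate to the large model, and the
    # resulting cost saving vs. an always-large baseline.
    rows = []
    for thr in np.arange(0.1, 1.0, 0.2):
        frac_large = float(np.mean(np.asarray(difficulties) >= thr))
        avg_cost = frac_large * cost_large + (1 - frac_large) * cost_small
        saving = 1 - avg_cost / cost_large
        rows.append((round(float(thr), 1), frac_large, round(saving, 3)))
    return rows
```

Running this on your own difficulty logs gives a concrete cost/escalation curve to pick a threshold from before production deployment.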
Code Example
# AVR routing logic, core pseudocode (helper functions are illustrative)
import numpy as np

def route_tool_call(screenshot_crop, action_desc, small_vlm, large_vlm, memory_store):
    prompt = build_prompt(screenshot_crop, action_desc)  # assumed prompt builder

    # 1. Difficulty estimation (lightweight ~120M embedders)
    vis_emb = embed_image(screenshot_crop)  # SigLIP
    txt_emb = embed_text(action_desc)       # MiniLM-L6-v2
    d_vis = max_cosine_sim(vis_emb, hard_ui_prototypes)
    d_sem = max_cosine_sim(txt_emb, hard_desc_prototypes)
    difficulty = max(d_vis, d_sem)  # conservative: take the harder signal

    # 2. Safety check (Visual Confused Deputy guardrail)
    if is_high_risk(vis_emb, txt_emb):
        return large_vlm.generate(prompt, guardrail=True)

    # 3. Easy actions go straight to the 7B model without probing
    if difficulty < 0.3:
        return small_vlm.generate(prompt)

    # 4. Probe the 7B model after memory injection
    memories = memory_store.retrieve(action_desc, top_k=5)
    augmented_prompt = inject_memories(prompt, memories)
    response, logprobs = small_vlm.probe(augmented_prompt)  # non-streaming

    # 5. Difficulty-adaptive confidence threshold
    tau_easy, tau_hard = 0.80, 0.92
    threshold = tau_easy + (tau_hard - tau_easy) * difficulty
    l_min = -3.0  # floor for normalizing the mean log-prob into [0, 1]
    confidence = (np.mean(logprobs) + abs(l_min)) / abs(l_min)
    if confidence >= threshold:
        memory_store.save_success(action_desc, response)  # accumulate memory
        return response
    return large_vlm.generate(prompt)
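The max_cosine_sim helper used in the pseudocode is not defined there; a plausible NumPy implementation, assuming a 2-D array of prototype embeddings, might look like this:

```python
import numpy as np

def max_cosine_sim(emb, prototypes):
    # Maximum cosine similarity between one embedding and a bank of
    # "hard" prototype embeddings; serves as the difficulty signal.
    emb = np.asarray(emb, dtype=float)
    protos = np.asarray(prototypes, dtype=float)
    sims = protos @ emb / (np.linalg.norm(protos, axis=1) * np.linalg.norm(emb) + 1e-8)
    return float(sims.max())
```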
Original Abstract
Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For "warm" agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: https://github.com/vllm-project/semantic-router.