Adaptive Vision-Language Model Routing for Computer Use Agents
TL;DR Highlight
A routing framework for GUI automation agents that auto-selects between 7B and 72B models based on estimated action difficulty, cutting inference costs by up to 78%.
Who Should Read
Teams deploying GUI automation agents in production who need to balance capability and cost, and researchers working on adaptive model selection for agents.
Core Mechanics
- Proposed a routing framework that dynamically selects between a cheap small model (7B) and an expensive large model (72B) based on estimated action difficulty
- Difficulty estimation is fast and lightweight — doesn't require running the large model to decide
- Achieves up to 78% cost reduction compared to always using the 72B model
- Performance is maintained or improved compared to always using the small model
- The routing decision is based on visual and textual complexity signals from the current GUI state
- Framework is modular and can work with any pair of models — not limited to specific architectures
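The routing rule described above can be sketched as a simple cost-accuracy policy: route each action to the cheapest model whose predicted accuracy meets a target reliability. The function names, per-call costs, and the accuracy model below are illustrative placeholders, not values from the paper:

```python
def choose_model(difficulty, acc_small_fn, cost_small, cost_large, target=0.95):
    # Route to the cheapest model whose predicted accuracy meets the target.
    # acc_small_fn maps an estimated difficulty in [0, 1] to the small model's
    # predicted accuracy; all numbers here are illustrative assumptions.
    if acc_small_fn(difficulty) >= target:
        return "7B", cost_small
    return "72B", cost_large
```

With a toy accuracy model like `lambda d: 1 - 0.5 * d`, easy actions (low difficulty) stay on the cheap model and hard ones escalate.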
Evidence
- 78% cost reduction vs always-72B baseline on GUI automation benchmarks
- Performance matches or exceeds always-7B baseline on task completion rate
- Difficulty estimation adds negligible latency overhead
- Routing decisions are interpretable — can audit which actions triggered large model use
How to Apply
- Deploy alongside existing GUI agent systems as an inference-time routing layer
- Tune the difficulty threshold to match your cost/performance requirements; a lower threshold routes more actions to the 72B model, yielding higher quality at higher cost
- Start by profiling your actual task distribution to calibrate difficulty signals before production deployment
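The profiling step above can be sketched as a threshold sweep over logged per-action difficulty scores, estimating how often each setting would escalate to the large model. The per-call costs here are placeholder values, not figures from the paper:

```python
import numpy as np

def profile_thresholds(difficulties, cost_small=1.0, cost_large=10.0):
    # Sweep the routing threshold over a profiled difficulty distribution
    # to see what fraction of calls escalate to the large model, and the
    # resulting cost saving vs. an always-large baseline.
    rows = []
    for thr in np.arange(0.1, 1.0, 0.2):
        frac_large = float(np.mean(np.asarray(difficulties) >= thr))
        avg_cost = frac_large * cost_large + (1 - frac_large) * cost_small
        saving = 1 - avg_cost / cost_large
        rows.append((round(float(thr), 1), frac_large, round(saving, 3)))
    return rows
```

Running this on your own difficulty logs gives a concrete cost/escalation curve to pick a threshold from before production deployment.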
Code Example
# AVR routing logic, core pseudocode (helper functions are illustrative)
import numpy as np

def route_tool_call(screenshot_crop, action_desc, small_vlm, large_vlm, memory_store):
    prompt = build_prompt(screenshot_crop, action_desc)  # assumed prompt builder

    # 1. Difficulty estimation (lightweight ~120M embedders)
    vis_emb = embed_image(screenshot_crop)  # SigLIP
    txt_emb = embed_text(action_desc)       # MiniLM-L6-v2
    d_vis = max_cosine_sim(vis_emb, hard_ui_prototypes)
    d_sem = max_cosine_sim(txt_emb, hard_desc_prototypes)
    difficulty = max(d_vis, d_sem)  # conservative: take the harder signal

    # 2. Safety check (Visual Confused Deputy guardrail)
    if is_high_risk(vis_emb, txt_emb):
        return large_vlm.generate(prompt, guardrail=True)

    # 3. Easy actions go straight to the 7B model without probing
    if difficulty < 0.3:
        return small_vlm.generate(prompt)

    # 4. Probe the 7B model after memory injection
    memories = memory_store.retrieve(action_desc, top_k=5)
    augmented_prompt = inject_memories(prompt, memories)
    response, logprobs = small_vlm.probe(augmented_prompt)  # non-streaming

    # 5. Difficulty-adaptive confidence threshold
    tau_easy, tau_hard = 0.80, 0.92
    threshold = tau_easy + (tau_hard - tau_easy) * difficulty
    l_min = -3.0  # floor for normalizing the mean log-prob into [0, 1]
    confidence = (np.mean(logprobs) + abs(l_min)) / abs(l_min)
    if confidence >= threshold:
        memory_store.save_success(action_desc, response)  # accumulate memory
        return response
    return large_vlm.generate(prompt)
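The max_cosine_sim helper used in the pseudocode is not defined there; a plausible NumPy implementation, assuming a 2-D array of prototype embeddings, might look like this:

```python
import numpy as np

def max_cosine_sim(emb, prototypes):
    # Maximum cosine similarity between one embedding and a bank of
    # "hard" prototype embeddings; serves as the difficulty signal.
    emb = np.asarray(emb, dtype=float)
    protos = np.asarray(prototypes, dtype=float)
    sims = protos @ emb / (np.linalg.norm(protos, axis=1) * np.linalg.norm(emb) + 1e-8)
    return float(sims.max())
```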
Original Abstract
Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For "warm" agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: https://github.com/vllm-project/semantic-router.