Adaptive Vision-Language Model Routing for Computer Use Agents

Mar 13, 2026 • Xunzhuo Liu, Bowei He, Xue Liu +3

TL;DR Highlight

A routing framework for GUI automation agents that picks a 7B or 72B model per action based on its difficulty, cutting inference cost by up to 78%.

Who Should Read

Developers operating desktop automation or GUI agent systems, especially engineers who want to cut the cost of CUA pipelines that call a large model such as GPT-4o or Claude at every step.

Core Mechanics

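GPT-4o (reportedly ~1.8T parameters) scores 0.8% accuracy on ScreenSpot-Pro, while OS-Atlas-7B (7B) scores 18.9%: model size does not guarantee GUI performance.
AVR (Adaptive VLM Routing): for each action, estimate the difficulty and check the 7B model's logprob-based confidence. If the confidence is high enough the action stays on the 7B model; otherwise it is routed to the 72B model.
A 120M-parameter SigLIP + MiniLM embedder computes visual and textual difficulty separately, and the higher of the two is used as the final difficulty.
Injecting memory (a record of prior UI interactions) into the 7B prompt raises confidence from 0.83 to 0.96, so the 7B model can handle the action without escalation.
A 3-tier structure: high-risk actions carrying a safety flag (e.g., clicking 'Delete All') are automatically routed to the 72B model plus a guardrail.
With the Qwen2.5-VL 7B/72B pair, cost savings are 52% cold, 70% warm, and 78% warm with difficulty classification, while accuracy stays within 2 percentage points of the all-72B baseline.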
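
The difficulty estimate described above (nearest hard prototype per modality, then the maximum of the two scores) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy 2-D vectors are assumptions, standing in for real SigLIP and MiniLM embeddings and for a prototype bank of known-hard UI elements.

```python
import numpy as np

def max_cosine_sim(query: np.ndarray, prototypes: np.ndarray) -> float:
    """Highest cosine similarity between one query vector and each prototype row."""
    q = query / np.linalg.norm(query)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return float(np.max(p @ q))

def estimate_difficulty(vis_emb, txt_emb, hard_vis_protos, hard_txt_protos) -> float:
    """Take the larger of the visual and textual scores (the conservative choice)."""
    d_vis = max_cosine_sim(vis_emb, hard_vis_protos)
    d_sem = max_cosine_sim(txt_emb, hard_txt_protos)
    return max(d_vis, d_sem)

# Toy 2-D vectors (assumed values) instead of real embeddings.
hard_vis = np.array([[1.0, 0.0], [0.0, 1.0]])   # two "hard UI" prototypes
hard_txt = np.array([[1.0, 1.0]])               # one "hard description" prototype
score = estimate_difficulty(np.array([3.0, 4.0]), np.array([1.0, 0.0]),
                            hard_vis, hard_txt)
print(round(score, 3))  # visual score 0.8 beats textual score ~0.707
```

Taking the maximum means an action counts as hard if either the screenshot or the instruction looks hard, which biases the router toward escalation rather than toward a wrong cheap answer.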

Evidence

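On the OpenClaw benchmark, memory injection raises 7B confidence from 0.83 to 0.96, 100% of turns are handled by the 7B model alone, and cost drops from $0.0558 to $0.0080 (an 86% reduction).
Across the 26 applications in ScreenSpot-Pro, Qwen2.5-VL-72B's per-app accuracy varies by a reported 7x: above 35% on VS Code versus below 15% on Premiere Pro.
Warm AVR with difficulty classification: 10% escalation rate, 42.8% effective accuracy (0.8 percentage points below the all-72B baseline of 43.6%), and cost falling from $0.27 to $0.06.
The Visual Confused Deputy guardrail detects high-risk actions with F1 = 0.889 (image only) and 0.915 (image + text fusion).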
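
The percentages quoted above follow directly from the reported dollar figures. The short check below recomputes them; the helper name `cost_reduction` is an assumption introduced for illustration, and the inputs are the numbers stated in this section.

```python
def cost_reduction(baseline_cost: float, routed_cost: float) -> float:
    """Fractional savings of the routed setup relative to the baseline."""
    return 1.0 - routed_cost / baseline_cost

openclaw = cost_reduction(0.0558, 0.0080)   # OpenClaw, memory-injected 7B only
warm_avr = cost_reduction(0.27, 0.06)       # warm AVR + difficulty classification
accuracy_gap = 43.6 - 42.8                  # all-72B accuracy minus AVR accuracy

print(f"{openclaw:.0%} {warm_avr:.0%} {accuracy_gap:.1f}")  # 86% 78% 0.8
```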

How to Apply

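Add a router layer between the CUA orchestrator and the VLMs: for each tool call, embed the screenshot crop to get a difficulty score, then send a non-streaming logprob probe to the 7B model. If confidence >= threshold, use the 7B response; otherwise re-issue the request to the 72B model.
When the agent uses a particular app repeatedly, store memories of where it clicked and whether the action succeeded or failed in a vector DB, and inject the relevant memories into the 7B prompt for similar actions in later sessions. After a cold start, 5 to 10 interactions are enough to reach the warm effect.
The approach pays off more for repetitive enterprise automation (test pipelines, RPA) than for one-off tasks of 10 actions or fewer: a session needs at least 10 steps to offset the probe overhead.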
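
As a rough sketch of the memory step above, the snippet below keeps past interactions in a toy in-memory store and prepends the closest matches to the 7B prompt. Everything here is an assumption made for illustration: the `InteractionMemory` class, the record fields, the prompt wording, and the 2-D vectors that stand in for real embeddings. A production setup would use an actual vector DB.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class InteractionMemory:
    """Toy in-memory stand-in for a vector DB of past UI interactions."""
    vectors: list = field(default_factory=list)
    records: list = field(default_factory=list)

    def save(self, embedding: np.ndarray, note: str, success: bool) -> None:
        # Store unit-length vectors so a dot product equals cosine similarity.
        self.vectors.append(embedding / np.linalg.norm(embedding))
        self.records.append({"note": note, "success": success})

    def retrieve(self, query: np.ndarray, top_k: int = 5) -> list:
        if not self.vectors:
            return []
        sims = np.stack(self.vectors) @ (query / np.linalg.norm(query))
        order = np.argsort(-sims)[:top_k]   # most similar first
        return [self.records[i] for i in order]

def inject_memories(prompt: str, memories: list) -> str:
    """Prepend retrieved interaction notes to the prompt sent to the small model."""
    if not memories:
        return prompt
    lines = [f"- {m['note']} ({'succeeded' if m['success'] else 'failed'})"
             for m in memories]
    return "Past interactions in this app:\n" + "\n".join(lines) + "\n\n" + prompt

store = InteractionMemory()
store.save(np.array([1.0, 0.0]), "Clicked 'Export' at (812, 44)", True)
store.save(np.array([0.0, 1.0]), "Clicked 'Render' at (120, 600)", False)
hits = store.retrieve(np.array([0.9, 0.1]), top_k=1)
augmented = inject_memories("Click the Export button.", hits)
print(augmented)
```

Recording failures as well as successes lets the small model avoid repeating a click that previously missed.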

Code Example

# Core AVR routing logic (pseudocode). embed_image, embed_text, max_cosine_sim,
# is_high_risk, inject_memories, and the two prototype arrays are assumed to be
# defined elsewhere.
import numpy as np

def route_tool_call(screenshot_crop, action_desc, prompt, small_vlm, large_vlm, memory_store):
    # 1. Estimate difficulty (120M embedder)
    vis_emb = embed_image(screenshot_crop)   # SigLIP
    txt_emb = embed_text(action_desc)        # MiniLM-L6-v2

    d_vis = max_cosine_sim(vis_emb, hard_ui_prototypes)
    d_sem = max_cosine_sim(txt_emb, hard_desc_prototypes)
    difficulty = max(d_vis, d_sem)  # conservative estimate

    # 2. Safety check (Visual Confused Deputy)
    if is_high_risk(vis_emb, txt_emb):
        return large_vlm.generate(prompt, guardrail=True)

    # 3. Easy actions go to the 7B model without a probe
    if difficulty < 0.3:
        return small_vlm.generate(prompt)

    # 4. Inject memories, then probe the 7B model
    memories = memory_store.retrieve(action_desc, top_k=5)
    augmented_prompt = inject_memories(prompt, memories)

    response, logprobs = small_vlm.probe(augmented_prompt)  # non-streaming

    # 5. Difficulty-adaptive threshold
    tau_easy, tau_hard = 0.80, 0.92
    threshold = tau_easy + (tau_hard - tau_easy) * difficulty

    # Map the mean token logprob from [l_min, 0] onto [0, 1]; clip anything below l_min
    l_min = -3.0
    confidence = float(np.clip((np.mean(logprobs) - l_min) / abs(l_min), 0.0, 1.0))

    if confidence >= threshold:
        memory_store.save_success(action_desc, response)  # accumulate memory
        return response
    else:
        return large_vlm.generate(prompt)
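
To make the escalation rule concrete, here is a small worked example of the confidence normalization and the difficulty-adaptive threshold, using the constants from the pseudocode (l_min = -3.0, tau_easy = 0.80, tau_hard = 0.92). The function names and sample logprobs are assumptions, and clipping the confidence to [0, 1] is an added safeguard rather than something the source specifies.

```python
import numpy as np

L_MIN = -3.0                      # logprob floor that maps to confidence 0
TAU_EASY, TAU_HARD = 0.80, 0.92   # thresholds at difficulty 0 and difficulty 1

def confidence_from_logprobs(logprobs, l_min: float = L_MIN) -> float:
    """Rescale the mean token logprob from [l_min, 0] to [0, 1]."""
    mean_lp = float(np.mean(logprobs))
    return float(np.clip((mean_lp - l_min) / -l_min, 0.0, 1.0))

def adaptive_threshold(difficulty: float) -> float:
    """Linear interpolation: harder actions require higher confidence."""
    return TAU_EASY + (TAU_HARD - TAU_EASY) * difficulty

def should_escalate(logprobs, difficulty: float) -> bool:
    return confidence_from_logprobs(logprobs) < adaptive_threshold(difficulty)

confident = [-0.1, -0.2, -0.3]    # mean -0.2 -> confidence ~0.933
hesitant = [-0.6, -0.9, -0.6]     # mean -0.7 -> confidence ~0.767

print(round(adaptive_threshold(0.5), 2))   # 0.86
print(should_escalate(confident, 0.5))     # False: stay on the 7B model
print(should_escalate(hesitant, 0.5))      # True: hand off to the 72B model
```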

Terminology

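CUA (Computer Use Agent): An AI agent that looks at screenshots and performs real computer operations such as clicking, typing, and scrolling. It operates the GUI automatically in place of a person.

VLM (Vision-Language Model): A multimodal model that understands images and text together. Models such as GPT-4o and Qwen2.5-VL fall into this category.

GUI Grounding: The ability to find the exact pixel coordinates on screen when given a command such as "click the Submit button". In human terms, it is spotting the button you want with your eyes.

logprob (log probability): A number indicating how sure the model is about the next token. Values close to 0 mean high confidence; more negative values mean more uncertainty.

Escalation: Handing a request over to an expensive large model when the cheap small model is not confident. It is similar to a call-center agent passing a hard problem to a manager.

Warm Agent: An agent that has accumulated a memory of prior interactions (for example, where each button was located). A Cold Agent has no memory and has to work everything out from scratch each time.

SigLIP: A lightweight embedding model that converts images into numeric vectors. It quantifies the content of an image so that similarity can be compared.

Contrastive KB (Knowledge Base): A reference database of example embeddings for easy and hard UI elements. A new action is compared against this database to estimate its difficulty.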

Original Abstract

Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For warm agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: https://github.com/vllm-project/semantic-router.