GUI-R1: A Generalist GUI Agent Based on Reinforcement Fine-Tuning

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Apr 14, 2025•Run Luo, Lu Wang, Wanwei He +1

TL;DR Highlight

An RL-based training method that produces a GUI agent stronger than the previous SOTA while using only 0.02% of the training data.

Who Should Read

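ML engineers who build or evaluate GUI automation agents that look at a computer screen and click or type. Developers who find LLM fine-tuning costs burdensome and want better data efficiency.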

Core Mechanics

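SFT (supervised fine-tuning, where the model learns by imitating reference answers) needs training data on the order of 13M samples, whereas RFT (reinforcement fine-tuning) reaches better performance with only 3K samples.
Actions from five platforms (Windows, Linux, MacOS, Android, and Web) are merged into a single 'unified action space', so the model trains on all of them jointly without cross-platform conflicts.
The reward function is split into three parts (action type, click point, and input text), each scored by a verifiable rule.
With GRPO (Group Relative Policy Optimization), the model samples several responses per prompt, each response is scored by the rule-based reward, and the policy is updated toward the responses that score above the group average.
Doubling the image resolution (1M → 2M pixels) improves both convergence speed and the performance ceiling. UI elements are small, so high resolution matters.
The format reward converges quickly early in training, so raising the weight of the accuracy reward (β) later in training benefits performance.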
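
The group-relative scoring step of GRPO can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name, the `eps` constant, and the reward values are assumptions, and the population standard deviation is used here although implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the scalar rewards of all responses sampled for ONE prompt.
    Responses above the group mean get a positive advantage, the rest negative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four sampled responses to the same prompt
rewards = [0.2, 1.0, 1.8, 0.2]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Because the advantage is relative to the group, no separate value network is needed to provide a baseline. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.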

Evidence

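ScreenSpot, 3B models: SFT 63.55 → GUI-R1-3B 80.08 (reported 26.3% improvement); ScreenSpot-Pro: 13.80 → 25.23 (82.8% improvement).
Overall average success rate on low-level tasks: Qwen2.5-VL-3B 55.65 → GUI-R1-3B 80.88 (same 3B scale).
High-level tasks (GUI-Odyssey step success rate): GPT-4o 5.36 → GUI-R1-3B 41.33 (about 7.7×).
Compared with the competing RFT model UI-R1-3B, GUI-R1-3B leads by 10 points on low-level tasks and by an average of 3.4 points on high-level tasks.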

How to Apply

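If you are building a GUI automation agent, consider a pipeline that filters a small, high-quality dataset (~3K samples) and trains with GRPO-based RFT instead of collecting large-scale SFT data.
If you need multi-platform support, do not split platform-specific actions across separate models. Design a unified space of common actions such as click/scroll/type/enter, which allows joint training without data conflicts.
When designing the reward function, weight the format reward (α=0.2) and the accuracy reward (β=0.8) asymmetrically for better performance. The format is easy to learn, so keep the focus on accuracy.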
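
One way to picture the unified action space is a small normalization layer that maps platform-specific action names onto the common set before training. The sketch below is hypothetical: the alias tables, the `to_unified` function, and the use of `None` for a missing point are illustrative assumptions, not the paper's data pipeline. Only the unified action names and the "no input text" default come from the prompt format shown in this summary.

```python
# Unified action names, as listed in the prompt format of this summary
UNIFIED_ACTIONS = {
    "click", "scroll", "type", "enter", "select",
    "press_home", "press_back", "complete", "close",
}

# Hypothetical platform-specific action names (assumed for illustration only)
PLATFORM_ALIASES = {
    "android": {"tap": "click", "swipe": "scroll", "input_text": "type",
                "navigate_home": "press_home", "navigate_back": "press_back"},
    "web": {"left_click": "click", "wheel": "scroll", "fill": "type",
            "press_enter": "enter"},
    "windows": {"left_click": "click", "wheel": "scroll",
                "keyboard_input": "type"},
}


def to_unified(platform, raw_action, point=None, input_text="no input text"):
    """Convert one platform-specific action record into the unified schema."""
    # Names that are already unified pass through unchanged
    name = PLATFORM_ALIASES.get(platform, {}).get(raw_action, raw_action)
    if name not in UNIFIED_ACTIONS:
        raise ValueError(f"unsupported action {raw_action!r} on {platform!r}")
    return {
        "action": name,
        "point": list(point) if point is not None else None,
        "input_text": input_text,
    }


print(to_unified("android", "tap", (120, 300)))
print(to_unified("web", "fill", (40, 88), "hello"))
```

With every sample in the same schema, an Android tap and a web left-click become the same training target, which is what makes joint training across platforms straightforward.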

Code Example

# Example prompt for a GUI-R1-style unified action space
system_prompt = """
You are a GUI Agent. Given a UI screenshot, action history, and a high-level task,
predict the next action to perform.

Output format:
<think>
[reasoning about current UI state and next step]
</think>
<answer>
[{'action': '<click|scroll|type|enter|select|press_home|press_back|complete|close>',
  'point': [x, y],
  'input_text': 'text if needed or no input text [default]'}]
</answer>
"""

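# Reward function example (Python)
# has_valid_format, point_in_bbox, and f1_score are helper functions
# assumed to be defined elsewhere.
def compute_reward(pred, gt, alpha=0.2, beta=0.8):
    # Format reward: 1.0 when the prediction follows the required output format
    rf = 1.0 if has_valid_format(pred) else 0.0

    # Accuracy rewards: action type, click point, and input text,
    # each checked by a simple rule
    r_act = 1.0 if pred['action'] == gt['action'] else 0.0
    r_point = 1.0 if point_in_bbox(pred['point'], gt['bbox']) else 0.0
    r_text = 1.0 if f1_score(pred['input_text'], gt['input_text']) > 0.5 else 0.0

    # Total reward: weighted sum of the format reward and the summed accuracy rewards
    r_acc = r_act + r_point + r_text
    return alpha * rf + beta * r_acc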
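
The reward example above calls three helpers that it does not define. One possible implementation is sketched below, under stated assumptions: the format check validates an already-parsed prediction dict, the bounding box is assumed to be `[x_min, y_min, x_max, y_max]` with inclusive edges, and the F1 score is computed over lowercased whitespace tokens. None of these details are confirmed by the source.

```python
from collections import Counter

ALLOWED_ACTIONS = {
    "click", "scroll", "type", "enter", "select",
    "press_home", "press_back", "complete", "close",
}


def has_valid_format(pred):
    """True if pred is a dict with a known action, an [x, y] point, and a text field."""
    if not isinstance(pred, dict):
        return False
    point = pred.get("point")
    return (
        pred.get("action") in ALLOWED_ACTIONS
        and isinstance(point, (list, tuple))
        and len(point) == 2
        and all(isinstance(v, (int, float)) for v in point)
        and isinstance(pred.get("input_text"), str)
    )


def point_in_bbox(point, bbox):
    """True if (x, y) lies inside bbox = [x_min, y_min, x_max, y_max] (assumed layout)."""
    x, y = point
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max


def f1_score(pred_text, gt_text):
    """Token-level F1 between two strings (whitespace tokens, case-insensitive)."""
    pred_tokens = pred_text.lower().split()
    gt_tokens = gt_text.lower().split()
    if not pred_tokens or not gt_tokens:
        # Two empty strings count as a match; one empty string does not
        return float(pred_tokens == gt_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

Each helper is a pure rule check with no learned component, which matches the idea of verifiable, rule-based rewards described in Core Mechanics.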

Terminology

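SFT: Supervised Fine-Tuning. A training method in which the model is shown reference answers and learns to imitate them. It is like copying from a model answer at school: more examples help, but the model is weak in situations the examples do not cover.

RFT: Reinforcement Fine-Tuning. The model tries several answers, and good answers are rewarded so that it improves on its own. It learns by trial and error.

GRPO: Group Relative Policy Optimization. A reinforcement learning algorithm that compares multiple responses to the same prompt against one another and scores each response relative to the rest of the group.

LVLM: Large Vision-Language Model. A large model that understands images and text together, such as GPT-4V or Qwen2.5-VL.

GUI Grounding: The ability to find the exact location of a specific UI element (a button, a text field, and so on) on screen. Given 'click the search button', it means locating that button on the screen as coordinates.

OOD: Out-Of-Distribution. New situations not seen during training. For example, if a model trained only on Android also works well on a web UI it has never seen, it has good OOD generalization.

Unified action space: A common action set such as click/scroll/type that unifies the different ways of operating multiple platforms (Windows, Android, Web, and so on). It works like a USB adapter that fits a variety of plugs to one socket.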

Related Resources

https://github.com/ritzz-ai/GUI-R1.git

Original Abstract

Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.