GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
TL;DR Highlight
An RL-trained GUI automation agent that beats existing SOTA SFT approaches while using only 0.02% of their training data (3K vs. 13M samples).
Who Should Read
ML engineers building or evaluating GUI automation agents that see screens and perform clicks/typing. Developers looking to improve data efficiency and reduce fine-tuning costs for LLM agents.
Core Mechanics
- SFT (training on correct demonstrations) demands massive data (13M samples for OS-Atlas), while RFT (RL-based fine-tuning) achieves better performance with just 3K samples
- Actions across 5 platforms (Windows/Linux/MacOS/Android/Web) are unified into a single action space, enabling conflict-free cross-platform training
- Reward function decomposes into 3 granular components: action type, click point, and input text — enabling fine-grained credit assignment
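The unified action space can be made concrete with a small normalization layer. This is a minimal sketch: the platform-specific alias names below are illustrative assumptions, not the paper's exact vocabulary, while the unified action names come from the prompt format shown later in this post.

```python
# Unified action vocabulary (from the GUI-R1 prompt format)
UNIFIED_ACTIONS = {"click", "scroll", "type", "enter", "select",
                   "press_home", "press_back", "complete", "close"}

# Hypothetical per-platform aliases mapped into the unified space
PLATFORM_ALIASES = {
    "tap": "click",          # Android
    "left_click": "click",   # Windows/Linux/MacOS desktop
    "swipe": "scroll",       # mobile
    "input": "type",
    "key_enter": "enter",
}

def to_unified(action_name: str) -> str:
    """Normalize a platform-specific action name to the unified space."""
    name = action_name.lower()
    name = PLATFORM_ALIASES.get(name, name)
    if name not in UNIFIED_ACTIONS:
        raise ValueError(f"unmapped action: {action_name}")
    return name
```

Normalizing labels like this before joint training is what avoids the cross-platform label conflicts mentioned above.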
Evidence
- ScreenSpot, 3B scale: SFT baseline 63.55 → GUI-R1-3B 80.08 (+26.0% relative); ScreenSpot-Pro: 13.80 → 25.23 (+82.8% relative)
- Low-level task overall success rate: QwenVL2.5-3B 55.65 → GUI-R1-3B 80.88 (same 3B scale)
- High-level tasks (GUI-Odyssey step success rate): meaningful improvement over SFT baselines
How to Apply
- If building a GUI automation agent, skip large-scale SFT data collection — instead filter ~3K high-quality samples and train with GRPO-based RFT.
- For multi-platform support, don't split into per-platform models. Design a unified action space (click/scroll/type/enter) to avoid data conflicts during joint training.
- Apply the 3-component reward function (action type + click point + text input) for granular RL feedback instead of binary success/fail.
Code Example
# GUI-R1 style unified action space prompt example
system_prompt = """
You are a GUI Agent. Given a UI screenshot, action history, and a high-level task,
predict the next action to perform.
Output format:
<think>
[reasoning about current UI state and next step]
</think>
<answer>
[{'action': '<click|scroll|type|enter|select|press_home|press_back|complete|close>',
'point': [x, y],
'input_text': 'text if needed or no input text [default]'}]
</answer>
"""
# Reward function example (Python); has_valid_format, point_in_bbox,
# and f1_score are helpers assumed to be defined elsewhere
def compute_reward(pred, gt, alpha=0.2, beta=0.8):
    # Format reward: did the response follow the <think>/<answer> template?
    rf = 1.0 if has_valid_format(pred) else 0.0
    # Accuracy rewards: action type, click point, and input text
    r_act = 1.0 if pred['action'] == gt['action'] else 0.0
    r_point = 1.0 if point_in_bbox(pred['point'], gt['bbox']) else 0.0
    r_text = 1.0 if f1_score(pred['input_text'], gt['input_text']) > 0.5 else 0.0
    r_acc = r_act + r_point + r_text
    return alpha * rf + beta * r_acc
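These per-sample rewards feed into GRPO, which scores each of a group of sampled responses relative to the group rather than against a learned value function. A minimal sketch of the group-relative advantage (the epsilon value is my own choice, not from the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Responses rewarded above the group mean get positive advantages and are reinforced; below-mean responses are suppressed, with no critic network needed.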
Original Abstract
Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.