GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
TL;DR Highlight
An RL-trained GUI automation agent that beats existing SOTA SFT approaches while using only 0.02% of their training data (3K vs. 13M samples).
Who Should Read
ML engineers building or evaluating GUI automation agents that see screens and perform clicks/typing. Developers looking to improve data efficiency and reduce fine-tuning costs for LLM agents.
Core Mechanics
- SFT (training on correct demonstrations) demands massive data (13M samples for OS-Atlas), while RFT (RL-based fine-tuning) achieves better performance with just 3K samples
- Actions across 5 platforms (Windows/Linux/MacOS/Android/Web) are unified into a single action space, enabling conflict-free cross-platform training
- Reward function decomposes into 3 granular components: action type, click point, and input text — enabling fine-grained credit assignment
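The unified action space can be made concrete with a small normalization layer. This is a minimal sketch: the platform-specific alias names below are illustrative assumptions, not the paper's exact vocabulary, while the unified action names come from the prompt format shown later in this post.

```python
# Unified action vocabulary (from the GUI-R1 prompt format)
UNIFIED_ACTIONS = {"click", "scroll", "type", "enter", "select",
                   "press_home", "press_back", "complete", "close"}

# Hypothetical per-platform aliases mapped into the unified space
PLATFORM_ALIASES = {
    "tap": "click",          # Android
    "left_click": "click",   # Windows/Linux/MacOS desktop
    "swipe": "scroll",       # mobile
    "input": "type",
    "key_enter": "enter",
}

def to_unified(action_name: str) -> str:
    """Normalize a platform-specific action name to the unified space."""
    name = action_name.lower()
    name = PLATFORM_ALIASES.get(name, name)
    if name not in UNIFIED_ACTIONS:
        raise ValueError(f"unmapped action: {action_name}")
    return name
```

Normalizing labels like this before joint training is what avoids the cross-platform label conflicts mentioned above.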
Evidence
- ScreenSpot, 3B scale: SFT baseline 63.55 → GUI-R1-3B 80.08 (+26.0% relative); ScreenSpot-Pro: 13.80 → 25.23 (+82.8% relative)
- Low-level task overall success rate: QwenVL2.5-3B 55.65 → GUI-R1-3B 80.88 (same 3B scale)
- High-level tasks (GUI-Odyssey step success rate): meaningful improvement over SFT baselines
How to Apply
- If building a GUI automation agent, skip large-scale SFT data collection — instead filter ~3K high-quality samples and train with GRPO-based RFT.
- For multi-platform support, don't split into per-platform models. Design a unified action space (click/scroll/type/enter) to avoid data conflicts during joint training.
- Apply the 3-component reward function (action type + click point + text input) for granular RL feedback instead of binary success/fail.
Code Example
# GUI-R1 style unified action space prompt example
system_prompt = """
You are a GUI Agent. Given a UI screenshot, action history, and a high-level task,
predict the next action to perform.
Output format:
<think>
[reasoning about current UI state and next step]
</think>
<answer>
[{'action': '<click|scroll|type|enter|select|press_home|press_back|complete|close>',
'point': [x, y],
'input_text': 'text if needed or no input text [default]'}]
</answer>
"""
# Reward function example (Python); has_valid_format, point_in_bbox,
# and f1_score are helpers assumed to be defined elsewhere
def compute_reward(pred, gt, alpha=0.2, beta=0.8):
    # Format reward: did the response follow the <think>/<answer> template?
    rf = 1.0 if has_valid_format(pred) else 0.0
    # Accuracy rewards: action type, click point, and input text
    r_act = 1.0 if pred['action'] == gt['action'] else 0.0
    r_point = 1.0 if point_in_bbox(pred['point'], gt['bbox']) else 0.0
    r_text = 1.0 if f1_score(pred['input_text'], gt['input_text']) > 0.5 else 0.0
    r_acc = r_act + r_point + r_text
    return alpha * rf + beta * r_acc
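These per-sample rewards feed into GRPO, which scores each of a group of sampled responses relative to the group rather than against a learned value function. A minimal sketch of the group-relative advantage (the epsilon value is my own choice, not from the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Responses rewarded above the group mean get positive advantages and are reinforced; below-mean responses are suppressed, with no critic network needed.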
Original Abstract
Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.