Can RL Improve Generalization of LLM Agents? An Empirical Study
TL;DR Highlight
RFT-trained LLM agents generalize well within the same environment but transfer to new environments is limited — sequential multi-environment training may be the solution.
Who Should Read
ML engineers and researchers looking to deploy LLM-based agents in real-world services, especially developers who want agents to work well on new tasks beyond their training environment.
Core Mechanics
- Within the same environment, RFT shows strong generalization — even training only on easy tasks yields significant gains on hard tasks (+60.1 points on WebShop for the 7B model)
- Easy→Hard curriculum learning is more effective than single-difficulty training — Ueasy+Uhard combination in BabyAI gives up to 3.3 additional points vs training on either alone
- Cross-environment transfer averages only +3–4 points — a very large gap between held-in and held-out performance
- Training only in environments that provide a valid-action list at every step (like BabyAI) actually hurts transfer — the 7B model drops 28.59→10.25 on WebShop
- Sequential multi-environment training (Sequential RFT) improves new environment performance without catastrophic forgetting — WebShop→TextCraft sequential training: TextCraft 80.88→82.50, WebShop maintained at 86.5→86.32
- Confirmation bias (overconfidence errors) is confirmed as the most common failure pattern, occurring in over 10% of cases across all environments
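The sequential scheme in the last bullets can be sketched as a loop that fine-tunes on one environment at a time and re-evaluates everything after each stage. `train_rft` and `evaluate` below are hypothetical stand-ins, not the AgentGym-RL API, and the scores are toy values chosen to mirror the held-in/held-out gap:

```python
# Toy sketch of sequential multi-environment RFT with held-in/held-out
# tracking; train_rft and evaluate are hypothetical stand-ins for the
# framework's real training and evaluation calls.

def train_rft(model, env):
    # Placeholder "fine-tuning": record the environment in the model.
    return model | {env}

def evaluate(model, env):
    # Placeholder scoring: trained environments score high, unseen ones
    # low, mirroring the strong held-in / weak held-out gap above.
    return 1.0 if env in model else 0.1

envs = ["webshop", "textcraft", "alfworld"]
model = set()           # toy "model" = set of environments trained on
history = {}
for env in envs:
    model = train_rft(model, env)
    # Re-evaluate every environment after each stage: earlier entries
    # should stay high (no catastrophic forgetting).
    history[env] = {e: evaluate(model, e) for e in envs}

print(history["webshop"])   # only webshop scores high after stage 1
print(history["alfworld"])  # all three score high after the last stage
```

The point of re-scoring all environments at every stage is that forgetting shows up as a previously high entry dropping, which is exactly the check the paper's sequential-RFT results (e.g. WebShop holding 86.5→86.32) rely on.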
Evidence
- Held-in environment: AlfWorld 3B/7B models improve by +78.62/+65.44 points, while held-out environments average only +3.32/+3.44 points — a stark contrast
- Qwen2.5-7B-Instruct: after training on BabyAI, the held-out average falls by 3.23 points and WebShop drops 28.59→10.25
- After RFT training in BabyAI, average interaction turns drop 10.76→4.19, average tokens 624.58→160.60 (about 74% reduction)
- Five-environment sequential training approaches the level of joint training (all data mixed) and is relatively insensitive to training order
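The efficiency figures in the bullet above follow directly from the reported before/after averages:

```python
# Verify the reported efficiency gains from RFT in BabyAI.
turns_before, turns_after = 10.76, 4.19
tokens_before, tokens_after = 624.58, 160.60

turn_reduction = 1 - turns_after / turns_before
token_reduction = 1 - tokens_after / tokens_before

print(f"turns:  {turn_reduction:.1%} fewer")   # ~61.1% fewer
print(f"tokens: {token_reduction:.1%} fewer")  # ~74.3% fewer
```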
How to Apply
- Before deploying an agent to a new environment, apply Easy→Hard curriculum learning in the existing environment — it delivers higher performance from the same data than single-difficulty training
- When building a general-purpose agent for multiple environments, instead of mixing all environment data at once, run RFT sequentially starting from the most similar environment, to progressively expand capability without forgetting
- Training exclusively in environments that provide valid-action lists at every step (BabyAI-style) can hurt performance in other environments — when training general-purpose agents, schedule such environments last or reduce their weight
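The last two recommendations can be combined into a simple schedule builder. The similarity scores and environment flags below are illustrative placeholders, not values from the paper:

```python
# Sketch: build a sequential RFT schedule that (a) starts from the
# environment most similar to the target and (b) pushes environments
# that expose a valid-action list at every step to the end.

def build_schedule(envs, similarity, provides_action_list):
    # Sort by similarity to the target (descending), then demote
    # action-list environments to the tail of the schedule.
    ordered = sorted(envs, key=lambda e: -similarity[e])
    return ([e for e in ordered if not provides_action_list[e]]
            + [e for e in ordered if provides_action_list[e]])

envs = ["babyai", "webshop", "textcraft"]
similarity = {"webshop": 0.9, "babyai": 0.8, "textcraft": 0.6}  # placeholder scores
provides_action_list = {"babyai": True, "webshop": False, "textcraft": False}

print(build_schedule(envs, similarity, provides_action_list))
# ['webshop', 'textcraft', 'babyai']
```

Down-weighting (rather than demoting) action-list environments would be an alternative design, e.g. scaling their sample count in a mixture.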
Code Example
# Sequential RFT training example with AgentGym-RL framework
# https://github.com/woooodyy/AgentGym-RL
# Step 1: Train on easy tasks first (WebShop easy)
python train.py \
--model Qwen2.5-7B-Instruct \
--env webshop \
--difficulty easy \
--n_samples 8 \
--max_response_length 8192 \
--max_turns 10 \
--algorithm GRPO
# Step 2: Curriculum learning on hard tasks (WebShop hard)
python train.py \
--model outputs/webshop_easy_checkpoint \
--env webshop \
--difficulty hard \
--n_samples 8
# Step 3: Sequential transfer learning to a new environment (TextCraft)
python train.py \
--model outputs/webshop_all_checkpoint \
--env textcraft \
--n_samples 8 \
--max_turns 15
# Evaluation: Check both held-in / held-out performance
python evaluate.py \
--model outputs/sequential_checkpoint \
--envs webshop searchqa textcraft alfworld babyai \
--metric avg@8 \
--max_turns 20
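For reference, `avg@8` in the evaluation step is presumably the per-task success rate averaged over 8 rollouts, then averaged over tasks; a minimal sketch under that reading:

```python
# avg@k under one common reading: average success over k rollouts per
# task, then average over tasks (illustrative, not the paper's code).
def avg_at_k(results, k=8):
    # results: {task_id: [0/1 success for each of k rollouts]}
    assert all(len(r) == k for r in results.values())
    return sum(sum(r) / k for r in results.values()) / len(results)

print(avg_at_k({"t1": [1, 1, 0, 1, 1, 1, 0, 1],
                "t2": [0, 1, 1, 1, 0, 1, 1, 1]}))  # 0.75
```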
Original Abstract
Reinforcement fine-tuning (RFT) has shown promise for training LLM agents to perform multi-turn decision-making based on environment feedback. However, most existing evaluations remain largely in-domain: training and testing are conducted in the same environment or even on the same tasks. In real-world deployment, agents may operate in unseen environments with different background knowledge, observation spaces, and action interfaces. To characterize the generalization profile of RFT under such shifts, we conduct a systematic study along three axes: (1) within-environment generalization across task difficulty, (2) cross-environment transfer to unseen environments, and (3) sequential multi-environment training to quantify transfer and forgetting. Our results show that RFT generalizes well across task difficulty within an environment, but exhibits weaker transfer to unseen environments, which correlates with shifts in both semantic priors and observation/action interfaces. In contrast, sequential training yields promising downstream gains with minimal upstream forgetting, and mixture training across environments improves the overall balance. We further provide detailed analyses and deeper insights, and hope our work helps the community develop and deploy generalizable LLM agents.