Can RL Improve Generalization of LLM Agents? An Empirical Study
TL;DR Highlight
RFT-trained LLM agents generalize well within the same environment but transfer to new environments is limited — sequential multi-environment training may be the solution.
Who Should Read
ML engineers and researchers looking to deploy LLM-based agents in real-world services, especially developers who want agents to work well on new tasks beyond their training environment.
Core Mechanics
- Within the same environment, RFT shows strong generalization — even training only on easy tasks yields significant gains on hard tasks (+60.1 points on WebShop for the 7B model)
- Easy→Hard curriculum learning is more effective than single-difficulty training — Ueasy+Uhard combination in BabyAI gives up to 3.3 additional points vs training on either alone
- Cross-environment transfer averages only +3–4 points — a very large gap between held-in and held-out performance
- Training only in environments that provide a valid-action list at every step (like BabyAI) actually hurts transfer — the 7B model drops 28.59→10.25 on WebShop
- Sequential multi-environment training (Sequential RFT) improves new environment performance without catastrophic forgetting — WebShop→TextCraft sequential training: TextCraft 80.88→82.50, WebShop maintained at 86.5→86.32
- Confirmation bias (overconfidence errors) is confirmed as the most common failure pattern, occurring in over 10% of cases across all environments
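The sequential scheme in the last bullets can be sketched as a loop that fine-tunes on one environment at a time and re-evaluates everything after each stage. `train_rft` and `evaluate` below are hypothetical stand-ins, not the AgentGym-RL API, and the scores are toy values chosen to mirror the held-in/held-out gap:

```python
# Toy sketch of sequential multi-environment RFT with held-in/held-out
# tracking; train_rft and evaluate are hypothetical stand-ins for the
# framework's real training and evaluation calls.

def train_rft(model, env):
    # Placeholder "fine-tuning": record the environment in the model.
    return model | {env}

def evaluate(model, env):
    # Placeholder scoring: trained environments score high, unseen ones
    # low, mirroring the strong held-in / weak held-out gap above.
    return 1.0 if env in model else 0.1

envs = ["webshop", "textcraft", "alfworld"]
model = set()           # toy "model" = set of environments trained on
history = {}
for env in envs:
    model = train_rft(model, env)
    # Re-evaluate every environment after each stage: earlier entries
    # should stay high (no catastrophic forgetting).
    history[env] = {e: evaluate(model, e) for e in envs}

print(history["webshop"])   # only webshop scores high after stage 1
print(history["alfworld"])  # all three score high after the last stage
```

The point of re-scoring all environments at every stage is that forgetting shows up as a previously high entry dropping, which is exactly the check the paper's sequential-RFT results (e.g. WebShop holding 86.5→86.32) rely on.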
Evidence
- Held-in environment: AlfWorld 3B/7B models improve by +78.62/+65.44 points, while held-out environments average only +3.32/+3.44 points — a stark contrast
- Qwen2.5-7B-Instruct: after training on BabyAI, the held-out average falls by 3.23 points and WebShop drops 28.59→10.25
- After RFT training in BabyAI, average interaction turns drop 10.76→4.19, average tokens 624.58→160.60 (about 74% reduction)
- Five-environment sequential training approaches the level of joint training (all data mixed) and is relatively insensitive to training order
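The efficiency figures in the bullet above follow directly from the reported before/after averages:

```python
# Verify the reported efficiency gains from RFT in BabyAI.
turns_before, turns_after = 10.76, 4.19
tokens_before, tokens_after = 624.58, 160.60

turn_reduction = 1 - turns_after / turns_before
token_reduction = 1 - tokens_after / tokens_before

print(f"turns:  {turn_reduction:.1%} fewer")   # ~61.1% fewer
print(f"tokens: {token_reduction:.1%} fewer")  # ~74.3% fewer
```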
How to Apply
- Before deploying an agent to a new environment, apply Easy→Hard curriculum learning in the existing environment — it delivers higher performance from the same data than single-difficulty training
- When building a general-purpose agent for multiple environments, instead of mixing all environment data at once, run RFT sequentially starting from the most similar environment, to progressively expand capability without forgetting
- Training exclusively in environments that provide valid-action lists at every step (BabyAI-style) can hurt performance in other environments — when training general-purpose agents, schedule such environments last or reduce their weight
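The last two recommendations can be combined into a simple schedule builder. The similarity scores and environment flags below are illustrative placeholders, not values from the paper:

```python
# Sketch: build a sequential RFT schedule that (a) starts from the
# environment most similar to the target and (b) pushes environments
# that expose a valid-action list at every step to the end.

def build_schedule(envs, similarity, provides_action_list):
    # Sort by similarity to the target (descending), then demote
    # action-list environments to the tail of the schedule.
    ordered = sorted(envs, key=lambda e: -similarity[e])
    return ([e for e in ordered if not provides_action_list[e]]
            + [e for e in ordered if provides_action_list[e]])

envs = ["babyai", "webshop", "textcraft"]
similarity = {"webshop": 0.9, "babyai": 0.8, "textcraft": 0.6}  # placeholder scores
provides_action_list = {"babyai": True, "webshop": False, "textcraft": False}

print(build_schedule(envs, similarity, provides_action_list))
# ['webshop', 'textcraft', 'babyai']
```

Down-weighting (rather than demoting) action-list environments would be an alternative design, e.g. scaling their sample count in a mixture.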
Code Example
# Sequential RFT training example with AgentGym-RL framework
# https://github.com/woooodyy/AgentGym-RL
# Step 1: Train on easy tasks first (WebShop easy)
python train.py \
--model Qwen2.5-7B-Instruct \
--env webshop \
--difficulty easy \
--n_samples 8 \
--max_response_length 8192 \
--max_turns 10 \
--algorithm GRPO
# Step 2: Curriculum learning on hard tasks (WebShop hard)
python train.py \
--model outputs/webshop_easy_checkpoint \
--env webshop \
--difficulty hard \
--n_samples 8
# Step 3: Sequential transfer learning to a new environment (TextCraft)
python train.py \
--model outputs/webshop_all_checkpoint \
--env textcraft \
--n_samples 8 \
--max_turns 15
# Evaluation: Check both held-in / held-out performance
python evaluate.py \
--model outputs/sequential_checkpoint \
--envs webshop searchqa textcraft alfworld babyai \
--metric avg@8 \
--max_turns 20
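For reference, `avg@8` in the evaluation step is presumably the per-task success rate averaged over 8 rollouts, then averaged over tasks; a minimal sketch under that reading:

```python
# avg@k under one common reading: average success over k rollouts per
# task, then average over tasks (illustrative, not the paper's code).
def avg_at_k(results, k=8):
    # results: {task_id: [0/1 success for each of k rollouts]}
    assert all(len(r) == k for r in results.values())
    return sum(sum(r) / k for r in results.values()) / len(results)

print(avg_at_k({"t1": [1, 1, 0, 1, 1, 1, 0, 1],
                "t2": [0, 1, 1, 1, 0, 1, 1, 1]}))  # 0.75
```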
Original Abstract
Reinforcement fine-tuning (RFT) has shown promise for training LLM agents to perform multi-turn decision-making based on environment feedback. However, most existing evaluations remain largely in-domain: training and testing are conducted in the same environment or even on the same tasks. In real-world deployment, agents may operate in unseen environments with different background knowledge, observation spaces, and action interfaces. To characterize the generalization profile of RFT under such shifts, we conduct a systematic study along three axes: (1) within-environment generalization across task difficulty, (2) cross-environment transfer to unseen environments, and (3) sequential multi-environment training to quantify transfer and forgetting. Our results show that RFT generalizes well across task difficulty within an environment, but exhibits weaker transfer to unseen environments, which correlates with shifts in both semantic priors and observation/action interfaces. In contrast, sequential training yields promising downstream gains with minimal upstream forgetting, and mixture training across environments improves the overall balance. We further provide detailed analyses and deeper insights, and hope our work helps the community develop and deploy generalizable LLM agents.