On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
TL;DR Highlight
Analysis of the 'self-locking' phenomenon in RL-trained LLM agents that stop asking informative questions and fail to internalize obtained information, plus a simple directional signal injection fix that improves task success by up to 60%.
Who Should Read
Researchers training RL-based LLM agents for information-seeking or question-answering tasks, and teams debugging why their RL agents become passive or stop exploring.
Core Mechanics
- Identified and named the 'self-locking' phenomenon: RL-trained agents converge to a policy of avoiding questions and failing to actively gather needed information
- Root cause analysis traces self-locking to a reward-driven feedback loop: the agent learns that not asking is safer than risking an uninformative question, and the resulting under-exploration further stalls its question-asking and belief-tracking abilities
- Proposed a directional signal injection method that adds a lightweight directional term to the per-step advantage, rewarding informative questions and penalizing uninformative ones
- The fix does not require retraining from scratch; it can be applied as a regularization term during existing RL training
- Self-locking is not unique to one model family; appears across multiple RL-trained agent architectures
- The directional signal injection improved task success rates by up to 60% on information-seeking benchmarks
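The reward imbalance behind self-locking can be seen in a toy illustration (not the paper's setup; the reward values are assumptions): under a softmax policy over {ask, answer}, if asking has negative expected reward relative to answering directly, following the expected policy gradient monotonically suppresses asking.

```python
import math

# Toy illustration (assumed rewards, not from the paper): a 2-action
# softmax policy over {ask, answer}. Asking yields +1 with probability
# 0.4 and -1 otherwise (expected reward -0.2); answering yields 0.
# Following the *expected* policy gradient collapses p(ask) toward 0.
def expected_ask_prob(theta: float) -> float:
    """p(ask) under a single-logit softmax (i.e. a sigmoid)."""
    return 1.0 / (1.0 + math.exp(-theta))

def simulate_collapse(steps: int = 200, lr: float = 1.0) -> list[float]:
    r_ask, r_answer = -0.2, 0.0   # assumed expected rewards
    theta = 0.0                   # start at p(ask) = 0.5
    history = []
    for _ in range(steps):
        p = expected_ask_prob(theta)
        # exact gradient of E[R] w.r.t. theta for a 2-action softmax
        grad = p * (1 - p) * (r_ask - r_answer)
        theta += lr * grad        # gradient ascent on expected reward
        history.append(expected_ask_prob(theta))
    return history

probs = simulate_collapse()
# p(ask) decays monotonically from 0.5 toward 0: the agent "locks"
# into never asking, even though some questions are informative.
```

Because the gradient is strictly negative for any p in (0, 1), no amount of further training escapes the low-information regime without an extra signal, which is what the directional injection supplies.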
Evidence
- Directional signal injection improved performance by up to 60% on information-seeking tasks
- Self-locking phenomenon confirmed across multiple model families and RL training setups
- Analysis of reward distributions shows the exact mechanism causing passive behavior
- Ablation studies confirm that the directional signal is the key ingredient for the fix
How to Apply
- If your RL-trained agent seems overly passive or stops asking clarifying questions, check for self-locking by analyzing its action distribution over training
- Add the directional signal injection as an auxiliary term during RL training (a reweighting of the per-step advantages); the implementation is straightforward and requires no architectural changes
- Monitor the ratio of information-seeking actions to answer-giving actions as a diagnostic metric throughout training
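The diagnostic above can be sketched as a simple running metric. This is a minimal sketch, not the paper's tooling; the 'ask'/'answer' action labels are assumptions about how turns are logged.

```python
from collections import deque

# Minimal self-locking diagnostic sketch (action labels are assumed):
# track the fraction of information-seeking actions over a sliding
# window of recent turns. A ratio trending toward zero during RL
# training is a warning sign of self-locking.
class AskRatioMonitor:
    def __init__(self, window: int = 100):
        self.actions = deque(maxlen=window)

    def record(self, action_type: str) -> None:
        """action_type: 'ask' (information-seeking) or 'answer'."""
        self.actions.append(action_type)

    def ask_ratio(self) -> float:
        if not self.actions:
            return 0.0
        return sum(a == "ask" for a in self.actions) / len(self.actions)

monitor = AskRatioMonitor(window=4)
for a in ["ask", "answer", "answer", "answer"]:
    monitor.record(a)
print(monitor.ask_ratio())  # 0.25
```

Logging this ratio per training epoch makes the collapse visible early, before task success rates degrade.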
Code Example
# Core idea of AREW: advantage reweighting — only the advantage
# in an existing PPO pipeline needs to be modified.
import torch

def compute_arew_advantage(advantages, as_critiques, bt_critiques, lambda_weight=0.2):
    """
    advantages:   per-step advantage tensor computed with existing GAE
    as_critiques: critique at each Action Round (+1: informative, -1: uninformative, 0: abstain)
    bt_critiques: critique at each Update Round (+1: belief improved, -1: degraded, 0: abstain)
    """
    # Interleave AS/BT critiques: even turns are Action Rounds, odd turns Update Rounds
    critiques = [
        as_c if t % 2 == 0 else bt_c
        for t, (as_c, bt_c) in enumerate(zip(as_critiques, bt_critiques))
    ]
    critiques = torch.tensor(critiques, dtype=torch.float)

    # Separate positive/negative indices; clamp guards against division by zero
    pos_mask = critiques == 1
    neg_mask = critiques == -1
    n_pos = pos_mask.sum().clamp(min=1)
    n_neg = neg_mask.sum().clamp(min=1)

    # Directional weight u_t: +1/n_pos for positive critiques, -1/n_neg for negative
    u = torch.zeros_like(critiques)
    u[pos_mask] = 1.0 / n_pos
    u[neg_mask] = -1.0 / n_neg

    # Advantage reweighting: A_hat_t = A_t + lambda * u_t
    return advantages + lambda_weight * u

# In the actual RL training loop:
# adj_adv = compute_arew_advantage(advantages, as_critiques, bt_critiques)
# loss = -torch.mean(log_probs * adj_adv)  # policy gradient loss
Original Abstract
Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent's belief based on collected evidence. We show that deficient AS and BT capabilities will limit the information exploration during RL training. Furthermore, insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques to help the agent escape self-locking. Extensive experiments with 7 datasets show that our approach significantly mitigates the information self-locking, bringing up to 60% improvements.