On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
TL;DR Highlight
Analysis of the 'self-locking' phenomenon in RL-trained LLM agents that stop asking informative questions and fail to internalize obtained information, plus a simple directional signal injection fix that improves task success by up to 60%.
Who Should Read
Researchers training RL-based LLM agents for information-seeking or question-answering tasks, and teams debugging why their RL agents become passive or stop exploring.
Core Mechanics
- Identified and named the 'self-locking' phenomenon: RL-trained agents converge to a policy of avoiding questions and failing to actively gather needed information
- Root cause analysis traces self-locking to a reward-driven feedback loop: the agent learns that not asking is safer than risking an uninformative question, and the resulting under-exploration further stalls its question-asking and belief-tracking abilities
- Proposed a directional signal injection method that adds a lightweight directional term to the per-step advantage, rewarding informative questions and penalizing uninformative ones
- The fix does not require retraining from scratch; it can be applied as a regularization term during existing RL training
- Self-locking is not unique to one model family; appears across multiple RL-trained agent architectures
- The directional signal injection improved task success rates by up to 60% on information-seeking benchmarks
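The reward imbalance behind self-locking can be seen in a toy illustration (not the paper's setup; the reward values are assumptions): under a softmax policy over {ask, answer}, if asking has negative expected reward relative to answering directly, following the expected policy gradient monotonically suppresses asking.

```python
import math

# Toy illustration (assumed rewards, not from the paper): a 2-action
# softmax policy over {ask, answer}. Asking yields +1 with probability
# 0.4 and -1 otherwise (expected reward -0.2); answering yields 0.
# Following the *expected* policy gradient collapses p(ask) toward 0.
def expected_ask_prob(theta: float) -> float:
    """p(ask) under a single-logit softmax (i.e. a sigmoid)."""
    return 1.0 / (1.0 + math.exp(-theta))

def simulate_collapse(steps: int = 200, lr: float = 1.0) -> list[float]:
    r_ask, r_answer = -0.2, 0.0   # assumed expected rewards
    theta = 0.0                   # start at p(ask) = 0.5
    history = []
    for _ in range(steps):
        p = expected_ask_prob(theta)
        # exact gradient of E[R] w.r.t. theta for a 2-action softmax
        grad = p * (1 - p) * (r_ask - r_answer)
        theta += lr * grad        # gradient ascent on expected reward
        history.append(expected_ask_prob(theta))
    return history

probs = simulate_collapse()
# p(ask) decays monotonically from 0.5 toward 0: the agent "locks"
# into never asking, even though some questions are informative.
```

Because the gradient is strictly negative for any p in (0, 1), no amount of further training escapes the low-information regime without an extra signal, which is what the directional injection supplies.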
Evidence
- Directional signal injection improved performance by up to 60% on information-seeking tasks
- Self-locking phenomenon confirmed across multiple model families and RL training setups
- Analysis of reward distributions shows the exact mechanism causing passive behavior
- Ablation studies confirm that the directional signal is the key ingredient for the fix
How to Apply
- If your RL-trained agent seems overly passive or stops asking clarifying questions, check for self-locking by analyzing its action distribution over training
- Add the directional signal injection as an auxiliary term during RL training (a reweighting of the per-step advantages); the implementation is straightforward and requires no architectural changes
- Monitor the ratio of information-seeking actions to answer-giving actions as a diagnostic metric throughout training
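The diagnostic above can be sketched as a simple running metric. This is a minimal sketch, not the paper's tooling; the 'ask'/'answer' action labels are assumptions about how turns are logged.

```python
from collections import deque

# Minimal self-locking diagnostic sketch (action labels are assumed):
# track the fraction of information-seeking actions over a sliding
# window of recent turns. A ratio trending toward zero during RL
# training is a warning sign of self-locking.
class AskRatioMonitor:
    def __init__(self, window: int = 100):
        self.actions = deque(maxlen=window)

    def record(self, action_type: str) -> None:
        """action_type: 'ask' (information-seeking) or 'answer'."""
        self.actions.append(action_type)

    def ask_ratio(self) -> float:
        if not self.actions:
            return 0.0
        return sum(a == "ask" for a in self.actions) / len(self.actions)

monitor = AskRatioMonitor(window=4)
for a in ["ask", "answer", "answer", "answer"]:
    monitor.record(a)
print(monitor.ask_ratio())  # 0.25
```

Logging this ratio per training epoch makes the collapse visible early, before task success rates degrade.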
Code Example
# Core idea of AREW: advantage reweighting — only the advantage
# in an existing PPO pipeline needs to be modified.
import torch

def compute_arew_advantage(advantages, as_critiques, bt_critiques, lambda_weight=0.2):
    """
    advantages:   per-step advantage tensor computed with existing GAE
    as_critiques: critique at each Action Round (+1: informative, -1: uninformative, 0: abstain)
    bt_critiques: critique at each Update Round (+1: belief improved, -1: degraded, 0: abstain)
    """
    # Interleave AS/BT critiques: even turns are Action Rounds, odd turns Update Rounds
    critiques = [
        as_c if t % 2 == 0 else bt_c
        for t, (as_c, bt_c) in enumerate(zip(as_critiques, bt_critiques))
    ]
    critiques = torch.tensor(critiques, dtype=torch.float)

    # Separate positive/negative indices; clamp guards against division by zero
    pos_mask = critiques == 1
    neg_mask = critiques == -1
    n_pos = pos_mask.sum().clamp(min=1)
    n_neg = neg_mask.sum().clamp(min=1)

    # Directional weight u_t: +1/n_pos for positive critiques, -1/n_neg for negative
    u = torch.zeros_like(critiques)
    u[pos_mask] = 1.0 / n_pos
    u[neg_mask] = -1.0 / n_neg

    # Advantage reweighting: A_hat_t = A_t + lambda * u_t
    return advantages + lambda_weight * u

# In the actual RL training loop:
# adj_adv = compute_arew_advantage(advantages, as_critiques, bt_critiques)
# loss = -torch.mean(log_probs * adj_adv)  # policy gradient loss
Original Abstract
Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent's belief based on collected evidence. We show that deficient AS and BT capabilities will limit the information exploration during RL training. Furthermore, insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques to help the agent escape self-locking. Extensive experiments with 7 datasets show that our approach significantly mitigates the information self-locking, bringing up to 60% improvements.