Toward Efficient Exploration by Large Language Model Agents
TL;DR Highlight
Don't ask LLMs to invent new exploration strategies. Instead, use LLMs as components to implement a decades-old RL algorithm, Posterior Sampling for Reinforcement Learning (PSRL), and exploration efficiency improves dramatically.
Who Should Read
AI engineers dealing with exploration-exploitation tradeoffs in LLM agent design. Developers building sequential decision-making systems for recommendations, customer service automation, or game AI.
Core Mechanics
- Existing LLM agents (Reflexion, ICRL (in-context RL), etc.) delegate the question of how to explore to the LLM itself, and this approach fails to explore adequately even in simple environments
- Instead, having separate LLMs handle each step of the proven PSRL algorithm dramatically improves exploration efficiency
- Three LLMs with divided roles: (1) posterior updating, (2) model sampling, (3) optimal action selection, orchestrated to follow the PSRL flow
- Priors can be specified in natural language, letting the LLM automatically track visit counts and update distributions
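For a Bernoulli bandit, the bookkeeping the posterior-update LLM performs in natural language reduces to a conjugate Beta update from visit counts. A minimal sketch of the exact computation it is meant to mirror (the helper name is illustrative, not from the paper):

```python
# Exact Beta-Bernoulli update that the posterior-update LLM is asked to
# track in natural language: one Beta(alpha, beta) belief per arm,
# incremented by observed successes and failures.
def update_beliefs(beliefs, observations):
    """beliefs: {arm: (alpha, beta)}; observations: list of (arm, reward) with reward in {0, 1}."""
    updated = dict(beliefs)
    for arm, reward in observations:
        a, b = updated[arm]
        updated[arm] = (a + reward, b + (1 - reward))
    return updated

beliefs = {arm: (1, 1) for arm in range(5)}          # Beta(1,1) priors
beliefs = update_beliefs(beliefs, [(0, 1), (0, 0), (2, 1)])
print(beliefs[0])  # (2, 2): one success and one failure on arm 0
```

Specifying the prior as "Beta(1,1) for each arm" in the prompt gives the LLM exactly this count-tracking structure to maintain.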
Evidence
- 5-armed Bernoulli bandit: LLM-based PSRL with κ_sampling = 1.2 achieved lower cumulative regret than classic Thompson Sampling (at T = 100)
- RiverSwim (3-state): GPT-4o-based PSRL showed linear regret, but switching to o1-mini matched classic PSRL performance
- Consistent improvements over LLM-native exploration strategies across multiple environments
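For reference, the classic Thompson Sampling baseline in the bandit comparison above can be sketched as follows. The arm means and seed here are illustrative, not the paper's experimental setup:

```python
import random

# Classic Thompson Sampling on a 5-armed Bernoulli bandit, tracking
# cumulative regret over T pulls. Arm means are illustrative only.
def thompson_sampling(means, T, seed=0):
    rng = random.Random(seed)
    alpha = [1.0] * len(means)   # Beta(1,1) prior per arm
    beta = [1.0] * len(means)
    best = max(means)
    regret, cumulative = 0.0, []
    for _ in range(T):
        # Sample one plausible mean per arm, then act greedily on the sample.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(len(means))]
        arm = samples.index(max(samples))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        regret += best - means[arm]
        cumulative.append(regret)
    return cumulative

regret = thompson_sampling([0.1, 0.2, 0.3, 0.4, 0.5], T=100)
print(round(regret[-1], 2))
```

For bandits, Thompson Sampling is the special case of PSRL, which is why it serves as the natural classical baseline.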
How to Apply
- Instead of prompting your LLM agent to 'explore', split posterior update/sampling/optimal action selection into separate LLM calls and orchestrate them following the PSRL algorithm flow.
- When specifying priors in natural language, mentioning statistical distribution names like 'Beta(1,1) distribution' helps the posterior update LLM automatically track visit counts.
- Use stronger reasoning models (o1-mini level) for the posterior update step — weaker models may not handle distribution tracking reliably.
Code Example
# LLM-based PSRL core loop (pseudocode)
SYSTEM_POSTERIOR_UPDATER = """
You are a Bayesian posterior distribution for a sequential decision-making problem.
Given a current prior belief and trajectory observation, produce the updated posterior
that accurately reflects knowledge about environment transitions and rewards.
Never discard prior knowledge — only update it.
Environment: {env_description}
"""
SYSTEM_POSTERIOR_SAMPLER = """
Given the current posterior belief, generate ONE plausible hypothesis
(a concrete sample) for how transitions and rewards work in this environment.
Your sample must be consistent with the posterior constraints.
Start with 'You think' and provide only the hypothesis.
"""
SYSTEM_OPTIMAL_POLICY = """
Environment: {env_description}
Always select optimal actions that maximize value according to this hypothesis:
{posterior_sample}
Just say the action after 'Action:' and nothing else.
"""
def llm_psrl_episode(prior, env, llm_updater, llm_sampler, llm_policy):
    # Step 1: Sample one hypothesis from current posterior
    posterior_sample = llm_sampler.call(
        system=SYSTEM_POSTERIOR_SAMPLER,
        user=f"Current posterior: {prior}"
    )
    # Step 2: Act optimally w.r.t. sampled hypothesis for entire episode
    trajectory = []
    state = env.reset()
    for step in range(env.horizon):
        action = llm_policy.call(
            system=SYSTEM_OPTIMAL_POLICY.format(
                env_description=env.description,
                posterior_sample=posterior_sample
            ),
            user=f"Current state: {state}"
        )
        next_state, reward = env.step(action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    # Step 3: Update posterior with full trajectory
    # (format_trajectory is an assumed helper that renders the trajectory as text)
    new_posterior = llm_updater.call(
        system=SYSTEM_POSTERIOR_UPDATER.format(
            env_description=env.description
        ),
        user=f"Prior: {prior}\nTrajectory: {format_trajectory(trajectory)}"
    )
    return new_posterior, trajectory
# Main loop over K episodes
# (env and the three LLM clients are assumed to be constructed elsewhere)
K = 20  # number of PSRL episodes (illustrative)
prior = "Beta(1,1) prior for each arm — all actions equally likely to be optimal"
for episode in range(K):
    prior, traj = llm_psrl_episode(prior, env, updater_llm, sampler_llm, policy_llm)
Original Abstract
A burgeoning area within reinforcement learning (RL) is the design of sequential decision-making agents centered around large language models (LLMs). While autonomous decision-making agents powered by modern LLMs could facilitate numerous real-world applications, such successes demand agents that are capable of data-efficient RL. One key obstacle to achieving data efficiency in RL is exploration, a challenge that we demonstrate many recent proposals for LLM agent designs struggle to contend with. Meanwhile, classic algorithms from the RL literature known to gracefully address exploration require technical machinery that can be challenging to operationalize in purely natural language settings. In this work, rather than relying on finetuning or in-context learning to coax LLMs into implicitly imitating a RL algorithm, we illustrate how LLMs can be used to explicitly implement an existing RL algorithm (Posterior Sampling for Reinforcement Learning) whose capacity for statistically-efficient exploration is already well-studied. We offer empirical results demonstrating how our LLM-based implementation of a known, data-efficient RL algorithm can be considerably more effective in natural language tasks that demand prudent exploration.