PSRL Implementation for Efficient Exploration by LLM Agents

Toward Efficient Exploration by Large Language Model Agents

Apr 29, 2025 • Dilip Arumugam, Thomas L. Griffiths

TL;DR Highlight

Instead of asking LLMs to invent new algorithms, use them to directly implement PSRL, a decades-old RL algorithm, and exploration efficiency improves dramatically.

Who Should Read

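AI engineers who design LLM-based agents and are wrestling with the exploration-exploitation trade-off. Developers building sequential decision-making systems such as recommender systems, customer service automation, and game AI.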

Core Mechanics

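Existing LLM agents (Reflexion, ICRL, and others) leave the question of how to explore to the LLM itself, and this approach fails to explore properly even in simple environments.
Assigning each step of the decades-old, well-studied PSRL (Posterior Sampling for RL) algorithm to a separate LLM instead improves exploration efficiency substantially.
Three LLMs split the work: (1) a posterior updater, (2) a posterior sample generator, and (3) a policy that acts optimally with respect to the sample.
Model quality is decisive: with GPT-4o the agent incurs linear regret on RiverSwim, but swapping in o1-mini brings it up to the level of classic PSRL.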
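On Wordle, LLM-based PSRL with GPT-4o holds up against the strongest baseline even when that baseline runs on DeepSeek-R1: the baseline shows no statistically significant advantage.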
The paper also presents preliminary results on implementing Information-Directed Sampling (IDS) with LLMs to get past the limitations of Thompson Sampling.
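
For orientation, the classic tabular loop that the three LLM roles stand in for can be sketched as below. This is a minimal sketch, not the paper's code: the two-state toy MDP, the Dirichlet prior over transitions, the known reward table, and all function names are illustrative assumptions.

```python
import random

N_STATES, N_ACTIONS, HORIZON = 2, 2, 5

# Hypothetical toy MDP (an assumption for illustration): action 1 usually
# leads to state 1, and only arriving in state 1 pays reward.
TRUE_P = {
    (0, 0): [1.0, 0.0], (0, 1): [0.2, 0.8],
    (1, 0): [1.0, 0.0], (1, 1): [0.1, 0.9],
}
REWARD = {0: 0.0, 1: 1.0}  # reward for the state reached; assumed known


def sample_dirichlet(counts, rng):
    """Draw one probability vector from Dirichlet(counts) via gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def plan(p_model):
    """Finite-horizon value iteration; returns policy[h][s] for the given model."""
    value = [0.0] * N_STATES
    policy = []
    for _ in range(HORIZON):
        q = [[sum(p_model[(s, a)][s2] * (REWARD[s2] + value[s2])
                  for s2 in range(N_STATES))
              for a in range(N_ACTIONS)]
             for s in range(N_STATES)]
        policy.append([max(range(N_ACTIONS), key=lambda a: q[s][a])
                       for s in range(N_STATES)])
        value = [max(q[s]) for s in range(N_STATES)]
    policy.reverse()  # built backward in time, so flip: policy[0] is the first step
    return policy


def run_psrl(episodes, seed=0):
    rng = random.Random(seed)
    # Posterior = Dirichlet counts per (state, action), starting from a uniform prior.
    counts = {sa: [1.0] * N_STATES for sa in TRUE_P}
    returns = []
    for _ in range(episodes):
        # Role (2): sample one plausible model from the posterior.
        sampled = {sa: sample_dirichlet(c, rng) for sa, c in counts.items()}
        # Role (3): act optimally for that sample during the whole episode.
        policy = plan(sampled)
        state, total = 0, 0.0
        for h in range(HORIZON):
            action = policy[h][state]
            nxt = rng.choices(range(N_STATES), weights=TRUE_P[(state, action)])[0]
            total += REWARD[nxt]
            # Role (1): fold the observed transition into the posterior counts.
            # The policy stays fixed until the next episode resamples.
            counts[(state, action)][nxt] += 1.0
            state = nxt
        returns.append(total)
    return returns, counts


returns, counts = run_psrl(episodes=50)
print("mean return over last 10 episodes:", sum(returns[-10:]) / 10)
```

In the LLM-based variant, the count table becomes a natural-language belief, the Dirichlet draw becomes a generated hypothesis, and `plan` becomes a policy LLM conditioned on that hypothesis.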

Evidence

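On a 5-armed Bernoulli bandit, LLM-PSRL with κ_sampling = 1.2 achieves lower cumulative regret than classic Thompson Sampling (at T = 100).
On RiverSwim (3 states), GPT-4o-based PSRL incurs linear regret; switching to o1-mini yields sublinear regret on par with classic PSRL.
On a combination lock with 720 possible codes, where a random guess succeeds less than 0.14% of the time (1/720 ≈ 0.139%), LLM-PSRL achieves lower cumulative regret than every baseline.
Across 40 Wordle trials, the strongest baseline (ICRL) running on DeepSeek-R1 fails to outperform GPT-4o-based LLM-PSRL by a statistically significant margin.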
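
As a reference point for the bandit result, the classic Thompson Sampling baseline and its cumulative regret can be sketched as follows. Only the 5-armed Bernoulli setup and T = 100 come from the text above; the arm means, the seed, and the function name are illustrative assumptions, not the paper's experimental settings.

```python
import random


def thompson_sampling(arm_means, horizon=100, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns cumulative regret after each step."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    alpha = [1.0] * n_arms  # Beta(1, 1) prior: one pseudo-success per arm
    beta = [1.0] * n_arms   # and one pseudo-failure per arm
    best_mean = max(arm_means)
    regret, curve = 0.0, []
    for _ in range(horizon):
        # Sample a plausible mean for each arm, then act greedily on the sample.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < arm_means[arm] else 0
        # Conjugate update: a success bumps alpha, a failure bumps beta.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        # Expected regret: gap between the best arm's mean and the chosen arm's mean.
        regret += best_mean - arm_means[arm]
        curve.append(regret)
    return curve


# Hypothetical arm means, chosen only for illustration.
ARM_MEANS = [0.1, 0.3, 0.5, 0.7, 0.9]
curve = thompson_sampling(ARM_MEANS)
print(f"cumulative regret after {len(curve)} steps: {curve[-1]:.2f}")
```

A curve that flattens over time indicates sublinear regret; a curve that keeps climbing at a constant slope indicates linear regret.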

How to Apply

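Rather than prompting the LLM agent to "explore", split posterior updating, posterior sampling, and optimal action selection into separate LLM calls and orchestrate them in the order PSRL prescribes.
When specifying the prior in natural language, naming a statistical distribution such as "Beta(1,1)" steers the posterior-update LLM toward tracking visit counts on its own, so state your domain knowledge explicitly in the prior string.
In complex stochastic environments (the larger the state-action space, the more this matters), use a reasoning-focused model such as o1-mini or DeepSeek-R1 in place of GPT-4o for the posterior updater and sampler; in simple environments GPT-4o is sufficient.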
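
To make the point about distribution names concrete: the bookkeeping that a Beta(1,1) prior nudges the updater toward is plain success and failure counting. The sketch below shows that exact update and one way the result might be phrased as a natural-language posterior. The function names and the wording of the rendered text are hypothetical, not taken from the paper's prompts.

```python
def update_counts(counts, arm, reward):
    """Return new {arm: (successes, failures)} after one Bernoulli observation."""
    successes, failures = counts.get(arm, (0, 0))
    new_counts = dict(counts)  # copy so the previous belief is left untouched
    new_counts[arm] = (successes + reward, failures + (1 - reward))
    return new_counts


def describe_posterior(counts, n_arms):
    """Render Beta(1 + successes, 1 + failures) beliefs as plain text."""
    lines = []
    for arm in range(n_arms):
        s, f = counts.get(arm, (0, 0))
        lines.append(
            f"Arm {arm}: pulled {s + f} times, {s} successes; "
            f"belief Beta({1 + s}, {1 + f})"
        )
    return "\n".join(lines)


counts = {}
for arm, reward in [(0, 1), (0, 0), (2, 1)]:
    counts = update_counts(counts, arm, reward)
print(describe_posterior(counts, n_arms=3))
```

A posterior written this way keeps the visit counts visible, so a later update can be checked against the trajectory it claims to summarize.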

Code Example

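# Core loop of LLM-based PSRL (pseudocode).
# The .call(system=..., user=...) interface stands in for whatever LLM client you use.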

SYSTEM_POSTERIOR_UPDATER = """
You are a Bayesian posterior distribution for a sequential decision-making problem.
Given a current prior belief and trajectory observation, produce the updated posterior
that accurately reflects knowledge about environment transitions and rewards.
Never discard prior knowledge — only update it.
Environment: {env_description}
"""

SYSTEM_POSTERIOR_SAMPLER = """
Given the current posterior belief, generate ONE plausible hypothesis
(a concrete sample) for how transitions and rewards work in this environment.
Your sample must be consistent with the posterior constraints.
Start with 'You think' and provide only the hypothesis.
"""

SYSTEM_OPTIMAL_POLICY = """
Environment: {env_description}
Always select optimal actions that maximize value according to this hypothesis:
{posterior_sample}
Just say the action after 'Action:' and nothing else.
"""

def llm_psrl_episode(prior, env, llm_updater, llm_sampler, llm_policy):
    # Step 1: Sample one hypothesis from current posterior
    posterior_sample = llm_sampler.call(
        system=SYSTEM_POSTERIOR_SAMPLER,
        user=f"Current posterior: {prior}"
    )
    
    # Step 2: Act optimally w.r.t. sampled hypothesis for entire episode
    trajectory = []
    state = env.reset()
    for step in range(env.horizon):
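        reply = llm_policy.call(
            system=SYSTEM_OPTIMAL_POLICY.format(
                env_description=env.description,
                posterior_sample=posterior_sample
            ),
            user=f"Current state: {state}"
        )
        # The prompt asks for "Action: <action>", so keep only the text after the marker.
        action = reply.split("Action:")[-1].strip()
        next_state, reward = env.step(action)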
        trajectory.append((state, action, reward, next_state))
        state = next_state
    
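    # Step 3: Update posterior with full trajectory
    # Render each (state, action, reward, next_state) step as one line of text
    # (one possible format; adapt it to your environment).
    trajectory_text = "\n".join(
        f"state={s}, action={a}, reward={r}, next_state={s2}"
        for s, a, r, s2 in trajectory
    )
    new_posterior = llm_updater.call(
        system=SYSTEM_POSTERIOR_UPDATER.format(
            env_description=env.description
        ),
        user=f"Prior: {prior}\nTrajectory:\n{trajectory_text}"
    )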
    
    return new_posterior, trajectory

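# Main loop. K (number of episodes), env, and the three LLM clients are assumed to be defined by the caller.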
prior = "Beta(1,1) prior for each arm — all actions equally likely to be optimal"
for episode in range(K):
    prior, traj = llm_psrl_episode(prior, env, updater_llm, sampler_llm, policy_llm)
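
Before wiring in real models, it can help to smoke-test the orchestration offline. The sketch below provides a canned-reply stand-in for an LLM client and a trivial environment, both matching the shapes the pseudocode above expects (`call(system=..., user=...)`, `description`, `horizon`, `reset()`, `step()`). Both classes are hypothetical helpers for testing, not part of the paper or of any SDK.

```python
class StubLLM:
    """Stand-in for an LLM client; always returns the same canned reply."""

    def __init__(self, reply):
        self.reply = reply
        self.calls = []  # record prompts so tests can inspect them

    def call(self, system, user):
        self.calls.append((system, user))
        return self.reply


class TwoArmEnv:
    """Hypothetical one-state, two-action environment for smoke tests."""

    description = "Two actions, 0 and 1. Action 1 always pays reward 1."
    horizon = 3

    def reset(self):
        return "start"

    def step(self, action):
        # Accept a bare action ("1") or a raw model reply ("Action: 1").
        choice = int(str(action).split("Action:")[-1].strip())
        reward = 1.0 if choice == 1 else 0.0
        return "start", reward


policy_llm = StubLLM("Action: 1")
env = TwoArmEnv()
state = env.reset()
reply = policy_llm.call(system="(policy prompt)", user=f"Current state: {state}")
next_state, reward = env.step(reply)
print(next_state, reward)
```

Passing three `StubLLM` instances and a `TwoArmEnv` to `llm_psrl_episode` exercises the control flow end to end, and the recorded `calls` show exactly which prompts each role received.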

Terminology

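PSRL: Short for Posterior Sampling for Reinforcement Learning. An exploration strategy in which the agent draws one plausible hypothesis about the world and acts as if that hypothesis were true. It is like picking one of several candidate maps and exploring by trusting that map.

Thompson Sampling: A method for choosing among uncertain options by picking each one at random in proportion to the probability that it is the best. The principle is that the less certain you are, the more varied your attempts.

Posterior: The belief (a probability distribution) after it has been updated with newly observed data. It is the initial belief (the prior) revised to reflect actual experience.

Cumulative Regret: The total reward the agent has missed out on by taking actions other than the optimal one. Lower means more efficient learning.

MDP: Short for Markov Decision Process. A mathematical framework for sequential decision problems that proceed state → action → reward → next state. Used to model games, robot control, and similar tasks.

ICL: Short for In-Context Learning. Getting an LLM to follow a pattern by placing examples in the prompt, without changing model weights. The model learns from few-shot examples alone, with no fine-tuning.

Exploration vs Exploitation: The trade-off between exploring (trying something new) and exploiting (choosing the best option currently known). It is the same dilemma as trying a new restaurant versus returning to a favorite.

Linear Regret: Regret that grows in proportion to the number of episodes. A sign that the agent is not converging on optimal behavior.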
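
The two regret terms above can be written out as follows. The notation (K episodes, V* for the optimal value, π_k for the policy followed in episode k, s_1 for the initial state) is standard but is an assumption here, since this summary introduces no symbols of its own.

```latex
\mathrm{Regret}(K) = \sum_{k=1}^{K} \left( V^{*}(s_1) - V^{\pi_k}(s_1) \right)

\text{linear: } \mathrm{Regret}(K) = \Theta(K)
\qquad
\text{sublinear: } \lim_{K \to \infty} \frac{\mathrm{Regret}(K)}{K} = 0
```

Under linear regret the average per-episode loss never shrinks; under sublinear regret it goes to zero, meaning the agent's behavior approaches optimal over time.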

Related Resources

Customer Service dataset (Paprika)

Original Abstract

A burgeoning area within reinforcement learning (RL) is the design of sequential decision-making agents centered around large language models (LLMs). While autonomous decision-making agents powered by modern LLMs could facilitate numerous real-world applications, such successes demand agents that are capable of data-efficient RL. One key obstacle to achieving data efficiency in RL is exploration, a challenge that we demonstrate many recent proposals for LLM agent designs struggle to contend with. Meanwhile, classic algorithms from the RL literature known to gracefully address exploration require technical machinery that can be challenging to operationalize in purely natural language settings. In this work, rather than relying on finetuning or in-context learning to coax LLMs into implicitly imitating a RL algorithm, we illustrate how LLMs can be used to explicitly implement an existing RL algorithm (Posterior Sampling for Reinforcement Learning) whose capacity for statistically-efficient exploration is already well-studied. We offer empirical results demonstrating how our LLM-based implementation of a known, data-efficient RL algorithm can be considerably more effective in natural language tasks that demand prudent exploration.