ShopSimulator: Evaluating and Exploring RL-Driven LLM Agent for Shopping Assistants
TL;DR Highlight
A Chinese shopping simulator where even GPT-5 achieves only a 32% success rate; the paper evaluates LLM agents' shopping ability and improves performance by up to 40%p with an SFT+RL combination
Who Should Read
ML engineers building LLM-based shopping assistants or recommendation agents on e-commerce platforms. Also useful for developers designing agents handling personalization and multi-turn dialogue together.
Core Mechanics
- Even GPT-5 reaches only a 32% overall success rate, and the best model stays under 40% — current LLMs aren't yet reliable shopping assistants
- Biggest failure cause: attribute/option matching failures on details like color and size (categories and price are matched well, but fine-grained requirements fail)
- Personalization errors are polarized: ignoring cues (55.8%) or over-interpreting them (35.7%) — striking the right balance is the core challenge
- SFT (success trajectory imitation) + RL (GRPO reward optimization) combination outperforms standalone RL across all scenarios — SFT provides task flow priors while RL fine-tunes preferences
- Multiplicative strict reward consistently beats additive loose reward — bottleneck effect focusing on the weakest condition
- Multi-turn + personalization scenarios plateau at ~35% even after training — maintaining personalization, intent clarification, and environment actions simultaneously over time is the current LLM ceiling
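The GRPO step in the SFT+RL recipe optimizes group-relative advantages: rewards for a group of sampled trajectories are normalized against the group's own mean and spread. A minimal sketch of that normalization (the sample rewards and the zero-variance fallback are illustrative assumptions, not the paper's exact implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: each sampled
    trajectory's reward is centered on the group mean and scaled by
    the group standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

# Three sampled trajectories with different rewards
print([round(a, 3) for a in grpo_advantages([0.0, 0.5, 1.0])])
# -> [-1.225, 0.0, 1.225]
```

Advantages sum to zero within the group, so trajectories are only pushed up or down relative to their siblings rather than against an absolute baseline.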
Evidence
- Qwen3-8B with SFT+RL and the strict reward (R_strict): Single-Turn success 14.13% → 38.89% (+24.76%p)
- Single-Turn & Personalization: SFT+RL success 17.24% → 57.33% (+40.09%p) — largest improvement across four scenarios
- BuyNow action error analysis: 45.63% 'purchased without checking details,' 31.31% 'purchased after user rejection' — agent's rash decisions are the main failure cause
- Qwen3-8B's multi-turn strict score drops 47% versus single-turn and its success rate drops 50% (Qwen3-235B drops only 13% and 14%) — smaller models degrade sharply in multi-turn settings
How to Apply
- For shopping agent system prompt design, reference Figure 7's structure — explicitly define stage rules and response format (Thought/Action_type/Action_content) for information gathering → environment interaction → purchase confirmation stages
- When feeding personalization info to LLMs, separate 'long-term preferences (brand, style)' from 'current request (color, size)' into distinct sections — bundling them causes over-interpretation errors
- Design RL reward functions as multiplicative instead of additive — when any condition falls short, the total score drops significantly, forcing focused improvement on weak dimensions (attributes, options)
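The multiplicative-vs-additive distinction can be sketched as follows (a minimal illustration assuming per-condition match scores in [0, 1] for category, price, attribute, and option; the paper's exact reward definition may differ):

```python
def additive_reward(scores):
    """Loose reward: average of per-condition scores.
    A single weak dimension is diluted by the strong ones."""
    return sum(scores.values()) / len(scores)

def multiplicative_reward(scores):
    """Strict reward: product of per-condition scores.
    Any weak dimension caps the total (bottleneck effect)."""
    result = 1.0
    for v in scores.values():
        result *= v
    return result

# Attribute matching is weak; everything else is perfect
scores = {"category": 1.0, "price": 1.0, "attribute": 0.3, "option": 1.0}
print(additive_reward(scores))        # -> 0.825
print(multiplicative_reward(scores))  # -> 0.3
```

Under the additive scheme the agent still collects 0.825 despite failing on attributes, so the gradient pressure to fix that dimension is weak; under the multiplicative scheme the whole reward collapses to 0.3, forcing improvement on the weakest condition.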
Code Example
```python
# ShopSimulator-style agent system prompt structure (based on Figure 7)
SYSTEM_PROMPT = """
You are an intelligent shopping assistant.

## Response Format (must follow)
Thought: Reason for deciding the next action
Action_type: ask_shopper | interact_with_env
Action_content: Specific content

## Action Rules
- If interact_with_env:
  - search[keyword] : Search for products
  - click[value] : Click a button or product
  - click[buy now] : Immediate purchase
- If ask_shopper:
  - Collect unconfirmed information with open-ended questions

## Decision Logic
1. Insufficient information → ask_shopper
2. Action possible → interact_with_env
3. Always get final user confirmation before purchase
4. Never execute purchase if user declines
"""
```
```python
# Example of separating the personalization information layer
def build_personalized_prompt(user_profile, current_request):
    """Keep long-term preferences and the current request in distinct sections."""
    return f"""
{SYSTEM_PROMPT}

## User Long-term Preferences (for reference)
- Preferred brands: {user_profile['brand_preferences']}
- Preferred features: {user_profile['feature_preferences']}
- Size: {user_profile['size']}
- Price range: {user_profile['price_range']}

## Current Request (apply first)
{current_request}

Note: Use long-term preferences only as hints; if they conflict with the current request, prioritize the current request.
"""
```
Original Abstract
Large language model (LLM)-based agents are increasingly deployed in e-commerce shopping. To perform thorough, user-tailored product searches, agents should interpret personal preferences, engage in multi-turn dialogues, and ultimately retrieve and discriminate among highly similar products. However, existing research has yet to provide a unified simulation environment that consistently captures all of these aspects, and always focuses solely on evaluation benchmarks without training support. In this paper, we introduce ShopSimulator, a large-scale and challenging Chinese shopping environment. Leveraging ShopSimulator, we evaluate LLMs across diverse scenarios, finding that even the best-performing models achieve less than 40% full-success rate. Error analysis reveals that agents struggle with deep search and product selection in long trajectories, fail to balance the use of personalization cues, and to effectively engage with users. Further training exploration provides practical guidance for overcoming these weaknesses, with the combination of supervised fine-tuning (SFT) and reinforcement learning (RL) yielding significant performance improvements. Code and data will be released at https://github.com/ShopAgent-Team/ShopSimulator.