ShopSimulator: Evaluating and Exploring RL-Driven LLM Agent for Shopping Assistants
TL;DR Highlight
A Chinese shopping simulator where even GPT-5 achieves only a 32% success rate; the paper evaluates LLM agents' shopping ability and improves performance by up to 40%p with an SFT+RL combination
Who Should Read
ML engineers building LLM-based shopping assistants or recommendation agents on e-commerce platforms. Also useful for developers designing agents handling personalization and multi-turn dialogue together.
Core Mechanics
- Even GPT-5 reaches only a 32% overall success rate, and the best model stays under 40% — current LLMs aren't yet reliable shopping assistants
- Biggest failure cause: attribute/option matching failures on details like color and size (categories and price are matched well, but fine-grained requirements fail)
- Personalization errors are polarized: ignoring cues (55.8%) or over-interpreting them (35.7%) — striking the right balance is the core challenge
- SFT (success trajectory imitation) + RL (GRPO reward optimization) combination outperforms standalone RL across all scenarios — SFT provides task flow priors while RL fine-tunes preferences
- Multiplicative strict reward consistently beats additive loose reward — bottleneck effect focusing on the weakest condition
- Multi-turn + personalization scenarios plateau at ~35% even after training — maintaining personalization, intent clarification, and environment actions simultaneously over time is the current LLM ceiling
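The GRPO step in the SFT+RL recipe optimizes group-relative advantages: rewards for a group of sampled trajectories are normalized against the group's own mean and spread. A minimal sketch of that normalization (the sample rewards and the zero-variance fallback are illustrative assumptions, not the paper's exact implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: each sampled
    trajectory's reward is centered on the group mean and scaled by
    the group standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

# Three sampled trajectories with different rewards
print([round(a, 3) for a in grpo_advantages([0.0, 0.5, 1.0])])
# -> [-1.225, 0.0, 1.225]
```

Advantages sum to zero within the group, so trajectories are only pushed up or down relative to their siblings rather than against an absolute baseline.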
Evidence
- Qwen3-8B with SFT+RL and the strict reward (R_strict): Single-Turn success 14.13% → 38.89% (+24.76%p)
- Single-Turn & Personalization: SFT+RL success 17.24% → 57.33% (+40.09%p) — largest improvement across four scenarios
- BuyNow action error analysis: 45.63% 'purchased without checking details,' 31.31% 'purchased after user rejection' — agent's rash decisions are the main failure cause
- Qwen3-8B's multi-turn strict score drops 47% versus single-turn and its success rate drops 50% (Qwen3-235B drops only 13% and 14%) — smaller models degrade sharply in multi-turn settings
How to Apply
- For shopping agent system prompt design, reference Figure 7's structure — explicitly define stage rules and response format (Thought/Action_type/Action_content) for information gathering → environment interaction → purchase confirmation stages
- When feeding personalization info to LLMs, separate 'long-term preferences (brand, style)' from 'current request (color, size)' into distinct sections — bundling them causes over-interpretation errors
- Design RL reward functions as multiplicative instead of additive — when any condition falls short, the total score drops significantly, forcing focused improvement on weak dimensions (attributes, options)
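The multiplicative-vs-additive distinction can be sketched as follows (a minimal illustration assuming per-condition match scores in [0, 1] for category, price, attribute, and option; the paper's exact reward definition may differ):

```python
def additive_reward(scores):
    """Loose reward: average of per-condition scores.
    A single weak dimension is diluted by the strong ones."""
    return sum(scores.values()) / len(scores)

def multiplicative_reward(scores):
    """Strict reward: product of per-condition scores.
    Any weak dimension caps the total (bottleneck effect)."""
    result = 1.0
    for v in scores.values():
        result *= v
    return result

# Attribute matching is weak; everything else is perfect
scores = {"category": 1.0, "price": 1.0, "attribute": 0.3, "option": 1.0}
print(additive_reward(scores))        # -> 0.825
print(multiplicative_reward(scores))  # -> 0.3
```

Under the additive scheme the agent still collects 0.825 despite failing on attributes, so the gradient pressure to fix that dimension is weak; under the multiplicative scheme the whole reward collapses to 0.3, forcing improvement on the weakest condition.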
Code Example
```python
# ShopSimulator-style agent system prompt structure (based on Figure 7)
SYSTEM_PROMPT = """
You are an intelligent shopping assistant.

## Response Format (must follow)
Thought: Reason for deciding the next action
Action_type: ask_shopper | interact_with_env
Action_content: Specific content

## Action Rules
- If interact_with_env:
  - search[keyword] : Search for products
  - click[value] : Click a button or product
  - click[buy now] : Immediate purchase
- If ask_shopper:
  - Collect unconfirmed information with open-ended questions

## Decision Logic
1. Insufficient information → ask_shopper
2. Action possible → interact_with_env
3. Always get final user confirmation before purchase
4. Never execute purchase if user declines
"""
```
```python
# Example of separating the personalization information layer
def build_personalized_prompt(user_profile, current_request):
    """Keep long-term preferences and the current request in distinct sections."""
    return f"""
{SYSTEM_PROMPT}

## User Long-term Preferences (for reference)
- Preferred brands: {user_profile['brand_preferences']}
- Preferred features: {user_profile['feature_preferences']}
- Size: {user_profile['size']}
- Price range: {user_profile['price_range']}

## Current Request (apply first)
{current_request}

Note: Use long-term preferences only as hints; if they conflict with the current request, prioritize the current request.
"""
```
Original Abstract
Large language model (LLM)-based agents are increasingly deployed in e-commerce shopping. To perform thorough, user-tailored product searches, agents should interpret personal preferences, engage in multi-turn dialogues, and ultimately retrieve and discriminate among highly similar products. However, existing research has yet to provide a unified simulation environment that consistently captures all of these aspects, and always focuses solely on evaluation benchmarks without training support. In this paper, we introduce ShopSimulator, a large-scale and challenging Chinese shopping environment. Leveraging ShopSimulator, we evaluate LLMs across diverse scenarios, finding that even the best-performing models achieve less than 40% full-success rate. Error analysis reveals that agents struggle with deep search and product selection in long trajectories, fail to balance the use of personalization cues, and to effectively engage with users. Further training exploration provides practical guidance for overcoming these weaknesses, with the combination of supervised fine-tuning (SFT) and reinforcement learning (RL) yielding significant performance improvements. Code and data will be released at https://github.com/ShopAgent-Team/ShopSimulator.