Online Experiential Learning for Language Models
TL;DR Highlight
An LLM framework that keeps learning from real-world usage after deployment — no reward functions, no human labeling needed.
Who Should Read
ML engineers deploying LLM agents to production who want continuous performance improvement, especially those designing post-deployment model-update pipelines.
Core Mechanics
- A two-stage loop that automatically extracts 'experiential knowledge' from the text trajectories of a deployed model's interactions with real environments, then consolidates that knowledge into the model's parameters
- Fully reward-free online learning using only text feedback — no reward functions, reward models, or human annotations needed
- On-Policy Context Distillation (using the model's own generated responses as training data) prevents OOD performance degradation (catastrophic forgetting)
- Extracted experiential knowledge is far more effective than raw trajectories — 21.4% vs 7.8% Sokoban pass rate after consolidation
- The model's own experiential knowledge (Qwen3-1.7B) is more effective than a larger model's (Qwen3-4B) — on-policy consistency is what matters
- Response length decreases with iterations, improving token efficiency — ~70% of original response length after 3 rounds on Frozen Lake
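A toy sketch of the loop these bullets describe (all names and function bodies below are illustrative stand-ins, not the paper's implementation):

```python
# Toy skeleton of the OEL loop; the real stages would call a deployed
# LLM and an environment. All function bodies here are illustrative stubs.

def collect_trajectories(experience: str) -> list[str]:
    # Stage 1a (user side): roll out the current model, conditioned on
    # the accumulated experience, and log the text trajectories.
    return [f"obs -> action -> feedback (guided by: {experience or 'nothing'})"]

def extract_experience(trajectories: list[str], previous: str) -> str:
    # Stage 1b: prompt the model itself to compress trajectories into
    # general '- EXPERIENCE ITEM:' entries (no rewards, no labels).
    new_item = f"- EXPERIENCE ITEM: lesson distilled from {len(trajectories)} trajectories"
    return (previous + "\n" + new_item).strip()

def consolidate(experience: str) -> None:
    # Stage 2 (server side, no environment access): on-policy context
    # distillation pushes the experience into the model weights.
    pass

experience = ""
for _ in range(3):  # each round's improved model yields richer experience
    trajectories = collect_trajectories(experience)
    experience = extract_experience(trajectories, experience)
    consolidate(experience)
```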
Evidence
- Sokoban: experiential knowledge 18.2% (in-context) / 21.4% (after consolidation) vs raw trajectories 10.9% / 7.8%
- Frozen Lake with Qwen3-1.7B: its own experiential knowledge 23.8% (in-context) / 31.1% (after consolidation) vs knowledge from the larger Qwen3-4B 18.0% / 22.7%
- On-policy context distillation maintains IF-Eval OOD accuracy at initial model level (~66-67%), while off-policy shows clear degradation over training
- Consistent pass rate improvement per OEL round across Qwen3-1.7B, 4B, and 8B — effective regardless of model size
How to Apply
- After deploying an agent service, collect multi-turn conversation logs with users, and periodically prompt the same model over those logs to extract experiential knowledge in the '- EXPERIENCE ITEM:' format.
- Condition a teacher copy of the same model on the extracted experiential knowledge, then fine-tune with reverse KL divergence on the model's own generated responses — this runs server-side without environment access.
- Directly applicable to environments with text feedback like RAG or game agents — especially useful for open-domain agents where reward design is difficult.
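The consolidation step above can be sketched as a reverse-KL loss in PyTorch. This is a minimal sketch, assuming teacher and student logits come from the same model scored with and without the experience prepended to the context; the function name is ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL, KL(student || teacher), on the student's OWN samples.

    Both tensors are (batch, seq_len, vocab). The teacher is the same
    model with experiential knowledge prepended to its context; the
    student sees only the bare prompt. Because the responses are sampled
    from the student (on-policy), minimizing this pulls the student
    toward the knowledge-conditioned distribution while staying on its
    own support, which is what protects OOD performance.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # sum_v p_student(v) * (log p_student(v) - log p_teacher(v))
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()
```

In training, the teacher logits would be computed under `torch.no_grad()` so that only the student receives gradients.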
Code Example
# Experiential knowledge extraction prompt (structured format)
prompt_template = """
You are an AI language model that continuously refines its internal experience.
Here is the interaction history (the environment (input) and your response and action (output)):
{latest_experience}
Here is the previous experience:
# Experience
{previous_experience}
Your task:
Based on the multi-round interaction history, generate experience for future learning.
Conduct a deep, comparative analysis to infer the rules and the fundamental principles behind success and failure.
Organize insights into 1-2 concise, high-level, widely applicable experience items.
Rules:
- Format MUST be:
- EXPERIENCE ITEM: ...
- Do NOT repeat previous experience.
- Make experience general, not specific to the current case.
Additional Experience:
# Experience
- EXPERIENCE ITEM:
"""
# Solve new problems by attaching experience knowledge to context
solving_prompt = """
You are an agent acting as a reasoning engine.
Your decisions are based on the experience you have learned.
This experience may be incomplete or incorrect.
Given experience:
{experience}
Current situation:
{prompt}
What action do you take?
"""Terminology
Original Abstract
The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.