Online Experiential Learning for Language Models
TL;DR Highlight
An LLM framework that keeps learning from real-world usage after deployment — no reward functions, no human labeling needed.
Who Should Read
ML engineers deploying LLM agents to production who want continuous performance improvement, especially those designing post-deployment model-update pipelines.
Core Mechanics
- A two-stage loop that automatically extracts 'experiential knowledge' from the text trajectories of a deployed model's interactions with real environments, then consolidates that knowledge into the model's parameters
- Fully reward-free online learning using only text feedback — no reward functions, reward models, or human annotations needed
- On-Policy Context Distillation (using the model's own generated responses as training data) prevents OOD performance degradation (catastrophic forgetting)
- Extracted experiential knowledge is far more effective than raw trajectories — 21.4% vs 7.8% Sokoban pass rate after consolidation
- The model's own experiential knowledge (Qwen3-1.7B) is more effective than a larger model's (Qwen3-4B) — on-policy consistency is what matters
- Response length decreases with iterations, improving token efficiency — ~70% of original response length after 3 rounds on Frozen Lake
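A toy sketch of the loop these bullets describe (all names and function bodies below are illustrative stand-ins, not the paper's implementation):

```python
# Toy skeleton of the OEL loop; the real stages would call a deployed
# LLM and an environment. All function bodies here are illustrative stubs.

def collect_trajectories(experience: str) -> list[str]:
    # Stage 1a (user side): roll out the current model, conditioned on
    # the accumulated experience, and log the text trajectories.
    return [f"obs -> action -> feedback (guided by: {experience or 'nothing'})"]

def extract_experience(trajectories: list[str], previous: str) -> str:
    # Stage 1b: prompt the model itself to compress trajectories into
    # general '- EXPERIENCE ITEM:' entries (no rewards, no labels).
    new_item = f"- EXPERIENCE ITEM: lesson distilled from {len(trajectories)} trajectories"
    return (previous + "\n" + new_item).strip()

def consolidate(experience: str) -> None:
    # Stage 2 (server side, no environment access): on-policy context
    # distillation pushes the experience into the model weights.
    pass

experience = ""
for _ in range(3):  # each round's improved model yields richer experience
    trajectories = collect_trajectories(experience)
    experience = extract_experience(trajectories, experience)
    consolidate(experience)
```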
Evidence
- Sokoban: experiential knowledge 18.2% (in-context) / 21.4% (after consolidation) vs raw trajectories 10.9% / 7.8%
- Frozen Lake with Qwen3-1.7B: its own experiential knowledge 23.8% (in-context) / 31.1% (after consolidation) vs knowledge from the larger Qwen3-4B 18.0% / 22.7%
- On-policy context distillation maintains IF-Eval OOD accuracy at initial model level (~66-67%), while off-policy shows clear degradation over training
- Consistent pass rate improvement per OEL round across Qwen3-1.7B, 4B, and 8B — effective regardless of model size
How to Apply
- After deploying an agent service, collect multi-turn conversation logs with users, and periodically prompt the same model over those logs to extract experiential knowledge in the '- EXPERIENCE ITEM:' format.
- Condition a teacher copy of the same model on the extracted experiential knowledge, then fine-tune with reverse KL divergence on the model's own generated responses — this runs server-side without environment access.
- Directly applicable to environments with text feedback like RAG or game agents — especially useful for open-domain agents where reward design is difficult.
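The consolidation step above can be sketched as a reverse-KL loss in PyTorch. This is a minimal sketch, assuming teacher and student logits come from the same model scored with and without the experience prepended to the context; the function name is ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL, KL(student || teacher), on the student's OWN samples.

    Both tensors are (batch, seq_len, vocab). The teacher is the same
    model with experiential knowledge prepended to its context; the
    student sees only the bare prompt. Because the responses are sampled
    from the student (on-policy), minimizing this pulls the student
    toward the knowledge-conditioned distribution while staying on its
    own support, which is what protects OOD performance.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # sum_v p_student(v) * (log p_student(v) - log p_teacher(v))
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()
```

In training, the teacher logits would be computed under `torch.no_grad()` so that only the student receives gradients.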
Code Example
# Experiential knowledge extraction prompt (structured format)
prompt_template = """
You are an AI language model that continuously refines its internal experience.
Here is the interaction history (the environment (input) and your response and action (output)):
{latest_experience}
Here is the previous experience:
# Experience
{previous_experience}
Your task:
Based on the multi-round interaction history, generate experience for future learning.
Conduct a deep, comparative analysis to infer the rules and the fundamental principles behind success and failure.
Organize insights into 1-2 concise, high-level, widely applicable experience items.
Rules:
- Format MUST be:
- EXPERIENCE ITEM: ...
- Do NOT repeat previous experience.
- Make experience general, not specific to the current case.
Additional Experience:
# Experience
- EXPERIENCE ITEM:
"""
# Solve new problems by attaching experience knowledge to context
solving_prompt = """
You are an agent acting as a reasoning engine.
Your decisions are based on the experience you have learned.
This experience may be incomplete or incorrect.
Given experience:
{experience}
Current situation:
{prompt}
What action do you take?
"""Terminology
Original Abstract
The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.