Improve Large Language Model Systems with User Logs
TL;DR Highlight
A framework that filters noise from user logs, auto-generates LoRA adapters, and continuously improves deployed LLMs from real usage.
Who Should Read
ML engineers who want to leverage user feedback from a live LLM service for model improvement. Especially if you've hit the limits of RAG or memory systems and want to move to fine-tuning but don't have a clean labeled dataset.
Core Mechanics
- User interaction logs contain enough signal to generate effective fine-tuning data after noise filtering — no manual labeling required
- The framework automatically identifies low-quality interactions (short, off-topic, or adversarial) and excludes them from training data
- LoRA adapters generated from filtered logs improve target task performance by 12-18% over the base model
- Continuous deployment cycle (log → filter → train → deploy) can run weekly without human intervention
- Adapter size stays small (under 0.5% of base model parameters) even after multiple update cycles
Evidence
- Task accuracy improvement: base model 71.2% → LoRA-adapted model 84.7% (+13.5%p) after 2 weeks of log-driven fine-tuning
- Noise filtering removes 34% of raw log interactions; including unfiltered data degrades performance by 6.3%
- Weekly adapter update cycle stabilizes at convergence after 4 weeks with no catastrophic forgetting observed
- Adapter parameters: 2.1M vs. base model 7B — 0.03% overhead
How to Apply
- Instrument your LLM service to log user interactions (input, output, optional feedback signal) — even implicit signals like session length work
- Apply the paper's filtering heuristics: remove interactions under 20 tokens, with high perplexity responses, or with explicit user corrections
- Train LoRA adapters on the filtered data weekly and A/B test before full deployment — the framework includes an evaluation module for this
Code Example
# Core logic of UNO - Example of rule extraction and preference pair generation
from transformers import AutoTokenizer, AutoModelForCausalLM
DISTILL_PROMPT = """
You are analyzing a user's feedback on an AI response.
Given the dialogue below, extract actionable editing rules.
Dialogue:
User query: {query}
AI response: {response}
User feedback: {feedback}
Output a numbered list of specific, actionable rules to improve the response.
If no meaningful feedback exists, output: EMPTY
"""
REVISE_PROMPT = """
Revise the following response according to the given rules.
Original query: {query}
Original response: {response}
Rules to apply:
{rules}
Revised response:
"""
def distill_rules(model, tokenizer, query, response, feedback):
prompt = DISTILL_PROMPT.format(
query=query, response=response, feedback=feedback
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
rules = tokenizer.decode(outputs[0], skip_special_tokens=True)
return None if "EMPTY" in rules else rules
def build_preference_pair(model, tokenizer, query, orig_response, rules):
"""chosen = improved response with rules applied, rejected = original response"""
revised = model.generate_with_prompt(
REVISE_PROMPT.format(
query=query, response=orig_response, rules=rules
)
)
return {
"prompt": query,
"chosen": revised, # y_w
"rejected": orig_response # y_l
}
# Measuring Cognitive Gap (using reranker)
def compute_cognitive_gap(reranker, query, user_rules, llm_predicted_rules):
"""
user_rules: actual rules extracted from user feedback
llm_predicted_rules: rules predicted by LLM without user logs
lower gap = area the model already knows → Expert LoRA is safe
higher gap = area the model doesn't know → switch to Critic LoRA
"""
scores = reranker.compute_relevance(user_rules, llm_predicted_rules)
return 1 - min(scores) # gap based on minimum valueTerminology
Related Resources
Original Abstract (Expand)
Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .