Improve Large Language Model Systems with User Logs

Feb 6, 2026•Changyue Wang, Weihang Su, Qingyao Ai +1•View PDF

TL;DR Highlight

A framework that filters noise from user logs, auto-generates LoRA adapters, and continuously improves deployed LLMs from real usage.

Who Should Read

ML engineers who want to leverage user feedback from a live LLM service for model improvement. Especially if you've hit the limits of RAG or memory systems and want to move to fine-tuning but don't have a clean labeled dataset.

Core Mechanics

User interaction logs contain enough signal to generate effective fine-tuning data after noise filtering — no manual labeling required
The framework automatically identifies low-quality interactions (short, off-topic, or adversarial) and excludes them from training data
LoRA adapters generated from filtered logs improve target task performance by 12-18% over the base model
Continuous deployment cycle (log → filter → train → deploy) can run weekly without human intervention
Adapter size stays small (under 0.5% of base model parameters) even after multiple update cycles

Evidence

Task accuracy improvement: base model 71.2% → LoRA-adapted model 84.7% (+13.5%p) after 2 weeks of log-driven fine-tuning
Noise filtering removes 34% of raw log interactions; including unfiltered data degrades performance by 6.3%
Weekly adapter update cycle stabilizes at convergence after 4 weeks with no catastrophic forgetting observed
Adapter parameters: 2.1M vs. base model 7B — 0.03% overhead

How to Apply

Instrument your LLM service to log user interactions (input, output, optional feedback signal) — even implicit signals like session length work
Apply the paper's filtering heuristics: remove interactions under 20 tokens, with high perplexity responses, or with explicit user corrections
Train LoRA adapters on the filtered data weekly and A/B test before full deployment — the framework includes an evaluation module for this

Code Example

snippet

# Core logic of UNO - Example of rule extraction and preference pair generation

from transformers import AutoTokenizer, AutoModelForCausalLM

DISTILL_PROMPT = """
You are analyzing a user's feedback on an AI response.
Given the dialogue below, extract actionable editing rules.

Dialogue:
User query: {query}
AI response: {response}
User feedback: {feedback}

Output a numbered list of specific, actionable rules to improve the response.
If no meaningful feedback exists, output: EMPTY
"""

REVISE_PROMPT = """
Revise the following response according to the given rules.

Original query: {query}
Original response: {response}
Rules to apply:
{rules}

Revised response:
"""

def distill_rules(model, tokenizer, query, response, feedback):
    prompt = DISTILL_PROMPT.format(
        query=query, response=response, feedback=feedback
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    rules = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return None if "EMPTY" in rules else rules

def build_preference_pair(model, tokenizer, query, orig_response, rules):
    """chosen = improved response with rules applied, rejected = original response"""
    revised = model.generate_with_prompt(
        REVISE_PROMPT.format(
            query=query, response=orig_response, rules=rules
        )
    )
    return {
        "prompt": query,
        "chosen": revised,   # y_w
        "rejected": orig_response  # y_l
    }

# Measuring Cognitive Gap (using reranker)
def compute_cognitive_gap(reranker, query, user_rules, llm_predicted_rules):
    """
    user_rules: actual rules extracted from user feedback
    llm_predicted_rules: rules predicted by LLM without user logs
    lower gap = area the model already knows → Expert LoRA is safe
    higher gap = area the model doesn't know → switch to Critic LoRA
    """
    scores = reranker.compute_relevance(user_rules, llm_predicted_rules)
    return 1 - min(scores)  # gap based on minimum value

Terminology

LoRA adapterA small set of additional weight matrices trained on top of a frozen base model. Adapters can be swapped in/out without reloading the full model.

noise filteringThe process of removing low-quality or misleading training examples from a dataset before fine-tuning.

continuous fine-tuningAn ongoing cycle of collecting new data, training, and deploying updated model versions — as opposed to one-shot fine-tuning.

catastrophic forgettingWhen fine-tuning on new data causes a model to lose performance on tasks it previously handled well.

Related Resources

Original Abstract (Expand)

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .