Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation
TL;DR Highlight
Compresses an AI coding agent's conversation history 11x into searchable memory, with almost no loss in vector-search retrieval quality.
Who Should Read
Developers building coding agents or other long-running AI agents that accumulate large conversation histories, and teams working on agent memory systems.
Core Mechanics
- Long coding agent conversation histories become expensive and slow due to context length limits
- Proposed a compression approach that reduces conversation history by 11x while maintaining searchability
- Compressed history is stored as structured, retrievable memory rather than raw text: each exchange becomes a compound object with four fields (exchange_core, specific_context, room_assignments, files_touched)
- Vector search over compressed memory achieves near-identical retrieval quality vs full history
- The compression preserves semantically important information while discarding redundant details
- Works with any coding agent framework without architectural changes
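The structured memory object can be sketched as a small dataclass. The four field names (exchange_core, specific_context, room_assignments, files_touched) come from the paper's compound-object schema; the class name and example values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class DistilledExchange:
    # Field names follow the paper's compound-object schema
    exchange_core: str       # 1-2 sentences: what was accomplished or decided
    specific_context: str    # one exact detail: number, error, parameter, path
    room_assignments: list   # 1-3 thematic rooms (file / concept / workflow)
    files_touched: list      # regex-extracted file paths, not LLM output

    def searchable_text(self) -> str:
        # Only the two distilled text fields are embedded for vector search
        return f"{self.exchange_core}\n{self.specific_context}"

ex = DistilledExchange(
    exchange_core="Fixed the flaky retry test by pinning the mock clock.",
    specific_context="tests/test_retry.py",
    room_assignments=[{"room_type": "file", "room_key": "tests/test_retry.py",
                       "room_label": "retry tests", "relevance": 0.9}],
    files_touched=["tests/test_retry.py"],
)
```

Keeping files_touched out of the embedded text mirrors the paper's split: exact file paths are recovered deterministically by regex, while the embedding carries only the distilled semantics.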
Evidence
- 11x compression on coding agent conversation histories: average exchange length drops from 371 to 38 tokens across 14,340 exchanges from 6 software engineering projects
- Vector search over distilled memory is statistically indistinguishable from the full-history baseline (all 20 vector configurations non-significant after Bonferroni correction); BM25 over distilled text, by contrast, degrades significantly
- The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759 vs 0.745); the best pure distilled configuration reaches 96% of the best verbatim MRR
- At 1/11 the context cost, thousands of exchanges fit within a single prompt, cutting token costs and latency for long-running agent sessions
How to Apply
- Integrate the compression layer after each conversation turn to incrementally compress and store old context
- Use the vector search interface for retrieving relevant past context instead of including full history in every prompt
- Set a compression trigger threshold (e.g., after N turns or M tokens) to balance freshness vs compression ratio
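The trigger threshold in the last step can be sketched as a small stateful check. The class name and default thresholds are illustrative; the paper does not prescribe specific values:

```python
class CompressionTrigger:
    """Decide when buffered turns should be distilled into the memory layer.

    Thresholds are illustrative defaults: lower values keep the live
    context fresher, higher values batch more turns per distillation call.
    """

    def __init__(self, max_turns: int = 20, max_tokens: int = 8000):
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.turns = 0
        self.tokens = 0

    def add_turn(self, token_count: int) -> bool:
        """Record one turn; return True when the buffer should be
        distilled and reset (either threshold reached)."""
        self.turns += 1
        self.tokens += token_count
        if self.turns >= self.max_turns or self.tokens >= self.max_tokens:
            self.turns = 0
            self.tokens = 0
            return True
        return False
```

In an agent loop, a True return would kick off distillation of the buffered exchanges into compound objects and clear them from the live prompt.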
Code Example
# Distillation prompt (based on Appendix B, using Claude Haiku 4.5)
prompt = """
Distill this conversation exchange into JSON:
- "exchange_core": 1-2 sentences. What was accomplished or decided?
Use the specific terms from the exchange. Do not invent details
not present in the text.
- "specific_context": One concrete detail from the text: a number,
error message, parameter name, or file path. Copy it exactly.
- "room_assignments": 1-3 rooms. Each room is a topic this exchange
belongs to. {"room_type": "<file|concept|workflow>",
"room_key": "<identifier>", "room_label": "<short label>",
"relevance": <0.0-1.0>}
Project: {project_id}
Exchange (messages {ply_start}-{ply_end}):
{messages_text}
Respond with ONLY valid JSON.
"""
# files_touched is NOT LLM-generated — extracted via regex
import re
def extract_files_touched(exchange_text):
    # Match common source/config file extensions; dedupe via set
    pattern = r'[\w./\-]+\.(?:py|ts|js|go|rs|yaml|json|toml|md)'
    return list(set(re.findall(pattern, exchange_text)))
# Embedding + indexing
from sentence_transformers import SentenceTransformer
import faiss
model = SentenceTransformer('all-MiniLM-L6-v2') # 22M params, CPU OK
def build_distill_index(palace_objects):
    # Embed only the distilled text fields of each compound object
    texts = [f"{obj['exchange_core']}\n{obj['specific_context']}"
             for obj in palace_objects]
    embeddings = model.encode(texts, show_progress_bar=True)
    index = faiss.IndexFlatL2(384)  # Exact search; 384 = MiniLM embedding dim
    index.add(embeddings)
    return index, texts
# Cross-layer search: BM25 on verbatim + vector search on distilled
from rank_bm25 import BM25Okapi
def cross_layer_search(query, verbatim_texts, distilled_texts,
                       distill_index, top_k=10):
    # BM25 over verbatim texts (lexical signal)
    tokenized = [t.split() for t in verbatim_texts]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(query.split())
    # Vector search over distilled texts (semantic signal)
    q_emb = model.encode([query])
    _, vec_indices = distill_index.search(q_emb, top_k)
    # Reciprocal Rank Fusion (k=60); ids assume the verbatim and distilled
    # corpora are aligned, so index i names the same exchange in both
    bm25_ranks = {i: r + 1 for r, i in
                  enumerate(bm25_scores.argsort()[::-1][:top_k])}
    vec_ranks = {i: r + 1 for r, i in enumerate(vec_indices[0])}
    all_ids = set(bm25_ranks) | set(vec_ranks)
    rrf_scores = {i: 1 / (60 + bm25_ranks.get(i, top_k + 1)) +
                     1 / (60 + vec_ranks.get(i, top_k + 1))
                  for i in all_ids}
    return sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
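The fusion step can be sanity-checked in isolation. This dependency-free sketch (hand-built rank dictionaries; the helper name rrf_fuse is illustrative) mirrors the RRF scoring above:

```python
def rrf_fuse(rank_lists, k=60, miss_rank=11):
    # Reciprocal Rank Fusion: each ranking contributes 1/(k + rank);
    # ids absent from a ranking are charged a fixed penalty rank,
    # mirroring the top_k + 1 fallback used in cross_layer_search.
    all_ids = set().union(*rank_lists)
    scores = {i: sum(1 / (k + ranks.get(i, miss_rank))
                     for ranks in rank_lists)
              for i in all_ids}
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranks = {3: 1, 7: 2, 1: 3}  # doc id -> rank from BM25
vec_ranks = {3: 1, 7: 2, 5: 3}   # doc id -> rank from vector search
fused = rrf_fuse([bm25_ranks, vec_ranks])
```

Documents ranked highly by both mechanisms (ids 3 and 7 here) dominate, while documents seen by only one retriever still surface with a discounted score.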
Original Abstract
Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.