Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation
TL;DR Highlight
Compresses an AI coding agent's conversation history 11x into searchable memory, with almost no loss in vector-search retrieval quality.
Who Should Read
Developers building coding agents or other long-running AI agents that accumulate large conversation histories, and teams working on agent memory systems.
Core Mechanics
- Long coding agent conversation histories become expensive and slow due to context length limits
- Proposed a compression approach that reduces conversation history by 11x while maintaining searchability
- Compressed history is stored as structured, retrievable memory rather than raw text: each exchange becomes a compound object with four fields (exchange_core, specific_context, room_assignments, files_touched)
- Vector search over compressed memory achieves near-identical retrieval quality vs full history
- The compression preserves semantically important information while discarding redundant details
- Works with any coding agent framework without architectural changes
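The structured memory object can be sketched as a small dataclass. The four field names (exchange_core, specific_context, room_assignments, files_touched) come from the paper's compound-object schema; the class name and example values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class DistilledExchange:
    # Field names follow the paper's compound-object schema
    exchange_core: str       # 1-2 sentences: what was accomplished or decided
    specific_context: str    # one exact detail: number, error, parameter, path
    room_assignments: list   # 1-3 thematic rooms (file / concept / workflow)
    files_touched: list      # regex-extracted file paths, not LLM output

    def searchable_text(self) -> str:
        # Only the two distilled text fields are embedded for vector search
        return f"{self.exchange_core}\n{self.specific_context}"

ex = DistilledExchange(
    exchange_core="Fixed the flaky retry test by pinning the mock clock.",
    specific_context="tests/test_retry.py",
    room_assignments=[{"room_type": "file", "room_key": "tests/test_retry.py",
                       "room_label": "retry tests", "relevance": 0.9}],
    files_touched=["tests/test_retry.py"],
)
```

Keeping files_touched out of the embedded text mirrors the paper's split: exact file paths are recovered deterministically by regex, while the embedding carries only the distilled semantics.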
Evidence
- 11x compression on coding agent conversation histories: average exchange length drops from 371 to 38 tokens across 14,340 exchanges from 6 software engineering projects
- Vector search over distilled memory is statistically indistinguishable from the full-history baseline (all 20 vector configurations non-significant after Bonferroni correction); BM25 over distilled text, by contrast, degrades significantly
- The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759 vs 0.745); the best pure distilled configuration reaches 96% of the best verbatim MRR
- At 1/11 the context cost, thousands of exchanges fit within a single prompt, cutting token costs and latency for long-running agent sessions
How to Apply
- Integrate the compression layer after each conversation turn to incrementally compress and store old context
- Use the vector search interface for retrieving relevant past context instead of including full history in every prompt
- Set a compression trigger threshold (e.g., after N turns or M tokens) to balance freshness vs compression ratio
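The trigger threshold in the last step can be sketched as a small stateful check. The class name and default thresholds are illustrative; the paper does not prescribe specific values:

```python
class CompressionTrigger:
    """Decide when buffered turns should be distilled into the memory layer.

    Thresholds are illustrative defaults: lower values keep the live
    context fresher, higher values batch more turns per distillation call.
    """

    def __init__(self, max_turns: int = 20, max_tokens: int = 8000):
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.turns = 0
        self.tokens = 0

    def add_turn(self, token_count: int) -> bool:
        """Record one turn; return True when the buffer should be
        distilled and reset (either threshold reached)."""
        self.turns += 1
        self.tokens += token_count
        if self.turns >= self.max_turns or self.tokens >= self.max_tokens:
            self.turns = 0
            self.tokens = 0
            return True
        return False
```

In an agent loop, a True return would kick off distillation of the buffered exchanges into compound objects and clear them from the live prompt.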
Code Example
# Distillation prompt (based on Appendix B, using Claude Haiku 4.5)
prompt = """
Distill this conversation exchange into JSON:
- "exchange_core": 1-2 sentences. What was accomplished or decided?
Use the specific terms from the exchange. Do not invent details
not present in the text.
- "specific_context": One concrete detail from the text: a number,
error message, parameter name, or file path. Copy it exactly.
- "room_assignments": 1-3 rooms. Each room is a topic this exchange
belongs to. {"room_type": "<file|concept|workflow>",
"room_key": "<identifier>", "room_label": "<short label>",
"relevance": <0.0-1.0>}
Project: {project_id}
Exchange (messages {ply_start}-{ply_end}):
{messages_text}
Respond with ONLY valid JSON.
"""
# files_touched is NOT LLM-generated — extracted via regex
import re
def extract_files_touched(exchange_text):
    # Match common source/config file extensions; dedupe via set
    pattern = r'[\w./\-]+\.(?:py|ts|js|go|rs|yaml|json|toml|md)'
    return list(set(re.findall(pattern, exchange_text)))
# Embedding + indexing
from sentence_transformers import SentenceTransformer
import faiss
model = SentenceTransformer('all-MiniLM-L6-v2') # 22M params, CPU OK
def build_distill_index(palace_objects):
    # Embed only the distilled text fields of each compound object
    texts = [f"{obj['exchange_core']}\n{obj['specific_context']}"
             for obj in palace_objects]
    embeddings = model.encode(texts, show_progress_bar=True)
    index = faiss.IndexFlatL2(384)  # Exact search; 384 = MiniLM embedding dim
    index.add(embeddings)
    return index, texts
# Cross-layer search: BM25 on verbatim + vector search on distilled
from rank_bm25 import BM25Okapi
def cross_layer_search(query, verbatim_texts, distilled_texts,
                       distill_index, top_k=10):
    # BM25 over verbatim texts (lexical signal)
    tokenized = [t.split() for t in verbatim_texts]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(query.split())
    # Vector search over distilled texts (semantic signal)
    q_emb = model.encode([query])
    _, vec_indices = distill_index.search(q_emb, top_k)
    # Reciprocal Rank Fusion (k=60); ids assume the verbatim and distilled
    # corpora are aligned, so index i names the same exchange in both
    bm25_ranks = {i: r + 1 for r, i in
                  enumerate(bm25_scores.argsort()[::-1][:top_k])}
    vec_ranks = {i: r + 1 for r, i in enumerate(vec_indices[0])}
    all_ids = set(bm25_ranks) | set(vec_ranks)
    rrf_scores = {i: 1 / (60 + bm25_ranks.get(i, top_k + 1)) +
                     1 / (60 + vec_ranks.get(i, top_k + 1))
                  for i in all_ids}
    return sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
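The fusion step can be sanity-checked in isolation. This dependency-free sketch (hand-built rank dictionaries; the helper name rrf_fuse is illustrative) mirrors the RRF scoring above:

```python
def rrf_fuse(rank_lists, k=60, miss_rank=11):
    # Reciprocal Rank Fusion: each ranking contributes 1/(k + rank);
    # ids absent from a ranking are charged a fixed penalty rank,
    # mirroring the top_k + 1 fallback used in cross_layer_search.
    all_ids = set().union(*rank_lists)
    scores = {i: sum(1 / (k + ranks.get(i, miss_rank))
                     for ranks in rank_lists)
              for i in all_ids}
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranks = {3: 1, 7: 2, 1: 3}  # doc id -> rank from BM25
vec_ranks = {3: 1, 7: 2, 5: 3}   # doc id -> rank from vector search
fused = rrf_fuse([bm25_ranks, vec_ranks])
```

Documents ranked highly by both mechanisms (ids 3 and 7 here) dominate, while documents seen by only one retriever still surface with a discounted score.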
Original Abstract
Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.