Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents
TL;DR Highlight
An LLM memory system that compresses conversations into semantic triples, cutting tokens by 95% while maintaining top-tier accuracy.
Who Should Read
Backend/AI developers building AI agents that support multi-turn conversations or long sessions, especially teams looking for context-cost reduction and conversation-history management strategies.
Core Mechanics
- Instead of feeding raw conversations, compress into subject-predicate-object semantic triples to reduce noise and improve retrieval accuracy.
- A 2-layer structure that supplements triples with conversation summaries to capture the 'why' and 'flow' that triples alone miss.
- Gemma-300 embeddings + FAISS + BM25 hybrid retrieval for precise extraction of relevant memories.
- Answer generation with GPT-4.1-mini achieved 1st place (81.95%) among retrieval-based systems on the LoCoMo benchmark.
- Higher accuracy than Zep, LangMem, and Mem0 while using 67% fewer tokens than Zep.
- SDK design wraps existing LLM clients with minimal code changes for easy integration.
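The hybrid retrieval step above can be sketched in plain Python. This is an illustrative reconstruction, not Memori's actual code: it assumes dense similarity scores already computed by an embedding index (FAISS in the paper), implements a plain BM25 scorer over triples rendered as text, and fuses the two with min-max normalization and an assumed weight `alpha`.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Plain BM25 over whitespace-tokenized documents (triples as text)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def hybrid_rank(dense, lexical, alpha=0.5):
    """Fuse normalized dense and BM25 scores; alpha is an assumed weight."""
    fused = [alpha * d + (1 - alpha) * l
             for d, l in zip(min_max(dense), min_max(lexical))]
    return sorted(range(len(fused)), key=lambda i: -fused[i])

# Stored triples rendered as text for lexical matching
triples = [
    "user travel_destination Jeju",
    "user job software engineer",
    "user pet cat",
]
docs = [t.split() for t in triples]
dense = [0.82, 0.10, 0.05]  # stand-in cosine similarities from an embedding index
lexical = bm25_scores("when travel Jeju".split(), docs)
print(hybrid_rank(dense, lexical))  # → [0, 1, 2]: the Jeju triple ranks first
```

Fusing a lexical signal with dense retrieval is what lets exact tokens in a triple (names, places) match even when embedding similarity alone is ambiguous.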
Evidence
- LoCoMo benchmark overall accuracy: Memori 81.95% vs Zep 79.09% vs LangMem 78.05% vs Mem0 62.47%.
- Average tokens per query: Memori 1,294 vs Zep 3,911 vs Full-Context 26,031 (Memori uses only 4.97% of the full conversation).
- Cost comparison: 20x+ cheaper than Full-Context, 67% cheaper than Zep (at $0.001035/query with GPT-4.1-mini).
- Single-hop reasoning hit 87.87%, within 0.66pp of the Full-Context ceiling (88.53%).
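The headline ratios above follow directly from the reported per-query token counts; a quick arithmetic check:

```python
# Reported average tokens per query (numbers from the paper)
memori, zep, full_context = 1294, 3911, 26031

print(round(memori / full_context * 100, 2))   # → 4.97 (% of full context used)
print(round((1 - memori / zep) * 100))         # → 67 (% fewer tokens than Zep)
print(round(full_context / memori, 1))         # → 20.1 (x fewer than Full-Context)
```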
How to Apply
- If your current RAG pipeline embeds raw text chunks, try adding an LLM-based semantic triple extraction preprocessing step before storage — it reduces retrieval noise and token costs.
- If your multi-session chatbot stuffs the entire conversation history into prompts, wrap it with Memori SDK and switch to triple + summary-based retrieval to solve the context length explosion problem.
- When building answer generation prompts, separate 'Memories (timestamped triples)' and 'Summaries (conversation summaries)' so the model can better reason about timelines and change history (see Appendix A prompts).
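The two-part prompt layout described in the last point can be sketched as follows. The section names, triple rendering, and field order here are illustrative assumptions, not Memori's exact Appendix A prompt:

```python
from datetime import date

def build_prompt(triples, summaries, question):
    """Assemble a prompt with timestamped triples and free-text summaries
    in separate sections, so the model can reason about timelines."""
    mem_lines = [
        f"- ({s}, {p}, {o}) [timestamp: {ts}]" for (s, p, o, ts) in triples
    ]
    return "\n".join([
        "Memories:",
        *mem_lines,
        "",
        "Summaries:",
        *[f"- {s}" for s in summaries],
        "",
        f"Question: {question}",
    ])

prompt = build_prompt(
    triples=[("user", "travel_destination", "Jeju Island", date(2024, 1, 15))],
    summaries=["User mentioned a travel plan to Jeju Island this week."],
    question="When did I say I was going to Jeju Island?",
)
print(prompt)
```

Keeping the two sections separate means the model can use timestamps on triples to resolve "when" questions while the summaries supply conversational context the triples omit.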
Code Example
# Memori SDK usage example (conceptual workflow)
# pip install memori-sdk
from memori import MemoriClient
import openai

# Wrap the existing OpenAI client with Memori
client = MemoriClient(
    llm_client=openai.OpenAI(),
    user_id="user_123",
    session_id="session_abc",
)

# Advanced Augmentation runs automatically when a conversation is saved:
# semantic triple extraction + conversation summary generation
client.add_message(role="user", content="I'm going on a trip to Jeju Island this week")
client.add_message(role="assistant", content="Nice! How many days will you be there?")

# On a new query, relevant triples + summaries are retrieved automatically
# and assembled into the prompt
response = client.chat(
    messages=[{"role": "user", "content": "When did I say I was going to Jeju Island?"}]
)

# Internally generated prompt structure:
# Memories: [(user, travel_destination, Jeju Island, timestamp: 2024-01-15)]
# Summaries: "User mentioned a travel plan to Jeju Island this week"
# → Generates an accurate response within ~1,294 tokens
Original Abstract
As large language models (LLMs) evolve into autonomous agents, persistent memory at the API layer is essential for enabling context-aware behavior across LLMs and multi-session interactions. Existing approaches force vendor lock-in and rely on injecting large volumes of raw conversation into prompts, leading to high token costs and degraded performance. We introduce Memori, an LLM-agnostic persistent memory layer that treats memory as a data structuring problem. Its Advanced Augmentation pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. Evaluated on the LoCoMo benchmark, Memori achieves 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (~5% of full context). This results in substantial cost reductions, including 67% fewer tokens than competing approaches and over 20x savings compared to full-context methods. These results show that effective memory in LLM agents depends on structured representations instead of larger context windows, enabling scalable and cost-efficient deployment.