Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents
TL;DR Highlight
An LLM memory system that compresses conversations into semantic triples, cutting tokens by 95% while maintaining top-tier accuracy.
Who Should Read
Backend/AI developers building AI agents that support multi-turn conversations or long sessions, especially teams looking for context-cost reduction and conversation-history management strategies.
Core Mechanics
- Instead of feeding raw conversations, compress into subject-predicate-object semantic triples to reduce noise and improve retrieval accuracy.
- A 2-layer structure that supplements triples with conversation summaries to capture the 'why' and 'flow' that triples alone miss.
- Gemma-300 embeddings + FAISS + BM25 hybrid retrieval for precise extraction of relevant memories.
- Answer generation with GPT-4.1-mini achieved 1st place (81.95%) among retrieval-based systems on the LoCoMo benchmark.
- Higher accuracy than Zep, LangMem, and Mem0 while using 67% fewer tokens than Zep.
- SDK design wraps existing LLM clients with minimal code changes for easy integration.
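The hybrid retrieval step above can be sketched in plain Python. This is an illustrative reconstruction, not Memori's actual code: it assumes dense similarity scores already computed by an embedding index (FAISS in the paper), implements a plain BM25 scorer over triples rendered as text, and fuses the two with min-max normalization and an assumed weight `alpha`.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Plain BM25 over whitespace-tokenized documents (triples as text)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def hybrid_rank(dense, lexical, alpha=0.5):
    """Fuse normalized dense and BM25 scores; alpha is an assumed weight."""
    fused = [alpha * d + (1 - alpha) * l
             for d, l in zip(min_max(dense), min_max(lexical))]
    return sorted(range(len(fused)), key=lambda i: -fused[i])

# Stored triples rendered as text for lexical matching
triples = [
    "user travel_destination Jeju",
    "user job software engineer",
    "user pet cat",
]
docs = [t.split() for t in triples]
dense = [0.82, 0.10, 0.05]  # stand-in cosine similarities from an embedding index
lexical = bm25_scores("when travel Jeju".split(), docs)
print(hybrid_rank(dense, lexical))  # → [0, 1, 2]: the Jeju triple ranks first
```

Fusing a lexical signal with dense retrieval is what lets exact tokens in a triple (names, places) match even when embedding similarity alone is ambiguous.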
Evidence
- LoCoMo benchmark overall accuracy: Memori 81.95% vs Zep 79.09% vs LangMem 78.05% vs Mem0 62.47%.
- Average tokens per query: Memori 1,294 vs Zep 3,911 vs Full-Context 26,031 (Memori uses only 4.97% of the full conversation).
- Cost comparison: 20x+ cheaper than Full-Context, 67% cheaper than Zep (at $0.001035/query with GPT-4.1-mini).
- Single-hop reasoning hit 87.87%, within 0.66pp of the Full-Context ceiling (88.53%).
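The headline ratios above follow directly from the reported per-query token counts; a quick arithmetic check:

```python
# Reported average tokens per query (numbers from the paper)
memori, zep, full_context = 1294, 3911, 26031

print(round(memori / full_context * 100, 2))   # → 4.97 (% of full context used)
print(round((1 - memori / zep) * 100))         # → 67 (% fewer tokens than Zep)
print(round(full_context / memori, 1))         # → 20.1 (x fewer than Full-Context)
```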
How to Apply
- If your current RAG pipeline embeds raw text chunks, try adding an LLM-based semantic triple extraction preprocessing step before storage — it reduces retrieval noise and token costs.
- If your multi-session chatbot stuffs the entire conversation history into prompts, wrap it with Memori SDK and switch to triple + summary-based retrieval to solve the context length explosion problem.
- When building answer generation prompts, separate 'Memories (timestamped triples)' and 'Summaries (conversation summaries)' so the model can better reason about timelines and change history (see Appendix A prompts).
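The two-part prompt layout described in the last point can be sketched as follows. The section names, triple rendering, and field order here are illustrative assumptions, not Memori's exact Appendix A prompt:

```python
from datetime import date

def build_prompt(triples, summaries, question):
    """Assemble a prompt with timestamped triples and free-text summaries
    in separate sections, so the model can reason about timelines."""
    mem_lines = [
        f"- ({s}, {p}, {o}) [timestamp: {ts}]" for (s, p, o, ts) in triples
    ]
    return "\n".join([
        "Memories:",
        *mem_lines,
        "",
        "Summaries:",
        *[f"- {s}" for s in summaries],
        "",
        f"Question: {question}",
    ])

prompt = build_prompt(
    triples=[("user", "travel_destination", "Jeju Island", date(2024, 1, 15))],
    summaries=["User mentioned a travel plan to Jeju Island this week."],
    question="When did I say I was going to Jeju Island?",
)
print(prompt)
```

Keeping the two sections separate means the model can use timestamps on triples to resolve "when" questions while the summaries supply conversational context the triples omit.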
Code Example
# Memori SDK usage example (conceptual workflow)
# pip install memori-sdk
from memori import MemoriClient
import openai

# Wrap the existing OpenAI client with Memori
client = MemoriClient(
    llm_client=openai.OpenAI(),
    user_id="user_123",
    session_id="session_abc",
)

# Advanced Augmentation runs automatically when a conversation is saved:
# semantic triple extraction + conversation summary generation
client.add_message(role="user", content="I'm going on a trip to Jeju Island this week")
client.add_message(role="assistant", content="Nice! How many days will you be there?")

# On a new query, relevant triples + summaries are retrieved automatically
# and assembled into the prompt
response = client.chat(
    messages=[{"role": "user", "content": "When did I say I was going to Jeju Island?"}]
)

# Internally generated prompt structure:
# Memories: [(user, travel_destination, Jeju Island, timestamp: 2024-01-15)]
# Summaries: "User mentioned a travel plan to Jeju Island this week"
# → Generates an accurate response within ~1,294 tokens
Original Abstract
As large language models (LLMs) evolve into autonomous agents, persistent memory at the API layer is essential for enabling context-aware behavior across LLMs and multi-session interactions. Existing approaches force vendor lock-in and rely on injecting large volumes of raw conversation into prompts, leading to high token costs and degraded performance. We introduce Memori, an LLM-agnostic persistent memory layer that treats memory as a data structuring problem. Its Advanced Augmentation pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. Evaluated on the LoCoMo benchmark, Memori achieves 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (~5% of full context). This results in substantial cost reductions, including 67% fewer tokens than competing approaches and over 20x savings compared to full-context methods. These results show that effective memory in LLM agents depends on structured representations instead of larger context windows, enabling scalable and cost-efficient deployment.