Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents

Mar 20, 2026 • Luiz C. Borro, Luiz A. B. Macarini, Gordon Tindall +2

TL;DR Highlight

An LLM memory system that compresses conversations into semantic triples, cutting token usage by about 95% while keeping accuracy at a top-tier level.

Who Should Read

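Backend and AI developers building agents that support multi-turn conversations or long-running sessions, especially teams weighing strategies for reducing context cost and managing conversation history.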

Core Mechanics

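Instead of inserting raw conversation text into the prompt, Memori compresses it into semantic triples of the form subject-predicate-object, which reduces noise and improves retrieval accuracy.
Because triples alone miss the 'why' and the 'flow' of a conversation, a conversation summary is stored alongside them, giving a two-layer structure.
Hybrid retrieval with Gemma-300 embeddings, FAISS, and BM25 pulls out only the relevant memories.
Answers are generated with GPT-4.1-mini; on the LoCoMo benchmark Memori ranks first among retrieval-based systems (81.95%).
It stays more accurate than Zep, LangMem, and Mem0 while using 67% fewer tokens than Zep.
It ships as an SDK that wraps an existing LLM client, so it can be added with minimal code changes.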
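The hybrid retrieval step above combines a dense (embedding) ranking with a BM25 keyword ranking. The summary does not say how the two result lists are merged, so the sketch below uses reciprocal rank fusion, which is one common choice and purely an assumption here. The function name `rrf_fuse`, the constant `k=60`, and the memory ids are all hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists with reciprocal rank fusion (RRF).

    Each list contributes 1 / (k + rank) for every id it contains, so ids
    ranked highly by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda m: (-scores[m], m))


# Hypothetical ranked outputs from the two retrievers (best match first).
dense_hits = ["t3", "t1", "t2"]   # e.g. nearest neighbours from a vector index
bm25_hits = ["t1", "t4", "t3"]    # e.g. top keyword matches

fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)
```

Rank-based fusion avoids having to put cosine similarities and BM25 scores on a common scale, which is why it is often a reasonable first choice for hybrid search.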

Evidence

Overall accuracy on the LoCoMo benchmark: Memori 81.95% vs. Zep 79.09% vs. LangMem 78.05% vs. Mem0 62.47%
Average tokens per query: Memori 1,294 vs. Zep 3,911 vs. Full-Context 26,031 (Memori uses only 4.97% of the full conversation)
Cost: more than 20x cheaper than Full-Context and 67% cheaper than Zep ($0.001035 per query with GPT-4.1-mini)
Single-hop reasoning: 87.87%, within 0.66 percentage points of the Full-Context ceiling (88.53%)
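The percentages above follow directly from the reported token counts. As a quick check, the short script below recomputes them. It assumes cost scales linearly with prompt tokens, which is why the token ratio and the cost ratio coincide; the variable names are illustrative.

```python
# Reported average prompt tokens per query.
memori_tokens, zep_tokens, full_context_tokens = 1294, 3911, 26031

# Share of the full conversation that Memori places in the prompt.
share_of_full = memori_tokens / full_context_tokens * 100

# Token reduction relative to Zep.
saving_vs_zep = (1 - memori_tokens / zep_tokens) * 100

# How many times larger the Full-Context prompt is.
full_to_memori = full_context_tokens / memori_tokens

# Gap between Memori and the Full-Context ceiling on single-hop questions.
single_hop_gap = 88.53 - 87.87

print(f"share of full context: {share_of_full:.2f}%")
print(f"saving vs. Zep:        {saving_vs_zep:.1f}%")
print(f"Full-Context / Memori: {full_to_memori:.1f}x")
print(f"single-hop gap:        {single_hop_gap:.2f} pp")
```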

How to Apply

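If your existing RAG pipeline embeds raw text chunks whole, try extracting semantic triples with an LLM during preprocessing and storing those instead; this can reduce both retrieval noise and token cost.
If a multi-session chatbot puts the entire conversation history into the prompt, wrap the client with the Memori SDK and switch to triple + summary retrieval; this can keep context length from growing without bound.
Placing 'Memories' (triples with timestamps) and 'Summaries' (conversation summaries) in separate sections of the answer-generation prompt helps the model reason about temporal order and change history (see the prompt in Appendix A).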
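The last tip can be sketched as a small prompt builder. The Appendix A prompt is not reproduced in this summary, so the section labels, line format, and the `Triple` and `build_prompt` names below are assumptions made for illustration, not the paper's actual template.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    timestamp: date  # when the fact was stated in the conversation


def build_prompt(triples: list[Triple], summaries: list[str], question: str) -> str:
    """Render memories and summaries as two separate prompt sections."""
    # Oldest first, so later lines read as updates to earlier facts.
    ordered = sorted(triples, key=lambda t: t.timestamp)
    memory_lines = [
        f"- [{t.timestamp.isoformat()}] ({t.subject}, {t.predicate}, {t.obj})"
        for t in ordered
    ]
    summary_lines = [f"- {s}" for s in summaries]
    return "\n".join(
        ["Memories:", *memory_lines, "", "Summaries:", *summary_lines, "",
         f"Question: {question}"]
    )


# Made-up example data: the same predicate changes value over time.
triples = [
    Triple("user", "lives_in", "Busan", date(2024, 3, 2)),
    Triple("user", "lives_in", "Seoul", date(2024, 1, 15)),
]
prompt = build_prompt(
    triples,
    ["The user moved from Seoul to Busan."],
    "Where does the user live now?",
)
print(prompt)
```

Keeping the timestamp on every triple lets the model see that the Busan fact supersedes the Seoul one, while the summary supplies the reason for the change.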

Code Example

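# Memori SDK usage example (conceptual workflow).
# NOTE: the package name, MemoriClient, add_message(), and chat() are
# illustrative and have not been verified against the published SDK.
# pip install memori-sdk

import openai
from memori import MemoriClient

# Wrap the existing OpenAI client with Memori
client = MemoriClient(
    llm_client=openai.OpenAI(),
    user_id="user_123",
    session_id="session_abc",
)

# Storing messages automatically runs Advanced Augmentation
# -> semantic triple extraction + conversation summary generation
client.add_message(role="user", content="I'm going on a trip to Jeju Island this week")
client.add_message(role="assistant", content="Sounds great! How many days are you going for?")

# On a new query, relevant triples + summaries are retrieved and placed in the prompt
response = client.chat(
    messages=[{"role": "user", "content": "When did I say I was going to Jeju Island?"}]
)
# Prompt structure built internally:
# Memories: [(user, travel_destination, Jeju Island, timestamp: 2024-01-15)]
# Summaries: "The user mentioned plans to travel to Jeju Island this week"
# -> an accurate answer from roughly 1,294 tokens (the reported per-query average)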

Terminology

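semantic triple: The smallest unit of knowledge, split into 'subject-predicate-object' form. Example: (user, lives_in, Seoul). Much smaller than a sentence and easier to search.

LoCoMo benchmark: An AI evaluation dataset that tests the ability to remember and reason over information in long conversations spanning multiple sessions.

context rot: The phenomenon where a model starts missing important content once the prompt holds too much information, much like failing to find the key point in an overly thick book.

FAISS: A fast vector search library built by Facebook. It quickly finds similar items among millions of embeddings.

BM25: A traditional retrieval algorithm based on keyword frequency. It is often paired with embedding search in hybrid retrieval, which tends to improve accuracy.

LLM-as-a-Judge: An approach in which another LLM, rather than a human, evaluates a model's answers. Used to automate large-scale evaluation.

persistent memory: Memory that survives app restarts and session changes. A typical LLM conversation loses its memory when the session ends.

Full-Context: An approach that puts the entire conversation history into the prompt as-is. Accuracy is high, but token cost grows rapidly.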
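To make the BM25 entry above concrete, here is a minimal sketch of the standard Okapi BM25 scoring formula over pre-tokenized documents. It is a textbook version with commonly used default parameters (`k1=1.5`, `b=0.75`), not Memori's implementation; the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25.

    Assumes `docs` is non-empty and contains at least one token overall.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Toy "memories" flattened into tokens.
docs = [
    ["user", "lives_in", "seoul"],
    ["user", "travels_to", "jeju"],
    ["user", "likes", "coffee"],
]
scores = bm25_scores(["jeju", "travels_to"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Exact keyword matches such as place names are where BM25 tends to complement embedding search, since rare terms receive a high weight.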

Original Abstract

As large language models (LLMs) evolve into autonomous agents, persistent memory at the API layer is essential for enabling context-aware behavior across LLMs and multi-session interactions. Existing approaches force vendor lock-in and rely on injecting large volumes of raw conversation into prompts, leading to high token costs and degraded performance. We introduce Memori, an LLM-agnostic persistent memory layer that treats memory as a data structuring problem. Its Advanced Augmentation pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. Evaluated on the LoCoMo benchmark, Memori achieves 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (~5% of full context). This results in substantial cost reductions, including 67% fewer tokens than competing approaches and over 20x savings compared to full-context methods. These results show that effective memory in LLM agents depends on structured representations instead of larger context windows, enabling scalable and cost-efficient deployment.