Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents
TL;DR Highlight
A framework that trains LLM agents via reinforcement learning to self-manage long-term and short-term memory through tool calls, without separate memory modules or auxiliary controllers.
Who Should Read
AI agent developers dealing with context explosion and memory management in multi-turn conversations or long-horizon tasks. Developers who have integrated memory systems like LangMem or Mem0 in production but have hit performance ceilings.
Core Mechanics
- LTM (Long-Term Memory) and STM (Short-Term Memory) are directly integrated into the agent policy rather than treated as separate modules — exposed as 6 tools: ADD/UPDATE/DELETE (for LTM) + RETRIEVE/SUMMARY/FILTER (for STM)
- 3-stage progressive RL training: Stage 1 (build LTM while viewing information) → Stage 2 (filter STM amid distracting information) → Stage 3 (integrate LTM retrieval + STM management + final reasoning)
- step-wise GRPO (Group Relative Policy Optimization) addresses the sparse and discontinuous reward problem of memory operations — backpropagating final task outcomes to all intermediate memory decisions
- Reward function designed across 3 dimensions — task completion, context management, and memory quality — improving both convergence speed and memory quality over using simple correctness rewards alone
- After RL training, agents more aggressively use ADD/UPDATE, and FILTER call frequency surges from 0.02 to 0.31 on Qwen2.5-7B — context cleanup happens autonomously without explicit rules
- Outperforms all 4 baselines (LangMem, A-Mem, Mem0, and graph-based Mem0g) across all benchmarks on both Qwen2.5-7B and Qwen3-4B
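To make the step-wise GRPO idea concrete, here is a minimal sketch of how a group-relative advantage could be computed under a three-part reward (task completion, context management, memory quality) and broadcast to every intermediate memory decision in a trajectory. The weights, function names, and normalization details are illustrative assumptions, not the paper's exact formulation.

```python
from statistics import mean, pstdev

def trajectory_reward(task_score, context_score, memory_score,
                      w_task=1.0, w_ctx=0.5, w_mem=0.5):
    """Combine the three reward dimensions (weights are illustrative)."""
    return w_task * task_score + w_ctx * context_score + w_mem * memory_score

def stepwise_grpo_advantages(group_rewards, steps_per_traj):
    """Normalize rewards within a sampled group of trajectories, then
    broadcast each trajectory's advantage to all of its memory-tool steps."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards) or 1.0   # guard against a zero-variance group
    advantages = [(r - mu) / sigma for r in group_rewards]
    # Every intermediate memory decision shares its trajectory's advantage,
    # so sparse final outcomes reach each ADD/FILTER/... step.
    return [[a] * n for a, n in zip(advantages, steps_per_traj)]

rewards = [trajectory_reward(1.0, 0.8, 0.6),   # solved task, decent memory
           trajectory_reward(0.0, 0.9, 0.4),   # failed task, clean context
           trajectory_reward(1.0, 0.3, 0.7)]   # solved task, bloated context
adv = stepwise_grpo_advantages(rewards, steps_per_traj=[4, 3, 5])
```

Group-relative normalization means a trajectory is rewarded only for beating its sampled siblings, which is what makes discrete memory operations trainable without a learned value model.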
Evidence
- Qwen2.5-7B average across 5 benchmarks: 41.96% — +4.82%p over the strongest baseline Mem0 (37.14%), and +49.59% relative over no-memory (28.05%)
- Qwen3-4B average: 54.31% — +8.57%p over A-Mem (45.74%), with RL training alone contributing +8.72%p over AgeMem-noRL
- Memory Quality (MQ) score: 0.605 on Qwen3-4B — +0.018 over runner-up A-Mem (0.587), and +0.190 over the Answer-Only reward strategy (0.415)
- STM tools reduce token usage by 3.1–5.1% when replacing RAG — Qwen3-4B: 2310 tokens (RAG) → 2191 tokens (AgeMem)
How to Apply
- When exposing memory operations as tool calls, implement all 6 interfaces — ADD/UPDATE/DELETE (LTM) + RETRIEVE/SUMMARY/FILTER (STM) — and enforce a <think>→<tool_call>→<answer> structure in the agent system prompt to make memory operations parseable
- In long-conversation systems suffering from context explosion, replace RAG with SUMMARY/FILTER tools — add a preventive-action reward term (e.g., R_preventive) to encourage the agent to call these tools before overflow occurs
- Attaching the tools alone (without RL training) already outperforms the baselines, but fine-tuning with GRPO on training data structured around the paper's 3-stage scenarios (information gathering → distracting information → final query) yields a further +8.72%p on Qwen3-4B
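A sketch of what a preventive-action reward might look like: pay out only when the agent trims its context while usage is high but before the hard limit is hit. The threshold, token limit, and scaling below are assumptions for illustration, not the paper's definition of R_preventive.

```python
def preventive_reward(tokens_before: int, tokens_after: int,
                      limit: int = 8192, threshold: float = 0.8) -> float:
    """Illustrative preventive reward for a SUMMARY/FILTER call.

    Rewards the agent for freeing context when usage exceeds `threshold`
    but the call still lands before the window overflows.
    """
    if tokens_before >= limit:
        return 0.0                       # too late: overflow already happened
    usage = tokens_before / limit
    if usage >= threshold and tokens_after < tokens_before:
        # Scale by the fraction of context the call actually freed.
        return (tokens_before - tokens_after) / tokens_before
    return 0.0
```

With these assumed numbers, a FILTER call that halves a 7000-token context (85% full) earns 0.5, while the same call at 50% usage or after overflow earns nothing — nudging the policy toward early, autonomous cleanup.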
Code Example
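A minimal sketch of the tool-call surface described above: the six memory operations dispatched from a parsed <think>→<tool_call>→<answer> turn. The tool names come from the paper; the stores, handler bodies, and JSON payload shape are assumptions for illustration.

```python
import json
import re

# Hypothetical in-memory backends; the paper does not prescribe storage.
LTM: dict[str, str] = {}   # persistent long-term memory: key -> fact
STM: list[str] = []        # short-term memory: the live context entries

# The six memory operations exposed as tools (handler bodies are sketches).
TOOLS = {
    "ADD":      lambda a: LTM.__setitem__(a["key"], a["value"]),
    "UPDATE":   lambda a: LTM.__setitem__(a["key"], a["value"]),
    "DELETE":   lambda a: LTM.pop(a["key"], None),
    "RETRIEVE": lambda a: [v for k, v in LTM.items() if a["query"] in k],
    # Collapse STM into a single (truncated) summary entry.
    "SUMMARY":  lambda a: STM.__setitem__(slice(None), [" ".join(STM)[:200]]),
    # Drop STM entries matching the distractor pattern.
    "FILTER":   lambda a: STM.__setitem__(
        slice(None), [s for s in STM if a["drop"] not in s]),
}

def handle_turn(model_output: str):
    """Parse one <think> -> <tool_call> -> <answer> turn and dispatch."""
    call = re.search(r"<tool_call>(.*?)</tool_call>", model_output, re.S)
    if call:
        payload = json.loads(call.group(1))     # {"name": ..., "args": ...}
        return ("tool", TOOLS[payload["name"]](payload["args"]))
    answer = re.search(r"<answer>(.*?)</answer>", model_output, re.S)
    return ("answer", answer.group(1).strip() if answer else None)

turn = ('<think>store the user name</think>'
        '<tool_call>{"name": "ADD", '
        '"args": {"key": "user_name", "value": "Ada"}}</tool_call>')
handle_turn(turn)   # writes {"user_name": "Ada"} into LTM
```

Keeping the turn structure rigid is what makes the memory operations parseable for reward assignment during RL training: every tool call is an explicit, loggable action rather than free text.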
Original Abstract
Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent's policy. AgeMem exposes memory operations as tool-based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three-stage progressive reinforcement learning strategy and design a step-wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long-horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory-augmented baselines across multiple LLM backbones, achieving improved task performance, higher-quality long-term memory, and more efficient context usage.