Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents
TL;DR Highlight
A framework that trains LLM agents via reinforcement learning to self-manage long-term and short-term memory through tool calls, without separate memory modules or auxiliary controllers.
Who Should Read
AI agent developers dealing with context explosion and memory management in multi-turn conversations or long-horizon tasks. Developers who have integrated memory systems like LangMem or Mem0 in production but have hit performance ceilings.
Core Mechanics
- LTM (Long-Term Memory) and STM (Short-Term Memory) are directly integrated into the agent policy rather than treated as separate modules — exposed as 6 tools: ADD/UPDATE/DELETE (for LTM) + RETRIEVE/SUMMARY/FILTER (for STM)
- 3-stage progressive RL training: Stage 1 (build LTM while viewing information) → Stage 2 (filter STM amid distracting information) → Stage 3 (integrate LTM retrieval + STM management + final reasoning)
- step-wise GRPO (Group Relative Policy Optimization) addresses the sparse and discontinuous reward problem of memory operations — backpropagating final task outcomes to all intermediate memory decisions
- Reward function designed across 3 dimensions — task completion, context management, and memory quality — improving both convergence speed and memory quality over using simple correctness rewards alone
- After RL training, agents more aggressively use ADD/UPDATE, and FILTER call frequency surges from 0.02 to 0.31 on Qwen2.5-7B — context cleanup happens autonomously without explicit rules
- Outperforms all 4 baselines (LangMem, A-Mem, Mem0, and graph-based Mem0g) across all benchmarks on both Qwen2.5-7B and Qwen3-4B
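To make the step-wise GRPO idea concrete, here is a minimal sketch of how a group-relative advantage could be computed under a three-part reward (task completion, context management, memory quality) and broadcast to every intermediate memory decision in a trajectory. The weights, function names, and normalization details are illustrative assumptions, not the paper's exact formulation.

```python
from statistics import mean, pstdev

def trajectory_reward(task_score, context_score, memory_score,
                      w_task=1.0, w_ctx=0.5, w_mem=0.5):
    """Combine the three reward dimensions (weights are illustrative)."""
    return w_task * task_score + w_ctx * context_score + w_mem * memory_score

def stepwise_grpo_advantages(group_rewards, steps_per_traj):
    """Normalize rewards within a sampled group of trajectories, then
    broadcast each trajectory's advantage to all of its memory-tool steps."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards) or 1.0   # guard against a zero-variance group
    advantages = [(r - mu) / sigma for r in group_rewards]
    # Every intermediate memory decision shares its trajectory's advantage,
    # so sparse final outcomes reach each ADD/FILTER/... step.
    return [[a] * n for a, n in zip(advantages, steps_per_traj)]

rewards = [trajectory_reward(1.0, 0.8, 0.6),   # solved task, decent memory
           trajectory_reward(0.0, 0.9, 0.4),   # failed task, clean context
           trajectory_reward(1.0, 0.3, 0.7)]   # solved task, bloated context
adv = stepwise_grpo_advantages(rewards, steps_per_traj=[4, 3, 5])
```

Group-relative normalization means a trajectory is rewarded only for beating its sampled siblings, which is what makes discrete memory operations trainable without a learned value model.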
Evidence
- Qwen2.5-7B average across 5 benchmarks: 41.96% — +4.82%p over the strongest baseline Mem0 (37.14%), and +49.59% relative over no-memory (28.05%)
- Qwen3-4B average: 54.31% — +8.57%p over A-Mem (45.74%), with RL training alone contributing +8.72%p over AgeMem-noRL
- Memory Quality (MQ) score: 0.605 on Qwen3-4B — +0.018 over runner-up A-Mem (0.587), and +0.190 over the Answer-Only reward strategy (0.415)
- STM tools reduce token usage by 3.1–5.1% when replacing RAG — Qwen3-4B: 2310 tokens (RAG) → 2191 tokens (AgeMem)
How to Apply
- When exposing memory operations as tool calls, implement all 6 interfaces — ADD/UPDATE/DELETE (LTM) + RETRIEVE/SUMMARY/FILTER (STM) — and enforce a <think>→<tool_call>→<answer> structure in the agent system prompt to make memory operations parseable
- In long-conversation systems suffering from context explosion, replace RAG with SUMMARY/FILTER tools — add a preventive-action reward term (e.g., R_preventive) to encourage the agent to call these tools before overflow occurs
- Attaching the tools alone (without RL training) already outperforms the baselines, but fine-tuning with GRPO on training data structured around the paper's 3-stage scenarios (information gathering → distracting information → final query) yields a further +8.72%p on Qwen3-4B
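A sketch of what a preventive-action reward might look like: pay out only when the agent trims its context while usage is high but before the hard limit is hit. The threshold, token limit, and scaling below are assumptions for illustration, not the paper's definition of R_preventive.

```python
def preventive_reward(tokens_before: int, tokens_after: int,
                      limit: int = 8192, threshold: float = 0.8) -> float:
    """Illustrative preventive reward for a SUMMARY/FILTER call.

    Rewards the agent for freeing context when usage exceeds `threshold`
    but the call still lands before the window overflows.
    """
    if tokens_before >= limit:
        return 0.0                       # too late: overflow already happened
    usage = tokens_before / limit
    if usage >= threshold and tokens_after < tokens_before:
        # Scale by the fraction of context the call actually freed.
        return (tokens_before - tokens_after) / tokens_before
    return 0.0
```

With these assumed numbers, a FILTER call that halves a 7000-token context (85% full) earns 0.5, while the same call at 50% usage or after overflow earns nothing — nudging the policy toward early, autonomous cleanup.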
Code Example
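A minimal sketch of the tool-call surface described above: the six memory operations dispatched from a parsed <think>→<tool_call>→<answer> turn. The tool names come from the paper; the stores, handler bodies, and JSON payload shape are assumptions for illustration.

```python
import json
import re

# Hypothetical in-memory backends; the paper does not prescribe storage.
LTM: dict[str, str] = {}   # persistent long-term memory: key -> fact
STM: list[str] = []        # short-term memory: the live context entries

# The six memory operations exposed as tools (handler bodies are sketches).
TOOLS = {
    "ADD":      lambda a: LTM.__setitem__(a["key"], a["value"]),
    "UPDATE":   lambda a: LTM.__setitem__(a["key"], a["value"]),
    "DELETE":   lambda a: LTM.pop(a["key"], None),
    "RETRIEVE": lambda a: [v for k, v in LTM.items() if a["query"] in k],
    # Collapse STM into a single (truncated) summary entry.
    "SUMMARY":  lambda a: STM.__setitem__(slice(None), [" ".join(STM)[:200]]),
    # Drop STM entries matching the distractor pattern.
    "FILTER":   lambda a: STM.__setitem__(
        slice(None), [s for s in STM if a["drop"] not in s]),
}

def handle_turn(model_output: str):
    """Parse one <think> -> <tool_call> -> <answer> turn and dispatch."""
    call = re.search(r"<tool_call>(.*?)</tool_call>", model_output, re.S)
    if call:
        payload = json.loads(call.group(1))     # {"name": ..., "args": ...}
        return ("tool", TOOLS[payload["name"]](payload["args"]))
    answer = re.search(r"<answer>(.*?)</answer>", model_output, re.S)
    return ("answer", answer.group(1).strip() if answer else None)

turn = ('<think>store the user name</think>'
        '<tool_call>{"name": "ADD", '
        '"args": {"key": "user_name", "value": "Ada"}}</tool_call>')
handle_turn(turn)   # writes {"user_name": "Ada"} into LTM
```

Keeping the turn structure rigid is what makes the memory operations parseable for reward assignment during RL training: every tool call is an explicit, loggable action rather than free text.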
Original Abstract
Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent's policy. AgeMem exposes memory operations as tool-based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three-stage progressive reinforcement learning strategy and design a step-wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long-horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory-augmented baselines across multiple LLM backbones, achieving improved task performance, higher-quality long-term memory, and more efficient context usage.