Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Mar 17, 2026 • Sahil Sen, Elias Lumer, Anmol Gulati +1

TL;DR Highlight

A memory framework that structures time-based events from conversation history to answer questions like 'what did I do last month?', reaching 95.6% accuracy on the LongMemEvalS benchmark.

Who Should Read

Backend developers adding long-term conversation memory to chatbots or AI assistants, especially those building personalization features that track 'when users did what.'

Core Mechanics

Instead of just storing conversations, Chronos maintains two structures side by side: an 'event calendar' of <subject, verb, object> tuples with date ranges, and a 'turn calendar' holding the original conversation text
Normalizes time expressions to ISO 8601 datetime ranges — converts vague expressions like 'recently' and 'last month' into actual date ranges for filtering
Dynamic prompting generates retrieval guides per query — analyzes questions like 'what's the most recent camera lens I bought?' and instructs the agent on what and how to search
ReAct-pattern agent iteratively calls two tools (vector search + grep), performing additional searches when evidence is insufficient
Achieved 92.60% on LongMemEvalS benchmark with GPT-4o (Chronos Low), 95.60% with Claude Opus 4.6 (Chronos High)
Removing event calendar drops accuracy by 34.5 points — ablation confirms time structuring is the key performance driver
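
The event calendar and the date-range normalization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Event` field names, the `last_month_range` and `overlaps` helpers, and the second sample event are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One event-calendar entry (field names are illustrative)."""
    subject: str
    verb: str
    obj: str
    start: datetime  # earliest possible time of the event
    end: datetime    # latest possible time of the event


def last_month_range(t_conv: datetime) -> tuple[datetime, datetime]:
    """Resolve 'last month' relative to the conversation timestamp."""
    first_of_this_month = t_conv.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    end = first_of_this_month - timedelta(seconds=1)  # last second of previous month
    start = end.replace(day=1, hour=0, minute=0, second=0)
    return start, end


def overlaps(event: Event, start: datetime, end: datetime) -> bool:
    """True if the event's date range intersects the query range."""
    return event.start <= end and event.end >= start


# Sample data: the second event is invented for illustration.
events = [
    Event("John", "bought", "running shoes",
          datetime(2024, 3, 15), datetime(2024, 3, 15, 23, 59, 59)),
    Event("John", "visited", "Lisbon",
          datetime(2024, 1, 5), datetime(2024, 1, 12)),
]

t_conv = datetime(2024, 4, 10, 15, 30)  # timestamp of the turn asking about "last month"
window = last_month_range(t_conv)
hits = [e for e in events if overlaps(e, *window)]

print(window[0].isoformat(), window[1].isoformat())
print([e.obj for e in hits])
```

Storing a range rather than a single timestamp lets vague expressions stay vague: an event known only to have happened "last month" still matches any query whose window touches that month.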

Evidence

Chronos Low (GPT-4o) 92.60% vs previous best EmergenceMem Internal 86.00%: a 6.60-point absolute gain, i.e. a 7.67% relative improvement
Chronos High (Claude Opus 4.6) 95.60%, +3.02% over the previous best, setting a new state of the art on LongMemEvalS
Ablation: removing the event calendar drops accuracy from 93.1% to 58.6% (34.5pp drop), the single largest component contribution
Multi-session aggregation category: 91.73% — 7.97% relative improvement over 2nd place Honcho
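
The headline figures mix percentage points and relative percentages, so it helps to recompute them. The short check below uses only the scores quoted in this summary; the helper names are illustrative. It suggests that the 7.67% figure is a relative improvement (6.60 points absolute), and that the 34.5-point ablation drop corresponds to the 58.9% gain over baseline mentioned in the abstract.

```python
def absolute_gain(new: float, old: float) -> float:
    """Difference in percentage points."""
    return new - old


def relative_gain(new: float, old: float) -> float:
    """Improvement expressed as a percentage of the old score."""
    return (new - old) / old * 100


# Chronos Low vs. previous best
low_abs = round(absolute_gain(92.60, 86.00), 2)
low_rel = round(relative_gain(92.60, 86.00), 2)

# Ablation: full system vs. system without the event calendar
abl_abs = round(absolute_gain(93.1, 58.6), 2)
abl_rel = round(relative_gain(93.1, 58.6), 1)

print(low_abs, low_rel)  # 6.6 7.67
print(abl_abs, abl_rel)  # 34.5 58.9
```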

How to Apply

When storing conversations, use an LLM to extract <subject, verb, object, start_datetime, end_datetime> tuples into a separate index. Fill in date ranges for expressions like 'last week' or 'recently' by calculating from the conversation timestamp.
Add a 'dynamic prompting' step before query processing — feed the question to a lightweight model like Gemini Flash to generate a guide on 'what info to search for in what time range,' then inject into the agent's system prompt.
Provide agents with both vector search (semantic-based) and grep (exact keyword matching) tools. Letting the agent choose based on context helps recall on exact matches such as specific product names or dates, which semantic search alone can miss.
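
The two retrieval tools can be sketched as below. This is a toy under stated assumptions: the sample turns are invented, and `search_turns` uses cosine similarity over word counts purely as a stand-in for a real embedding-based vector search. The `TOOLS` registry and `call_tool` dispatcher are hypothetical names showing how an agent's tool call might be routed.

```python
import math
import re
from collections import Counter

# Invented sample turns for illustration.
TURNS = [
    "I finally bought the Sigma 35mm f/1.4 lens yesterday.",
    "We talked about hiking trails near Lake Tahoe.",
    "My new camera glass arrived and the photos look sharp.",
]


def grep_turns(pattern: str, turns: list[str]) -> list[str]:
    """Exact, case-insensitive keyword match."""
    rx = re.compile(re.escape(pattern), re.IGNORECASE)
    return [t for t in turns if rx.search(t)]


def _bow(text: str) -> Counter:
    """Bag-of-words vector: token -> count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search_turns(query: str, turns: list[str], k: int = 2) -> list[str]:
    """Toy stand-in for vector search: rank turns by word-count cosine similarity."""
    q = _bow(query)
    return sorted(turns, key=lambda t: _cosine(q, _bow(t)), reverse=True)[:k]


# Hypothetical registry: the agent emits a tool name plus an argument.
TOOLS = {"grep_turns": grep_turns, "search_turns": search_turns}


def call_tool(name: str, argument: str) -> list[str]:
    return TOOLS[name](argument, TURNS)


print(call_tool("grep_turns", "Sigma 35mm"))
print(call_tool("search_turns", "camera lens photos")[0])
```

In this toy data the exact product name is found only by grep, while the loosely worded query ranks the "camera glass" turn first, which is the kind of complementarity the two-tool setup is meant to exploit.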

Code Example

# Event extraction prompt example
EVENT_EXTRACTION_PROMPT = """
Given the conversation turn below (timestamp: {tconv}), extract all temporally-grounded events.
For each event, output JSON with:
- subject: who/what
- verb: action
- object: what was acted upon
- start_datetime: ISO 8601 (earliest possible)
- end_datetime: ISO 8601 (latest possible)
- aliases: 2-4 paraphrases using different vocabulary

Rules:
- 'recently' → compute window relative to {tconv}
- 'last month' → first to last day of previous month from {tconv}
- Only extract events with clear subject+verb+object

Conversation turn:
{turn_text}
"""

# Dynamic prompting example
DYNAMIC_PROMPT_META = """
Analyze this memory query and output 1-5 bullet points describing:
- What specific information to retrieve
- What time ranges to filter by
- How to approach multi-hop reasoning if needed

Query: {user_query}
Current date: {current_date}

Output format:
Pay close attention to the following information (current and past):
• [bullet 1]
• [bullet 2]
...
"""

# Agent tool definitions: two tool types (semantic search, grep), each exposed once per calendar
tools = [
    {"name": "search_events", "description": "Semantic search over event calendar. Use for time-grounded queries."},
    {"name": "search_turns", "description": "Semantic search over raw conversation turns."},
    {"name": "grep_events", "description": "Exact keyword search on event calendar."},
    {"name": "grep_turns", "description": "Exact keyword search on conversation turns."},
]
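
The dynamic prompting template above could be wired into the agent roughly as follows. This is a sketch, not the authors' code: `GUIDE_TEMPLATE` is a shortened stand-in for the full template, `BASE_SYSTEM_PROMPT` and `build_system_prompt` are hypothetical names, and `fake_generate` replaces a real call to a lightweight model with a canned response.

```python
from typing import Callable

# Shortened stand-in for the dynamic prompting template.
GUIDE_TEMPLATE = (
    "Analyze this memory query and list what to retrieve "
    "and which time ranges to filter by.\n"
    "Query: {user_query}\n"
    "Current date: {current_date}\n"
)

# Hypothetical base instructions for the retrieval agent.
BASE_SYSTEM_PROMPT = (
    "You are a memory agent. Use the search and grep tools "
    "to gather evidence before answering."
)


def build_system_prompt(
    user_query: str,
    current_date: str,
    generate: Callable[[str], str],
) -> str:
    """Ask a lightweight model for a retrieval guide, then append it to the system prompt."""
    guide = generate(GUIDE_TEMPLATE.format(user_query=user_query, current_date=current_date))
    return f"{BASE_SYSTEM_PROMPT}\n\n{guide.strip()}"


def fake_generate(prompt: str) -> str:
    """Placeholder for a real model call; returns a canned guide."""
    return (
        "Pay close attention to the following information (current and past):\n"
        "• Purchases of camera lenses\n"
        "• The most recent purchase date before the current date\n"
    )


system_prompt = build_system_prompt(
    "What's the most recent camera lens I bought?", "2024-04-10", fake_generate
)
print(system_prompt)
```

Passing the model call in as a plain callable keeps the guide-generation step easy to swap or stub out in tests.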

Terminology

ReAct: A pattern where the LLM cycles through 'think → call tool → check result → think again' to solve problems. Like how a person searches, reads, and makes judgments about things they don't know.

event calendar: A structured DB storing 'when what happened' from conversations. Records linking dates and actions like 'John bought running shoes on 2024-03-15.'

turn calendar: A DB storing original conversation text organized by session. Paired with the event calendar and used for semantic search.

dynamic prompting: A technique that auto-generates different retrieval guides per question. Creates tailored instructions for the agent like 'this question needs March date filtering.'

multi-hop reasoning: When answering a single question requires multiple steps of search and reasoning. 'What did I do the week after vacation?' = first find vacation dates → then search events for that following week.

ISO 8601: International standard date/time format. In the form '2024-03-15T09:00:00Z,' ensuring all systems understand dates the same way.

cross-encoder reranking: A process where a model re-scores search candidates more precisely. A two-stage approach: fast first search pulls 100 results, then a slower but accurate model narrows to 15.

LongMemEvalS: A benchmark evaluating long-term memory performance across 6 categories including temporal reasoning and multi-session aggregation from months of conversations. 500 questions.
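
The multi-hop example from the terminology above ('What did I do the week after vacation?') can be traced in a few lines. The event rows and helper names here are invented for illustration; a real agent would issue these two lookups as separate tool calls against the event calendar.

```python
from datetime import date, timedelta

# Made-up event calendar rows: (description, start, end)
EVENTS = [
    ("took a vacation in Lisbon", date(2024, 3, 4), date(2024, 3, 10)),
    ("started a pottery class", date(2024, 3, 13), date(2024, 3, 13)),
    ("ran a 10k race", date(2024, 4, 2), date(2024, 4, 2)),
]


def find_events(keyword: str) -> list[tuple[str, date, date]]:
    """Keyword lookup over event descriptions."""
    return [e for e in EVENTS if keyword in e[0]]


def events_between(start: date, end: date) -> list[tuple[str, date, date]]:
    """Events whose date range overlaps [start, end]."""
    return [e for e in EVENTS if e[1] <= end and e[2] >= start]


# Hop 1: find when the vacation happened.
_, vac_start, vac_end = find_events("vacation")[0]

# Hop 2: search the seven days following the vacation's end.
week_start = vac_end + timedelta(days=1)
week_end = vac_end + timedelta(days=7)
answer = [desc for desc, _, _ in events_between(week_start, week_end)]

print(answer)
```

The second query cannot be written until the first has returned, which is why an iterative tool-calling loop suits this kind of question better than a single retrieval pass.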

Original Abstract

Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.