Google Titans architecture, helping AI have long-term memory

TL;DR Highlight

Google's new architecture handles 2M token contexts without Transformer's quadratic complexity — using 'Titans' memory blocks instead of attention.

Who Should Read

ML researchers working on long-context models, and engineers building RAG or document processing systems who care about context length scaling.

Core Mechanics

The architecture replaces standard attention with 'Titans' — memory units that store compressed representations of past context and can be retrieved efficiently, avoiding the quadratic cost of full attention over 2M tokens.
Two types of memory: short-term (sliding window attention for recent tokens) and long-term (Titans persistent memory blocks trained to compress and retrieve key information).
The long-term memory is a small neural network that learns to compress context — it's trained end-to-end with the main model, not a fixed external retrieval system.
On benchmarks requiring long-range recall (e.g., needle-in-haystack at 2M tokens), performance is competitive with or better than full attention models at much lower compute cost.
The architecture has implications for very long document understanding, multi-session memory, and any task where full-context attention is currently prohibitively expensive.

Evidence

Google published benchmark results showing the Titans architecture matches or beats standard Transformers on long-context tasks while being significantly more efficient.
HN commenters noted this is in the lineage of SSM/Mamba-style architectures but with a different approach to learned memory — some skepticism about whether benchmark gains hold in practice.
Several researchers flagged that 'compressing' context into fixed-size memory inherently loses information — the question is whether the compression is smart enough for real tasks.
The approach was compared favorably to RAG for some use cases, since the memory is learned end-to-end rather than relying on a separate retrieval index.

How to Apply

If you're hitting context window limits on document processing tasks, watch this architecture closely — it could enable processing entire codebases or document repositories in a single forward pass.
For multi-turn agents that need long memory without growing context costs, Titans-style architectures may be worth experimenting with as they become available in open-source form.
RAG architects should benchmark against learned memory approaches for tasks where chunking and retrieval introduce too much information loss.

Code Example

snippet

# Core logic of Surprise-based memory update in Titans architecture (conceptual code)
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim, memory_lr=0.01):
        super().__init__()
        # Long-term memory = small MLP (key-value associative memory)
        self.memory_mlp = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.SiLU(),
            nn.Linear(dim * 2, dim)
        )
        self.memory_lr = memory_lr  # Memory update rate

    def compute_surprise(self, query, target):
        """Surprise = magnitude of prediction error (gradient norm)"""
        pred = self.memory_mlp(query)
        loss = nn.functional.mse_loss(pred, target)
        grad = torch.autograd.grad(loss, self.memory_mlp.parameters())
        surprise = sum(g.norm() for g in grad)  # More surprising = stronger memorization
        return surprise, loss

    def update_memory(self, key, value, forget_gate):
        """Write surprising information into memory (test-time update)"""
        surprise, loss = self.compute_surprise(key, value)
        # Forgetting Gate: decay old memories + write new information
        effective_lr = self.memory_lr * surprise.item() * forget_gate
        for param in self.memory_mlp.parameters():
            if param.grad is not None:
                param.data -= effective_lr * param.grad

    def recall(self, query):
        """Retrieve information from long-term memory using a query"""
        return self.memory_mlp(query)

# Example usage with MAC (Memory as Context) variant
# memory_output = neural_memory.recall(current_query)
# context = torch.cat([memory_output, short_term_kv_cache], dim=1)
# output = attention(query, context)  # Integration of long-term + short-term memory

Terminology

Titans memoryGoogle's architecture where small learned neural modules compress and store long-range context, replacing the need for full attention over the entire sequence.

Quadratic complexityThe computational cost of standard Transformer attention scales with the square of sequence length — doubling context length quadruples compute.

Sliding window attentionAn attention variant where each token only attends to a fixed-size window of recent tokens, enabling linear cost but losing long-range context.