Google's Titans architecture: giving AI long-term memory
TL;DR Highlight
Google's Titans architecture handles 2M-token contexts without the quadratic cost of full Transformer attention, by pairing short-range attention with a learned long-term memory module.
Who Should Read
ML researchers working on long-context models, and engineers building RAG or document processing systems who care about context length scaling.
Core Mechanics
- Titans does not attend over the full sequence. Instead, a neural long-term memory module stores a compressed representation of past context that can be queried cheaply, avoiding the quadratic cost of full attention over 2M tokens.
- Two types of memory: short-term (sliding-window attention over recent tokens) and long-term (a neural memory module trained to compress and retrieve key information).
- The long-term memory is a small neural network that learns to compress context. It is trained end-to-end with the main model and keeps updating its weights at test time, writing more strongly when an input is surprising, so it is not a fixed external retrieval system.
- On benchmarks requiring long-range recall (e.g., needle-in-haystack at 2M tokens), performance is competitive with or better than full attention models at much lower compute cost.
- The architecture has implications for very long document understanding, multi-session memory, and any task where full-context attention is currently prohibitively expensive.
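To put the scaling argument above in numbers, here is a rough back-of-the-envelope sketch. It counts only query-key score computations under causal masking, and the window size of 4,096 is an assumed value chosen for illustration, not a figure from Google's results.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Causal full attention: token i attends to tokens 0..i."""
    return n_tokens * (n_tokens + 1) // 2


def sliding_window_pairs(n_tokens: int, window: int) -> int:
    """Causal sliding-window attention: each token attends to at most `window` tokens."""
    if n_tokens <= window:
        return full_attention_pairs(n_tokens)
    # The first `window` tokens see a growing prefix; every later token sees exactly `window`.
    return full_attention_pairs(window) + (n_tokens - window) * window


n = 2_000_000  # context length mentioned in the text
w = 4_096      # assumed window size (hypothetical)
full = full_attention_pairs(n)
local = sliding_window_pairs(n, w)
print(f"full attention : {full:.3e} score computations")
print(f"sliding window : {local:.3e} score computations")
print(f"ratio          : {full / local:.0f}x")
```

The count leaves out the long-term memory's own per-token cost, which depends on the size of the memory network rather than on sequence length.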
Evidence
- Google published benchmark results showing the Titans architecture matches or beats standard Transformers on long-context tasks while being significantly more efficient.
- HN commenters noted this is in the lineage of SSM/Mamba-style architectures but with a different approach to learned memory — some skepticism about whether benchmark gains hold in practice.
- Several researchers flagged that 'compressing' context into fixed-size memory inherently loses information — the question is whether the compression is smart enough for real tasks.
- The approach was compared favorably to RAG for some use cases, since the memory is learned end-to-end rather than relying on a separate retrieval index.
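The information-loss concern can be illustrated with a toy experiment. The sketch below uses a plain linear associative memory (a sum of outer products), which is a much simpler stand-in than the learned memory discussed here; the dimension, seed, and pair counts are arbitrary assumptions. It shows only the general effect: a fixed-size store recalls less accurately as more items are written into it.

```python
import math
import random


def unit_vector(rng: random.Random, dim: int) -> list[float]:
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_recall_error(n_pairs: int, dim: int = 16, seed: int = 0) -> float:
    """Store n_pairs key/value pairs in one dim x dim matrix, then read them all back."""
    rng = random.Random(seed)
    keys = [unit_vector(rng, dim) for _ in range(n_pairs)]
    values = [unit_vector(rng, dim) for _ in range(n_pairs)]

    # Write: M += outer(value, key). The memory never grows beyond dim * dim numbers.
    memory = [[0.0] * dim for _ in range(dim)]
    for k, v in zip(keys, values):
        for i in range(dim):
            for j in range(dim):
                memory[i][j] += v[i] * k[j]

    # Read: M @ key returns the stored value plus crosstalk from every other pair.
    total = 0.0
    for k, v in zip(keys, values):
        recalled = [sum(memory[i][j] * k[j] for j in range(dim)) for i in range(dim)]
        total += math.sqrt(sum((r - x) ** 2 for r, x in zip(recalled, v)))
    return total / n_pairs


for n in (1, 4, 16, 64):
    print(f"{n:>3} pairs -> mean recall error {mean_recall_error(n):.3f}")
```

A learned, gated memory can choose what to keep and what to overwrite, which is exactly the part this toy omits and the part the skeptics want tested on real tasks.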
How to Apply
- If you're hitting context window limits on document processing tasks, watch this architecture closely — it could enable processing entire codebases or document repositories in a single forward pass.
- For multi-turn agents that need long memory without growing context costs, Titans-style architectures may be worth experimenting with as they become available in open-source form.
- RAG architects should benchmark against learned memory approaches for tasks where chunking and retrieval introduce too much information loss.
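One way to start the kind of benchmarking suggested above is a small needle-in-a-haystack harness that treats each system as a black box. Everything in this sketch is hypothetical scaffolding: `answer_fn` stands for whatever pipeline is under test (a RAG stack, a long-context model, or a learned-memory model), and `truncating_baseline` is only a stand-in that mimics a short context window.

```python
import random
from typing import Callable


def build_haystack(needle: str, n_filler: int, position: float, seed: int = 0) -> str:
    """Place `needle` at a relative position (0.0 = start, 1.0 = end) among filler sentences."""
    rng = random.Random(seed)
    filler = [f"Log entry {i}: value {rng.randint(0, 9999)}." for i in range(n_filler)]
    index = int(position * n_filler)
    return " ".join(filler[:index] + [needle] + filler[index:])


def recall_rate(answer_fn: Callable[[str, str], str], secret: str,
                n_filler: int, positions: list[float]) -> float:
    """Fraction of needle positions at which the system's answer contains the secret."""
    needle = f"The access code is {secret}."
    question = "What is the access code?"
    hits = 0
    for pos in positions:
        document = build_haystack(needle, n_filler, pos)
        if secret in answer_fn(document, question):
            hits += 1
    return hits / len(positions)


def truncating_baseline(document: str, question: str, keep_chars: int = 2000) -> str:
    """Stand-in system that only sees the tail of the document."""
    return document[-keep_chars:]


positions = [0.0, 0.25, 0.5, 0.75, 1.0]
print(recall_rate(truncating_baseline, "7Q4-ZULU", n_filler=1000, positions=positions))
```

Swapping `truncating_baseline` for real systems and sweeping `n_filler` gives a recall-versus-length curve per system, which is a more useful comparison than a single headline number.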
Code Example
# Core logic of the surprise-based memory update in the Titans architecture (conceptual code)
import torch
import torch.nn as nn


class NeuralMemory(nn.Module):
    def __init__(self, dim, memory_lr=0.01):
        super().__init__()
        # Long-term memory = small MLP (key-value associative memory)
        self.memory_mlp = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.SiLU(),
            nn.Linear(dim * 2, dim),
        )
        self.memory_lr = memory_lr  # Memory update rate

    def compute_surprise(self, query, target):
        """Surprise = magnitude of prediction error (gradient norm)"""
        pred = self.memory_mlp(query)
        loss = nn.functional.mse_loss(pred, target)
        # autograd.grad returns the gradients directly; it does not populate param.grad
        grads = torch.autograd.grad(loss, list(self.memory_mlp.parameters()))
        surprise = sum(g.norm() for g in grads)  # More surprising = stronger memorization
        return surprise, loss, grads

    def update_memory(self, key, value, forget_gate):
        """Write surprising information into memory (test-time update)"""
        surprise, loss, grads = self.compute_surprise(key, value)
        effective_lr = self.memory_lr * surprise.item()
        # Forgetting gate in [0, 1]: decay old memories, then write new information
        with torch.no_grad():
            for param, grad in zip(self.memory_mlp.parameters(), grads):
                param.mul_(1.0 - forget_gate).sub_(effective_lr * grad)
        return loss.item()

    def recall(self, query):
        """Retrieve information from long-term memory using a query"""
        return self.memory_mlp(query)


# Example usage with the MAC (Memory as Context) variant
# memory_output = neural_memory.recall(current_query)
# context = torch.cat([memory_output, short_term_kv_cache], dim=1)
# output = attention(query, context)  # Integration of long-term + short-term memory
Related Papers
1-Bit Bonsai Image 4B Image Generation for Local Devices
A 4B-parameter image generation model whose weights are aggressively compressed to 1-bit/ternary values so that it runs even on an iPhone. The 7.75GB diffusion transformer is reduced to 0.93GB.
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
An educational LLM inference engine project for learning by implementing vLLM's core features directly in C++ and CUDA; source code and step-by-step lectures are provided together.
Real-time LLM Inference on Standard GPUs: 3k tokens/s per request
Kog AI has released an LLM inference engine that reaches 3,000 tokens/s per request on 8× AMD MI300X, resolving the bottlenecks of existing software stacks by maximizing use of GPU memory bandwidth.
A sleep-like consolidation mechanism for LLMs
A paper proposing a new mechanism that, like human sleep, periodically compresses and stores context into fast weights, to address the attention cost LLMs incur when processing long contexts.
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
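Introduces CODA, a kernel abstraction that addresses the memory bottleneck of Transformer training on GPUs by running small operations such as normalization and activation while the GEMM output is still on chip. The fact that LLMs can use this abstraction to generate high-performance kernels automatically is drawing particular attention.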
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
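Using the KV cache as an accumulator across chunks, with no model modification, makes it possible to retrieve information with 100% accuracy at up to 128K tokens.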