Titans: Learning to Memorize at Test Time
This paper introduces Titans, a new family of neural architectures featuring a novel long-term neural memory module. This module learns to memorize historical context at test time, complementing attention's role as short-term memory for current context. Titans demonstrate superior performance over Transformers and modern linear recurrent models, effectively scaling to context windows larger than 2M tokens. ✨
Article Points:
1. Titans introduce a neural long-term memory module.
2. Neural memory learns to memorize historical context at test time.
3. Combines attention (short-term) with neural memory (long-term).
4. Neural memory uses gradient-based surprise, momentum, and forgetting (see the update-rule sketch after this list).
5. Titans outperform Transformers and linear recurrent models.
6. Scales effectively to context windows beyond 2M tokens.
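For point 4, here is a rough sketch of the test-time update described in the paper (notation simplified; $k_t$ and $v_t$ are key/value projections of the input $x_t$, and $\eta_t$, $\theta_t$, $\alpha_t$ are data-dependent gates):

$$
\ell(M_{t-1}; x_t) = \lVert M_{t-1}(k_t) - v_t \rVert_2^2, \qquad
S_t = \eta_t\, S_{t-1} - \theta_t\, \nabla \ell(M_{t-1}; x_t), \qquad
M_t = (1 - \alpha_t)\, M_{t-1} + S_t
$$

The gradient term is the "momentary surprise" (how badly the current memory predicts the new token), $S_t$ accumulates it with momentum $\eta_t$, and $\alpha_t$ adaptively forgets old memory.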
Core Concept
- Neural Long-Term Memory
- Learns at test time (sketched in code after this list)
- Deep non-linear memory
- Surprise metric: gradient
- Momentum & Forgetting
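To make "learns at test time" and "deep non-linear memory" concrete, below is a minimal sketch, not the paper's implementation: the memory is a small MLP whose weights are updated by a gradient step on an associative-recall loss during inference, with weight decay acting as forgetting. The names (DeepMemory, write) are illustrative, and the paper's data-dependent gates and momentum are omitted for brevity.

```python
# Minimal sketch: a deep memory written to by test-time gradient descent.
import torch
import torch.nn as nn

class DeepMemory(nn.Module):
    """Small MLP whose weights serve as the long-term memory."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, k):                  # read: map a key to its stored value
        return self.net(k)

def write(memory, k, v, lr=0.1, forget=0.01):
    """One test-time write: gradient step on ||M(k) - v||^2 plus weight decay."""
    loss = (memory(k) - v).pow(2).sum()
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    with torch.no_grad():
        for p, g in zip(memory.parameters(), grads):
            p.mul_(1 - forget)             # forgetting: decay old memory
            p.sub_(lr * g)                 # surprise step: move toward storing (k, v)

# Usage: memorize a (key, value) pair during inference, then read it back.
mem = DeepMemory(dim=64, hidden=128)
k, v = torch.randn(64), torch.randn(64)
write(mem, k, v)
recalled = mem(k)
```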

Architecture Variants
- Memory as a Context (MAC)
- Memory as a Gate (MAG), sketched after this list
- Memory as a Layer (MAL)
- Persistent Memory
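Of the variants, MAC prepends tokens retrieved from the neural memory, together with the persistent (input-independent, learnable) memory tokens, to the attention context; MAL stacks the neural memory as a layer before attention; MAG mixes a sliding-window attention branch with the memory branch through a gate. The sketch below illustrates the MAG combination under the simplifying assumption of a plain sigmoid gate; the paper's exact gating and normalization may differ, and the module names are illustrative.

```python
# Hedged sketch of a Memory-as-a-Gate (MAG) style block.
import torch
import torch.nn as nn

class MAGBlock(nn.Module):
    def __init__(self, dim, attn: nn.Module, neural_memory: nn.Module):
        super().__init__()
        self.attn = attn                    # short-term: sliding-window attention
        self.neural_memory = neural_memory  # long-term: test-time-updated memory
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                   # x: (batch, seq, dim)
        g = torch.sigmoid(self.gate(x))     # data-dependent mixing gate in (0, 1)
        return g * self.attn(x) + (1 - g) * self.neural_memory(x)
```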

Memory Mechanism
- Short-term: Attention
- Long-term: Neural Memory
- Adaptive forgetting
- Parallelizable training (chunk-wise sketch after this list)
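"Parallelizable training" refers to computing the test-time updates chunk by chunk so that they become matrix multiplications instead of a token-by-token loop. Below is a heavily simplified sketch, assuming a linear memory (read as k @ M) and a single step size and decay per chunk; the paper extends this chunk-wise scheme to deep memories and per-token, data-dependent gates.

```python
# Simplified sketch: within a chunk, per-token gradients are taken with respect
# to the memory state at the chunk start, so one chunk's update is two matmuls.
import torch

def chunk_update(M, K, V, theta=0.1, alpha=0.01):
    """M: (d, d) linear memory; K, V: (chunk_len, d) keys/values of one chunk."""
    grad = K.T @ (K @ M - V)   # sum over the chunk of grads of ||k @ M - v||^2
                               # (the constant factor of 2 is folded into theta)
    return (1 - alpha) * M - theta * grad
```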

Advantages
- Scales to 2M+ context
- Higher accuracy
- Theoretically more expressive
- Fast inference

Experimental Results
- Language Modeling
- Needle in Haystack
- Time Series Forecasting
- DNA Modeling
- Outperforms baselines

Ablation Study
- Weight decay crucial
- Momentum important
- Convolution beneficial
- Persistent memory helps
- Deep memory effective