Learning Facts at Scale with Active Reading
This paper introduces Active Reading, a framework in which large language models (LLMs) learn and reliably recall facts from a given corpus by generating and applying their own learning strategies. The method yields large improvements in factual recall on expert domains, outperforming vanilla finetuning and other data-augmentation baselines, and it scales to pre-training, producing more factual base models such as Meta WikiExpert-8B.
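As a rough mental model of the approach, the sketch below mocks up the self-augmentation loop in Python: the model first proposes its own learning strategies for a document, then applies them to produce synthetic training data. This is a minimal illustration under our own assumptions; the `generate` placeholder and the prompt wording are hypothetical, not the paper's implementation.

```python
def generate(prompt: str) -> str:
    """Placeholder for any LLM completion call (local model or API)."""
    raise NotImplementedError

def active_reading(document: str, n_strategies: int = 5) -> list[str]:
    # Stage 1: the model proposes its own learning strategies for
    # studying this particular document.
    strategies = generate(
        f"Propose {n_strategies} distinct strategies a diligent student "
        f"would use to learn the facts in this document:\n\n{document}"
    ).splitlines()

    # Stage 2: apply each self-generated strategy to the document,
    # yielding diverse synthetic training text (paraphrases, Q&A,
    # summaries, self-quizzes, ...).
    augmented = []
    for strategy in strategies:
        augmented.append(generate(
            f"Study the document below using this strategy: {strategy}\n\n"
            f"{document}\n\nWrite out the result of applying the strategy."
        ))
    return augmented  # train on this data alongside the original corpus
```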
Article Points:
1. Active Reading framework uses self-generated learning strategies for LLMs.
2. Significantly improves factual recall on expert domains (160-312% relative gains; see the worked example after this list).
3. Meta WikiExpert-8B (8B params) outperforms larger models on factual QA.
4. Active Reading data diversity drives stronger scaling trends and performance.
5. Scaling requires higher learning rates and mixing pre-training data for robust learning.
6. Offers a scalable approach for building more factual base models.
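For intuition on the headline gains in point 2 (using made-up baseline numbers, not figures from the paper): with relative gain defined as (new - old) / old, a baseline scoring 10.0 points on a factual-recall benchmark would reach 10.0 × (1 + 1.60) = 26.0 points at a 160% relative gain, and 10.0 × (1 + 3.12) = 41.2 points at a 312% gain.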
Concept
- Human-inspired learning
- Self-generated strategies
- Reliable fact recall

Methodology
- Two-stage data generation
- Diverse learning strategies
- Task-agnostic & task-specific strategies (illustrated after this list)
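To make the task-agnostic vs. task-specific distinction concrete, here is a hedged sketch contrasting the two as prompt templates. The template wording and the `target_task` parameter are our own illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative (hypothetical) prompt templates; wording is assumed,
# not taken from the paper.
TASK_AGNOSTIC_PROMPT = (
    "Here is a document:\n{document}\n\n"
    "Invent your own strategies for learning everything in it, "
    "then apply them."
)

TASK_SPECIFIC_PROMPT = (
    "Here is a document:\n{document}\n\n"
    "You will later be tested with {target_task}-style questions. "
    "Invent study strategies suited to that format, then apply them."
)

def build_prompt(document: str, target_task: str | None = None) -> str:
    # Task-agnostic: learn the content for its own sake.
    if target_task is None:
        return TASK_AGNOSTIC_PROMPT.format(document=document)
    # Task-specific: steer the strategies toward a known downstream task,
    # e.g. build_prompt(doc, target_task="short-answer factual QA").
    return TASK_SPECIFIC_PROMPT.format(
        document=document, target_task=target_task
    )
```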

Key Findings
- Improved factual recall (160-312% relative)
- Outperforms vanilla finetuning
- Meta WikiExpert-8B excels

Scaling
- Effective at pre-training scale
- Requires a higher learning rate
- Mix in pre-training data (see the sketch after this list)
- Stronger scaling trends
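A minimal sketch of what "higher learning rate plus mixed pre-training data" could look like as a training recipe. All numbers and names below (the learning rate, the mixing fraction, the `mixed_batches` helper) are illustrative assumptions, not the paper's hyperparameters.

```python
import random

# Hypothetical knobs; values are for illustration only.
LEARNING_RATE = 3e-4            # higher than a typical finetuning LR (~1e-5)
ACTIVE_READING_FRACTION = 0.7   # share of Active Reading data per batch

def mixed_batches(active_reading_data, pretraining_data, batch_size=32):
    """Yield batches that mix Active Reading data with generic pre-training
    data, so the model absorbs new facts without losing general skills.
    Assumes both lists are larger than the per-batch sample sizes."""
    k = int(batch_size * ACTIVE_READING_FRACTION)
    while True:
        batch = random.sample(active_reading_data, k)
        batch += random.sample(pretraining_data, batch_size - k)
        random.shuffle(batch)
        yield batch
```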

Analysis
- Increased data diversity (one simple measure is sketched after this list)
- Model size impact
- Gains are not primarily from coverage
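One simple way to put a number on "data diversity" is a distinct-n-gram ratio over the augmented corpus; the metric below is our own illustrative choice, not the analysis method used in the paper.

```python
def distinct_ngram_ratio(texts: list[str], n: int = 3) -> float:
    """Fraction of all n-grams in a corpus that are unique: a crude
    proxy for how varied the augmented training data is."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

# Intuition: more varied paraphrases and questions push this ratio up,
# which is the kind of diversity the summary credits for the gains,
# rather than raw coverage of the source text alone.
```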

Future Directions
- Role of pre-training data
- Parametric memory vs. RAG
- Scaling the training paradigm