On the Theoretical Limitations of Embedding-Based Retrieval
This paper explores a fundamental theoretical limitation of embedding-based retrieval: the number of top-k document subsets a single-vector model can return is bounded by its embedding dimension. It introduces the LIMIT dataset to stress-test this result, showing that even state-of-the-art models struggle on a deliberately simple task because of this constraint. The work argues for retrieval methods beyond the current single-vector paradigm.
Article Points:
1. Embedding dimension fundamentally limits retrievable top-k document combinations.
2. Theoretical limits are empirically confirmed even with best-case free embedding optimization.
3. The new LIMIT dataset exposes state-of-the-art models' failure on simple tasks.
4. Model performance on complex retrieval tasks is critically dependent on embedding dimension.
5. Existing academic benchmarks often hide these limitations by testing limited query spaces.
6. Alternative architectures such as cross-encoders or multi-vector models are needed.
Fundamental Limits
- The embedding dimension d bounds which top-k document combinations a single-vector retriever can return
- The bound is tied to the sign-rank of the qrel (query-relevance) matrix; see the restatement below
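
To make the connection concrete, here is a compact restatement of the quantity involved, with notation chosen for illustration (the paper's exact formulation and constants may differ):

```latex
% A is the binary qrel matrix: A_{ij} = 1 iff document j is relevant to query i.
% The sign-rank of M = 2A - 1 is the smallest dimension in which the sign
% pattern of M can be realized by dot products:
\[
  \operatorname{rank}_{\pm}(M) \;=\; \min\left\{ d \;:\;
     \exists\, U \in \mathbb{R}^{m \times d},\ V \in \mathbb{R}^{n \times d},\
     \operatorname{sign}\!\left(U V^{\top}\right) = M \right\},
  \qquad M = 2A - \mathbf{1}.
\]
% The embedding dimension needed to retrieve every top-k set encoded in A is
% then bounded in terms of rank_{+/-}(2A - 1), so a fixed d can only realize
% qrel patterns whose sign-rank is correspondingly small.
```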

Empirical Validation
- "Free embedding" optimization, which trains unconstrained vectors directly on the test qrels, confirms the limits even in this best case (see the sketch below)
- Critical-n points mark the collection size at which a given dimension breaks
- A polynomial relationship in the embedding dimension models retrieval capacity
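
A minimal sketch of that best-case experiment, assuming PyTorch, a random qrel matrix, and a simple margin loss; the function name, loss, and hyperparameters are illustrative rather than the paper's exact protocol:

```python
import torch

def fit_free_embeddings(qrels: torch.Tensor, d: int, steps: int = 2000, lr: float = 0.05) -> float:
    """Optimize unconstrained query/doc vectors to reproduce a binary qrel matrix.

    qrels: (m, n) with qrels[i, j] = 1 iff doc j is relevant to query i.
    Assumes every row has at least one positive and one negative.
    Returns the fraction of queries whose exact top-k set is recovered.
    """
    m, n = qrels.shape
    Q = torch.randn(m, d, requires_grad=True)   # free query embeddings
    D = torch.randn(n, d, requires_grad=True)   # free document embeddings
    opt = torch.optim.Adam([Q, D], lr=lr)
    rel = qrels.bool()
    for _ in range(steps):
        scores = Q @ D.T                        # (m, n) dot-product scores
        min_pos = scores.masked_fill(~rel, float("inf")).amin(dim=1)
        max_neg = scores.masked_fill(rel, float("-inf")).amax(dim=1)
        loss = torch.relu(1.0 + max_neg - min_pos).mean()  # margin loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        scores = Q @ D.T
        min_pos = scores.masked_fill(~rel, float("inf")).amin(dim=1)
        max_neg = scores.masked_fill(rel, float("-inf")).amax(dim=1)
        # A query is solved iff its worst positive outranks its best negative.
        return (min_pos > max_neg).float().mean().item()

torch.manual_seed(0)
qrels = (torch.rand(40, 10) < 0.2).float()
keep = (qrels.sum(1) > 0) & (qrels.sum(1) < 10)  # need both positives and negatives
print(fit_free_embeddings(qrels[keep], d=4))
```

Sweeping the number of documents upward at a fixed d and recording where this accuracy first falls below 100% gives the critical-n point for that dimension; the paper models the resulting capacity curve with a polynomial in d.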

LIMIT Dataset
- A realistic and extremely simple task that is nonetheless difficult for embedders
- Designed to stress-test the theoretical results
- Uses a high-density qrel graph, covering all top-k document combinations (see the toy construction below)
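
A toy version of that construction, assuming k = 2 relevant documents per query; the actual dataset is far larger and phrases queries and documents in natural language (e.g., preference statements about people):

```python
from itertools import combinations

n_docs, k = 6, 2                                 # toy sizes; LIMIT is far larger
docs = [f"doc_{j}" for j in range(n_docs)]

# One query per k-subset of documents, so *every* top-k combination occurs:
# this is what makes the qrel graph maximally dense.
subsets = list(combinations(range(n_docs), k))
qrels = [[int(j in s) for j in range(n_docs)] for s in subsets]

print(len(subsets), "queries over", n_docs, "docs")  # C(6, 2) = 15
for s, row in zip(subsets[:3], qrels[:3]):
    print(s, row)
```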

SOTA Model Performance
- State-of-the-art embedding models struggle on LIMIT despite its simplicity
- Performance depends strongly on embedding dimension (see the probe sketched below)
- The failure is not explained by domain shift; the task remains trivially readable
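
One simple way to probe that dependence, sketched in NumPy with random embeddings standing in for a real model's outputs; truncating vectors to their leading dimensions is an illustrative assumption here, not necessarily how each reported model exposes smaller dimensions:

```python
import numpy as np

def recall_at_k(q_emb, d_emb, qrels, k=100, dim=None):
    """Mean recall@k; optionally truncate embeddings to their first `dim` dims."""
    if dim is not None:
        q_emb, d_emb = q_emb[:, :dim], d_emb[:, :dim]
    scores = q_emb @ d_emb.T                     # (queries, docs)
    topk = np.argsort(-scores, axis=1)[:, :k]    # indices of top-k docs per query
    hits = np.take_along_axis(qrels, topk, axis=1).sum(axis=1)
    return float((hits / qrels.sum(axis=1)).mean())

# Toy usage: score the same qrels at several embedding widths.
rng = np.random.default_rng(0)
q, d = rng.standard_normal((20, 64)), rng.standard_normal((500, 64))
qrels = (rng.random((20, 500)) < 0.01).astype(float)
qrels[:, 0] = 1.0                                # ensure every query has a positive
for dim in (8, 16, 32, 64):
    print(dim, recall_at_k(q, d, qrels, k=100, dim=dim))
```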

Implications
- Current academic benchmarks hide these limitations by testing only a small slice of the possible query space
- The limits are especially pressing for instruction-based retrieval, where queries can demand arbitrary combinations of relevant documents

Future Directions
- Alternative architectures are needed to move beyond single-vector embeddings
- Candidates include cross-encoders, multi-vector models, and sparse retrievers (a multi-vector scoring sketch follows)
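
As one concrete alternative, a minimal sketch of multi-vector ("MaxSim", ColBERT-style) late-interaction scoring in NumPy, with random token embeddings as stand-ins; real systems add learned encoders, normalization, and pruning:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: each query token matches its best
    document token, and the per-token maxima are summed. The document is
    represented by many vectors instead of one pooled embedding, sidestepping
    the single-vector capacity bound at the cost of storage and compute."""
    sims = query_vecs @ doc_vecs.T               # (q_tokens, d_tokens) similarities
    return float(sims.max(axis=1).sum())

# Toy usage with random token embeddings:
rng = np.random.default_rng(0)
query = rng.standard_normal((4, 32))             # 4 query tokens
doc = rng.standard_normal((50, 32))              # 50 document tokens
print(maxsim_score(query, doc))
```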