This paper explores the fundamental theoretical limitations of embedding-based retrieval, demonstrating that the number of retrievable top-k document subsets is inherently limited by the embedding dimension. It introduces the novel LIMIT dataset, revealing that even state-of-the-art models struggle with simple tasks due to these constraints. The work advocates for developing new retrieval methods beyond the current single-vector paradigm to overcome these limitations.
Article Points:
1. Embedding dimension fundamentally limits retrievable top-k document combinations.
2. Theoretical limits are empirically confirmed even with best-case free embedding optimization.
3. The new LIMIT dataset exposes state-of-the-art models' failure on simple tasks.
4. Model performance on complex retrieval tasks is critically dependent on embedding dimension.
5. Existing academic benchmarks often hide these limitations by testing limited query spaces.
6. Alternative architectures like cross-encoders or multi-vector models are needed.
Fundamental Limits
Embedding dimension limits top-k combinations
Bound tied to the sign-rank of the qrel matrix (stated below)
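One compact way to state that connection, with notation assumed here: for a binary qrel matrix A ∈ {0,1}^{m×n} (rows are queries, columns are documents), the smallest embedding dimension d that can realize every query's relevant set through dot-product scores is pinned, up to one, by the sign-rank of the ±1 version of A:

```latex
% A_{ij} = 1 iff document j is relevant to query i; rank_± is the sign-rank,
% i.e. the minimum rank of any real matrix matching the entrywise signs.
\operatorname{rank}_{\pm}\!\left(2A - \mathbf{1}_{m \times n}\right) - 1
  \;\le\; d \;\le\;
\operatorname{rank}_{\pm}\!\left(2A - \mathbf{1}_{m \times n}\right)
```

Sign-rank is hard to compute exactly, which is why the empirical side below relies on direct optimization rather than evaluating the bound.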
Empirical Validation
Free embedding optimization confirms the limits (sketch below)
Critical-n points mark where each dimension d breaks down
A polynomial fit models capacity as a function of dimension
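A minimal sketch of the free-embedding idea, assuming PyTorch and a constant k per query; the function name and loss here are illustrative, not the paper's code. The vectors themselves are trained, with no text encoder, so any failure can only come from the dimension d:

```python
import torch

def free_embedding_test(qrels: torch.Tensor, d: int, steps: int = 2000) -> float:
    """qrels: (m, n) float 0/1 relevance matrix. Returns the fraction of
    queries whose true relevant set is exactly their top-k by dot product."""
    m, n = qrels.shape
    Q = torch.randn(m, d, requires_grad=True)   # free query vectors
    D = torch.randn(n, d, requires_grad=True)   # free document vectors
    opt = torch.optim.Adam([Q, D], lr=0.01)
    for _ in range(steps):
        scores = Q @ D.T                        # (m, n) dot-product scores
        # Margin loss: each query's worst relevant doc should outscore
        # its best irrelevant doc by at least 1.
        rel = scores.masked_fill(qrels == 0, float("inf")).min(dim=1).values
        irr = scores.masked_fill(qrels == 1, float("-inf")).max(dim=1).values
        loss = torch.relu(1.0 - (rel - irr)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        k = int(qrels[0].sum())                 # assumes the same k per query
        topk = (Q @ D.T).topk(k, dim=1).indices
        hits = torch.zeros_like(qrels).scatter_(1, topk, 1.0)
        return (hits == qrels).all(dim=1).float().mean().item()
```

Sweeping the number of documents at fixed d and recording the largest instance solved exactly locates the critical-n point; the capacity curve mentioned above is a polynomial fit through those points.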
LIMIT Dataset
Realistic, simple, yet difficult task
Stress tests theoretical results
High qrel graph density (toy construction below)
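A toy construction mirroring that density, assuming a small fixed number of relevant documents per query and one query per document subset; names are illustrative:

```python
from itertools import combinations
import numpy as np

def limit_style_qrels(n_docs: int, k: int = 2) -> np.ndarray:
    """Binary qrel matrix with one query (row) per k-subset of documents,
    so every k-combination must be a retrievable top-k set."""
    subsets = list(combinations(range(n_docs), k))
    qrels = np.zeros((len(subsets), n_docs), dtype=np.int8)
    for row, docs in enumerate(subsets):
        qrels[row, list(docs)] = 1
    return qrels

print(limit_style_qrels(50).shape)  # (1225, 50): C(50, 2) distinct top-2 targets
```

The queries themselves can stay trivially simple (e.g. attribute lookups); the difficulty comes entirely from how many distinct top-k sets the matrix demands.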
SOTA Model Performance
Models struggle on the LIMIT dataset
Performance depends strongly on embedding dimension (scorer below)
Failure is not explained by domain shift
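For reference, the evaluation itself is lightweight; a minimal recall@k scorer (recall@100 is a common cutoff for such results), where `scores` is assumed to be any single-vector model's query-document dot products:

```python
import numpy as np

def recall_at_k(scores: np.ndarray, qrels: np.ndarray, k: int = 100) -> float:
    """scores, qrels: (num_queries, num_docs). Mean fraction of each query's
    relevant documents that appear in its top-k by score."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # doc ids, best first
    hits = np.take_along_axis(qrels, topk, axis=1)   # relevance of those docs
    return float((hits.sum(axis=1) / qrels.sum(axis=1)).mean())
```

One simple way to see the dimension dependence directly is to truncate the same model's vectors to smaller d (Matryoshka-style) and re-score.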
Implications
Current benchmarks hide limitations
Challenges for instruction-based retrieval
Future Directions
Alternative architectures needed
Cross-encoders, multi-vector, and sparse models (MaxSim sketch below)
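A sketch of one such alternative, ColBERT-style multi-vector scoring (MaxSim): the representation is a set of per-token vectors rather than a single point, so expressiveness is not capped by one d-dimensional embedding:

```python
import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """query_vecs: (Lq, d) per-token query embeddings; doc_vecs: (Ld, d).
    Late-interaction relevance: each query token takes its best document match."""
    sim = query_vecs @ doc_vecs.T          # (Lq, Ld) token-token similarities
    return sim.max(dim=1).values.sum()     # best match per query token, summed
```

Cross-encoders avoid the bound differently, scoring each query-document pair jointly instead of pre-committing documents to fixed vectors; sparse models trade a low dense dimension for very high-dimensional but mostly-zero representations.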