SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
SmolDocling is an ultra-compact (256M parameters) vision-language model designed for end-to-end document conversion. It processes entire pages by generating DocTags, a new universal markup format that captures content, structure, and spatial location of document elements. The model demonstrates robust performance, competing with VLMs up to 27 times larger, while significantly reducing computational requirements. ✨
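As a concrete illustration of the end-to-end flow described above, the sketch below loads the model through the standard Hugging Face transformers vision-to-sequence interface and asks it to emit DocTags for a single page image. This is a minimal sketch under stated assumptions: the checkpoint id, the prompt wording, and the generation length are illustrative choices, not details quoted from the article.

```python
# Minimal sketch: run SmolDocling on one page image and print the DocTags output.
# Assumptions (not from the article): the checkpoint id and the prompt string below;
# adjust both to whatever the released model card specifies.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "ds4sd/SmolDocling-256M-preview"  # assumed checkpoint id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

# One page image in, one DocTags sequence out (end-to-end, no separate OCR pipeline).
page = Image.open("page.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},  # assumed prompt
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=4096)
# Strip the prompt tokens, keep only the newly generated DocTags sequence.
doctags = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=False
)[0]
print(doctags)  # tagged content, structure, and location tokens for the whole page
```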
Article Points:
1. Ultra-compact VLM (256M params) for end-to-end document conversion.
2. Introduces DocTags: a universal markup for content, structure, and location.
3. Outperforms larger VLMs in text recognition, layout, and structure extraction.
4. Contributes novel public datasets for charts, tables, equations, and code.
5. Handles diverse document types beyond the commonly studied scientific papers.
6. Achieves competitive performance with significantly reduced computational needs.
Core Concept (Ultra-compact VLM)
- 256M parameters
- End-to-end conversion
- Vision-language model

Key Innovation (DocTags Markup)
- Universal markup format
- Captures content, structure, and location (sketched below)
- Optimized for LLM consumption

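The article describes what DocTags encodes but does not reproduce the markup itself. The fragment below is a hypothetical sketch of what a tagged page might look like, pairing element tags with location tokens; the tag and token names are illustrative assumptions, not the format's actual vocabulary.

```python
# Hypothetical DocTags-style output for a page with one heading and one paragraph.
# Tag names and <loc_*> coordinate tokens are illustrative assumptions, not the
# official DocTags vocabulary.
example_doctags = (
    "<doctag>"
    "<section_header><loc_40><loc_30><loc_480><loc_55>1. Introduction</section_header>"
    "<text><loc_40><loc_60><loc_480><loc_140>SmolDocling converts full pages ...</text>"
    "</doctag>"
)
print(example_doctags)
```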
Performance (Outperforms larger models)
- Text recognition
- Layout analysis
- Table structure
- Equation recognition
- Code listings

Contributions (New Public Datasets)
- DocLayNet-PT
- SynthChartNet
- SynthCodeNet
- SynthFormulaNet

Document Scope (Diverse Document Types)
- Business documents
- Academic papers
- Technical reports
- Patents, forms

Efficiency (Reduced Computational Needs)
- Low VRAM usage
- Fast conversion time
- Competitive with larger models