SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
SmolDocling is an ultra-compact (256M parameters) vision-language model designed for end-to-end document conversion. It processes entire pages by generating DocTags, a new universal markup format that captures content, structure, and spatial location of document elements. The model demonstrates robust performance, competing with VLMs up to 27 times larger, while significantly reducing computational requirements. ✨
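For readers who want to try the model, below is a minimal inference sketch using the standard Hugging Face transformers vision-to-sequence interface. The repository id, prompt text, and image path are assumptions for illustration, not details confirmed by the article.

```python
# A minimal inference sketch; repo id, prompt, and image path are assumed.
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

REPO_ID = "ds4sd/SmolDocling-256M-preview"  # assumed checkpoint location

processor = AutoProcessor.from_pretrained(REPO_ID)
model = AutoModelForVision2Seq.from_pretrained(REPO_ID)

# One full page image goes in; a DocTags sequence comes out.
page = Image.open("page_1.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},  # assumed prompt
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=4096)
# Strip the prompt tokens and keep only the newly generated DocTags markup.
doctags = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=False
)[0]
print(doctags)
```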
Article Points:
1. Ultra-compact VLM (256M params) for end-to-end document conversion.
2. Introduces DocTags: a universal markup for content, structure, and location.
3. Outperforms larger VLMs in text recognition, layout, and structure extraction.
4. Contributes novel public datasets for charts, tables, equations, and code.
5. Handles diverse document types beyond the commonly targeted scientific papers.
6. Achieves competitive performance with significantly reduced computational needs.
Core Concept (Ultra-compact VLM)
- 256M parameters
- End-to-end conversion
- Vision-language model

Key Innovation (DocTags Markup)
- Universal markup format
- Content, structure, location
- Optimized for LLMs (see the DocTags sketch after this outline)

Performance (Outperforms larger models)
- Text recognition
- Layout analysis
- Table structure
- Equation recognition
- Code listings

Contributions (New Public Datasets)
- DocLayNet-PT
- SynthChartNet
- SynthCodeNet
- SynthFormulaNet

Document Scope (Diverse Document Types)
- Business documents
- Academic papers
- Technical reports
- Patents, forms

Efficiency (Reduced Computational Needs)
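To make the DocTags idea more concrete, here is an illustrative sketch of the kind of markup the model emits and how its location tokens can be read back as bounding boxes. The specific tag names, example content, and coordinate values are assumptions for illustration; only the general idea of pairing content and structure tags with spatial location tokens comes from the article.

```python
# Illustrative DocTags-style markup; tag names and coordinates are assumed.
import re

sample_doctags = (
    "<doctag>"
    "<section_header><loc_42><loc_36><loc_458><loc_60>1 Introduction</section_header>"
    "<text><loc_42><loc_70><loc_458><loc_180>SmolDocling converts full pages ...</text>"
    "<picture><loc_60><loc_200><loc_440><loc_380>"
    "<caption>Figure 1: Pipeline overview</caption></picture>"
    "</doctag>"
)

# Each element carries its content plus four <loc_*> tokens (x1, y1, x2, y2),
# so downstream tools can recover both reading order and page layout.
for tag, x1, y1, x2, y2 in re.findall(
        r"<(\w+)><loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>", sample_doctags):
    print(f"{tag}: bbox=({x1}, {y1}, {x2}, {y2})")
```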