SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
SmolDocling is an ultra-compact (256M parameters) vision-language model designed for end-to-end document conversion. It processes entire pages by generating DocTags, a new universal markup format that captures content, structure, and spatial location of document elements. The model demonstrates robust performance, competing with VLMs up to 27 times larger, while significantly reducing computational requirements. ✨
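As a concrete illustration of the end-to-end flow described above, the sketch below loads the model through the standard Hugging Face transformers vision-to-sequence interface and asks it to emit DocTags for a single page image. This is a minimal sketch under stated assumptions: the checkpoint id, the prompt wording, and the generation length are illustrative choices, not details quoted from the article.

```python
# Minimal sketch: run SmolDocling on one page image and print the DocTags output.
# Assumptions (not from the article): the checkpoint id and the prompt string below;
# adjust both to whatever the released model card specifies.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "ds4sd/SmolDocling-256M-preview"  # assumed checkpoint id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

# One page image in, one DocTags sequence out (end-to-end, no separate OCR pipeline).
page = Image.open("page.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},  # assumed prompt
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=4096)
# Strip the prompt tokens, keep only the newly generated DocTags sequence.
doctags = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=False
)[0]
print(doctags)  # tagged content, structure, and location tokens for the whole page
```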
Article Points:
1. Ultra-compact VLM (256M params) for end-to-end document conversion.
2. Introduces DocTags: a universal markup for content, structure, and location.
3. Outperforms larger VLMs in text recognition, layout, and structure extraction.
4. Contributes novel public datasets for charts, tables, equations, and code.
5. Handles diverse document types beyond the commonly studied scientific papers.
6. Achieves competitive performance with significantly reduced computational needs.
Core Concept (Ultra-compact VLM)
- 256M parameters
- End-to-end conversion
- Vision-language model

Key Innovation (DocTags Markup)
- Universal markup format
- Captures content, structure, and location (sketched below)
- Optimized for LLM consumption

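The article describes what DocTags encodes but does not reproduce the markup itself. The fragment below is a hypothetical sketch of what a tagged page might look like, pairing element tags with location tokens; the tag and token names are illustrative assumptions, not the format's actual vocabulary.

```python
# Hypothetical DocTags-style output for a page with one heading and one paragraph.
# Tag names and <loc_*> coordinate tokens are illustrative assumptions, not the
# official DocTags vocabulary.
example_doctags = (
    "<doctag>"
    "<section_header><loc_40><loc_30><loc_480><loc_55>1. Introduction</section_header>"
    "<text><loc_40><loc_60><loc_480><loc_140>SmolDocling converts full pages ...</text>"
    "</doctag>"
)
print(example_doctags)
```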
Performance (Outperforms larger models)
- Text recognition
- Layout analysis
- Table structure
- Equation recognition
- Code listings

Contributions (New Public Datasets)
- DocLayNet-PT
- SynthChartNet
- SynthCodeNet
- SynthFormulaNet

Document Scope (Diverse Document Types)
- Business documents
- Academic papers
- Technical reports
- Patents, forms

Efficiency (Reduced Computational Needs)
- Low VRAM usage
- Fast conversion time
- Competitive with larger models