COMMONFORMS: A Large, Diverse Dataset for Form Field Detection

This paper introduces CommonForms, a web-scale dataset for form field detection, and frames the task as object detection. The dataset comprises approximately 55k documents and over 450k pages, filtered from Common Crawl, and spans a diverse mix of languages and domains. The authors also present FFDNet-Small and FFDNet-Large, open-source models that achieve high average precision and outperform commercial PDF readers at detecting text, choice, and signature fields.

Article Points:

CommonForms: Large, diverse dataset for form field detection from Common Crawl.

FFDNet models: Open-source, high-resolution detectors outperforming commercial tools.

Form field detection: Framed as object detection for Text Input, Choice Button, Signature.

Dataset filtering: Rigorous process improves data efficiency and annotation quality.

High-resolution inputs: Essential for accurate form field detection performance.

FFDNet advantage: Predicts checkboxes, unlike popular commercial PDF readers.
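
The object-detection framing in the points above can be made concrete with a small sketch. The names `FieldType`, `FieldDetection`, and `iou` are hypothetical and chosen for illustration; they are not taken from the CommonForms or FFDNet code. Only the three class names come from the summary. Intersection over union (IoU) is the standard overlap measure used to match predicted boxes to ground truth when computing average precision.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    # The three field classes named in the summary.
    TEXT_INPUT = "Text Input"
    CHOICE_BUTTON = "Choice Button"
    SIGNATURE = "Signature"


@dataclass(frozen=True)
class FieldDetection:
    # Hypothetical record: one detected field on a page image,
    # with an axis-aligned box in pixel coordinates.
    field_type: FieldType
    x0: float
    y0: float
    x1: float
    y1: float
    score: float = 1.0

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def iou(a: FieldDetection, b: FieldDetection) -> float:
    """Intersection over union of two boxes; 0.0 if they do not overlap."""
    inter_w = min(a.x1, b.x1) - max(a.x0, b.x0)
    inter_h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0
```

A prediction would typically count as correct only if its `field_type` matches the ground-truth field and the IoU exceeds a chosen threshold; the threshold used in the paper is not stated in this summary.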

Dataset (CommonForms)

Source: Common Crawl PDFs

Size: 59k docs, 480k pages

Diversity: Multi-language, multi-domain

Preparation: Rigorous filtering pipeline

Models (FFDNet)

Types: FFDNet-Small, FFDNet-Large

Architecture: YOLO11-based object detectors

Training: High-resolution 1216px, cost < $500

Performance: High AP, outperforms Adobe Acrobat

Task (Form Field Detection)

Definition: Detect location & type

Problem Type: Object detection

Goal: Prepare fillable forms from flat PDFs

Key Insights

Quantity has quality: Leverage existing fillable forms

High-resolution inputs: Crucial for accuracy

Filtering: Improves data efficiency

Contributions

Release CommonForms dataset

Release FFDNet models & code

Dataset analysis: Language, domain

Comparison: Outperforms commercial systems
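
A little arithmetic suggests why high-resolution inputs matter for small fields such as checkboxes. The sketch below assumes that the longer side of the page is resized to the input resolution, which this summary does not confirm. It also uses a US Letter page and a 10 pt checkbox as illustrative numbers. The helper names `scale_to_long_side` and `scale_box` are hypothetical.

```python
def scale_to_long_side(width: float, height: float, target: int = 1216) -> float:
    """Return the factor that maps the longer page side to `target` pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("page dimensions must be positive")
    return target / max(width, height)


def scale_box(box, factor: float):
    """Scale an (x0, y0, x1, y1) box by a uniform factor."""
    x0, y0, x1, y1 = box
    return (x0 * factor, y0 * factor, x1 * factor, y1 * factor)


# Illustrative numbers only: a US Letter page in PDF points
# and a 10 pt square checkbox.
page_w, page_h = 612.0, 792.0
checkbox = (100.0, 100.0, 110.0, 110.0)

for target in (640, 1216):
    factor = scale_to_long_side(page_w, page_h, target)
    x0, y0, x1, y1 = scale_box(checkbox, factor)
    print(f"{target}px input: checkbox is {x1 - x0:.1f}px wide")
```

Under these assumptions the checkbox spans roughly 8 pixels at a 640px input and roughly 15 pixels at 1216px, so the higher resolution gives the detector nearly four times as many pixels per small field.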
