COMMONFORMS: A Large, Diverse Dataset for Form Field Detection
This paper introduces CommonForms, a web-scale dataset for form field detection, and frames the problem as object detection. The dataset comprises approximately 55k documents and over 450k pages filtered from Common Crawl, spanning diverse languages and domains. The authors also present FFDNet-Small and FFDNet-Large, open-source models that achieve high average precision and outperform commercial PDF readers at detecting text, choice, and signature fields.
Article Points:
1. CommonForms: Large, diverse dataset for form field detection from Common Crawl.
2. FFDNet models: Open-source, high-resolution detectors outperforming commercial tools.
3. Form field detection: Framed as object detection for Text Input, Choice Button, Signature.
4. Dataset filtering: Rigorous process improves data efficiency and annotation quality.
5. High-resolution inputs: Essential for accurate form field detection performance.
6. FFDNet advantage: Predicts checkboxes, unlike popular commercial PDF readers.
Dataset (COMMONFORMS)
  Source: Common Crawl PDFs
  Size: 59k docs, 480k pages
  Diversity: Multi-language, multi-domain
  Preparation: Rigorous filtering pipeline

Models (FFDNet)
  Types: FFDNet-Small, FFDNet-Large
  Architecture: YOLO11-based object detectors
  Training: High-resolution 1216px, cost < $500
  Performance: High AP, outperforms Adobe Acrobat
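
Since model quality is reported as average precision (AP), a small worked sketch may help show what that number measures. The version below uses all-point interpolation at a single IoU threshold, which is one common variant and not necessarily the exact protocol used in the paper. The function name and the input format (a list of `(score, is_true_positive)` pairs) are assumptions made for illustration.

```python
def average_precision(scored_hits, num_ground_truth):
    """Area under the precision-recall curve for one class.

    scored_hits: list of (confidence, is_true_positive) pairs, one per
        prediction, where the true-positive flag was decided beforehand
        (for example by IoU matching against ground-truth boxes).
    num_ground_truth: number of ground-truth boxes for this class.
    """
    if num_ground_truth == 0:
        return 0.0

    # Rank predictions from most to least confident.
    ordered = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)

    true_pos = 0
    false_pos = 0
    recalls = []
    precisions = []
    for _score, is_tp in ordered:
        if is_tp:
            true_pos += 1
        else:
            false_pos += 1
        recalls.append(true_pos / num_ground_truth)
        precisions.append(true_pos / (true_pos + false_pos))

    # Make precision non-increasing in recall (the usual "envelope" step).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas between consecutive recall values.
    ap = 0.0
    prev_recall = 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical example: three predictions, two ground-truth fields.
example = [(0.9, True), (0.8, False), (0.7, True)]
print(round(average_precision(example, num_ground_truth=2), 4))
```

A detector that ranks every correct field above every false alarm reaches an AP of 1.0 under this definition; false positives with high confidence pull the value down.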

Task (Form Field Detection)
  Definition: Detect location & type
  Problem Type: Object detection
  Goal: Prepare fillable forms from flat PDFs
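
To make the "location and type" framing concrete, here is a minimal sketch of how a detected form field could be represented and compared against a reference box. The three class names come from the summary above; the `FieldBox` type, its field names, and the coordinate convention are assumptions for illustration and are not taken from the FFDNet code.

```python
from dataclasses import dataclass

# The three field types named in the summary.
FIELD_CLASSES = ("Text Input", "Choice Button", "Signature")


@dataclass(frozen=True)
class FieldBox:
    """One form field: a class label plus an axis-aligned box (hypothetical type)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    score: float = 1.0  # detector confidence; 1.0 for ground truth

    def __post_init__(self):
        if self.label not in FIELD_CLASSES:
            raise ValueError(f"unknown field class: {self.label!r}")
        if self.x_min >= self.x_max or self.y_min >= self.y_max:
            raise ValueError("box must have positive width and height")

    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def iou(a: FieldBox, b: FieldBox) -> float:
    """Intersection over union of two boxes, in [0, 1]."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: a predicted checkbox shifted right of the true one.
truth = FieldBox("Choice Button", 0.0, 0.0, 2.0, 2.0)
pred = FieldBox("Choice Button", 1.0, 0.0, 3.0, 2.0, score=0.8)
print(iou(truth, pred))
```

In object detection, a prediction usually counts as correct only when its label matches and its IoU with a ground-truth box exceeds a chosen threshold.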

Key Insights
  Quantity has quality: Leverage existing fillable forms
  High-resolution inputs: Crucial for accuracy
  Filtering: Improves data efficiency
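
A rough back-of-the-envelope calculation suggests why input resolution matters for small fields such as checkboxes. The 1216px figure comes from the summary above; the page size (US Letter, 11 inches on the long side), the checkbox size (0.15 inches), and the 640px comparison point are assumptions chosen for illustration only.

```python
def field_size_px(field_in, page_long_side_in, image_long_side_px):
    """Approximate on-image size of a field, in pixels, after the page is
    resized so that its long side equals image_long_side_px."""
    return field_in / page_long_side_in * image_long_side_px


PAGE_LONG_SIDE_IN = 11.0  # assumption: US Letter page
CHECKBOX_IN = 0.15        # assumption: typical printed checkbox width

for long_side_px in (640, 1216):
    size = field_size_px(CHECKBOX_IN, PAGE_LONG_SIDE_IN, long_side_px)
    print(f"{long_side_px}px input: checkbox is about {size:.1f}px wide")
```

Under these assumptions the checkbox spans under 9 pixels at 640px but over 16 pixels at 1216px, which gives the detector considerably more signal to localize and classify it.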

Contributions
  Release COMMONFORMS dataset
  Release FFDNet models & code
  Dataset analysis: Language, domain
  Comparison: Outperforms commercial systems