End-to-end document widget detection pipeline using YOLO11 on CommonForms dataset

Widget Detection Pipeline

End-to-end document form widget detection using YOLO11m trained on the CommonForms dataset.

It detects three classes of form fields in scanned PDFs and document images:

| Class ID | Name          | Description                  |
|----------|---------------|------------------------------|
| 0        | text_input    | Text boxes, input lines      |
| 1        | choice_button | Checkboxes and radio buttons |
| 2        | signature     | Signature fields             |
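
The class table maps directly onto a small lookup. The following is a minimal sketch; the names `CLASS_NAMES` and `class_name` are hypothetical, and the package's own `config.py` may define its class map differently.

```python
# Hypothetical class map mirroring the table above; the real
# widget_detector/config.py may use different names or structure.
CLASS_NAMES = {
    0: "text_input",     # text boxes, input lines
    1: "choice_button",  # checkboxes and radio buttons
    2: "signature",      # signature fields
}


def class_name(class_id: int) -> str:
    """Return the class name for a class ID, rejecting unknown IDs."""
    try:
        return CLASS_NAMES[class_id]
    except KeyError:
        raise ValueError(f"unknown class id: {class_id}") from None
```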

Requirements

  • Python 3.11+
  • uv (pip install uv)
  • CUDA GPU with ≥ 12 GB VRAM (e.g. RTX 3080 Ti or RTX A2000 12GB), required for training at 1024px with batch=4

Setup

# 1. Install uv if not already installed
pip install uv

# 2. Create venv and install all dependencies
uv sync

# 3. (Optional) Install dev dependencies for testing/linting
uv sync --extra dev

Pipeline

Step 1 — Download Dataset (CommonForms subset)

Streams a 50,000-image subset from Hugging Face, so the full 163 GB dataset does not have to be downloaded:

uv run scripts/download_dataset.py --max-images 50000

Options:

  • --max-images N — number of images (default: 50,000)
  • --token HF_TOKEN — HuggingFace token if needed
  • --seed 42 — reproducibility seed

Output: data/raw/images/ + data/raw/annotations/
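
The idea behind streaming a subset can be sketched with the standard library alone: pull items lazily from an iterable and stop after `--max-images` of them. With the Hugging Face `datasets` library, `load_dataset(..., streaming=True)` returns an iterable that can be consumed in the same way. The helper `take_subset` and the stand-in `fake_stream` are hypothetical, and how `download_dataset.py` applies `--seed` is not shown in this README.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_subset(samples: Iterable[T], max_images: int) -> Iterator[T]:
    """Return an iterator over at most max_images items, pulled lazily."""
    if max_images < 0:
        raise ValueError("max_images must be non-negative")
    return islice(samples, max_images)


def fake_stream():
    """Stand-in for a streamed dataset: an endless source of records."""
    n = 0
    while True:
        yield {"id": n}
        n += 1


# Only five records are ever produced, even though the source is endless.
subset = list(take_subset(fake_stream(), 5))
```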


Step 2 — Convert to YOLO Format

uv run scripts/convert_to_yolo.py

Options:

  • --val-ratio 0.1 — validation split (default: 10%)
  • --seed 42

Output: data/yolo/ with images/, labels/, data.yaml
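
As a rough illustration of what the conversion involves: YOLO label files hold one `class cx cy w h` line per box, normalised by image size, and the split has to be reproducible for a given seed. The sketch below assumes the raw annotations provide pixel-space corner coordinates (the raw format is not described in this README); `to_yolo_line` and `split_train_val` are hypothetical names, not the script's actual functions.

```python
import random


def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Format one pixel-space corner box as a YOLO label line.

    YOLO labels are `class cx cy w h`, normalised to [0, 1] by image size.
    """
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def split_train_val(names, val_ratio=0.1, seed=42):
    """Shuffle deterministically, then hold out val_ratio of the names."""
    names = sorted(names)               # stable starting order
    random.Random(seed).shuffle(names)  # private RNG, global state untouched
    n_val = int(len(names) * val_ratio)
    return names[n_val:], names[:n_val]


line = to_yolo_line(0, 120.0, 340.0, 480.0, 380.0, 1654, 2339)
train, val = split_train_val([f"img_{i:05d}" for i in range(100)])
```

Sorting before shuffling means the split depends only on the set of names and the seed, not on the order in which files happen to be listed.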


Step 3 — Verify Dataset

# Check integrity
uv run scripts/verify_dataset.py

# Visual inspection (draws 20 sample images with bboxes)
uv run scripts/verify_dataset.py --draw-samples 20
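
This README does not list the exact checks `verify_dataset.py` performs. A minimal sketch of the kind of per-line integrity check that applies to any YOLO label file might look like this, with `check_label_line` being a hypothetical helper:

```python
def check_label_line(line: str, num_classes: int = 3) -> list[str]:
    """Return the problems found in one YOLO label line (empty if OK)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]  # cx, cy, w, h
    except ValueError:
        return ["non-numeric field"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} out of range")
    if any(not 0.0 <= c <= 1.0 for c in coords):
        problems.append("coordinate outside [0, 1]")
    if coords[2] <= 0 or coords[3] <= 0:
        problems.append("non-positive width or height")
    return problems
```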

Step 4 — Train

# Full training (100 epochs, batch=4, 1024px)
uv run train.py --config configs/train_config.yaml

# Smoke test (3 epochs, quick sanity check)
uv run train.py --config configs/train_config.yaml --smoke-test

# Resume from last checkpoint
uv run train.py --config configs/train_config.yaml --resume

Training output: runs/detect/widget_yolo11m/


Step 5 — Run Inference

# PDF input → JSON output
uv run inference.py \
    --input form.pdf \
    --model runs/detect/widget_yolo11m/weights/best.pt

# Image input with lower confidence threshold
uv run inference.py \
    --input scan.jpg \
    --model best.pt \
    --conf 0.2

# Batch of PDFs with visual overlay
uv run inference.py \
    --input "forms/*.pdf" \
    --model best.pt \
    --visualize \
    --output-dir outputs/

# High DPI for dense forms
uv run inference.py --input form.pdf --model best.pt --dpi 300
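
To see what `--dpi` does to image size: PDF page dimensions are measured in points (72 per inch), so rendering scales each side by `dpi / 72`. The sketch below is plain arithmetic and does not call PyMuPDF; the actual renderer may differ by a pixel because of rounding, and the script's default DPI is not stated in this README. `rendered_size` is a hypothetical helper.

```python
A4_POINTS = (595.276, 841.890)  # ISO A4 (210 x 297 mm) in PDF points


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at the given DPI.

    PDF user space has 72 points per inch, so the scale factor is dpi / 72.
    """
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


for dpi in (150, 200, 300):
    print(dpi, rendered_size(*A4_POINTS, dpi))
```

An A4 page at 200 DPI comes out at 1654 × 2339 pixels under this calculation, and at 300 DPI each side is 1.5 times larger, which gives small checkboxes on dense forms more pixels at the cost of slower processing.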

Output Format

{
  "source": "form.pdf",
  "total_pages": 3,
  "total_widgets": 24,
  "pages": [
    {
      "source": "form.pdf",
      "page": 1,
      "image_width": 1654,
      "image_height": 2339,
      "processing_time_ms": 142.3,
      "widgets": [
        {
          "class_id": 0,
          "class_name": "text_input",
          "confidence": 0.913,
          "bbox": {
            "x1": 120.0, "y1": 340.0, "x2": 480.0, "y2": 380.0,
            "x1_norm": 0.073, "y1_norm": 0.145,
            "x2_norm": 0.290, "y2_norm": 0.162
          },
          "page": 1
        }
      ]
    }
  ]
}
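
Downstream code can consume this JSON with the standard library. The sketch below counts widgets per class above a confidence threshold; `count_widgets` is a hypothetical helper, and `SAMPLE` is a trimmed copy of the structure shown above, not real model output.

```python
import json
from collections import Counter

# Trimmed copy of the documented output structure (illustrative only).
SAMPLE = """
{
  "source": "form.pdf",
  "pages": [
    {
      "page": 1,
      "image_width": 1654,
      "image_height": 2339,
      "widgets": [
        {
          "class_id": 0,
          "class_name": "text_input",
          "confidence": 0.913,
          "bbox": {"x1": 120.0, "y1": 340.0, "x2": 480.0, "y2": 380.0},
          "page": 1
        }
      ]
    }
  ]
}
"""


def count_widgets(result: dict, min_conf: float = 0.0) -> Counter:
    """Count widgets per class name, keeping only confidence >= min_conf."""
    counts = Counter()
    for page in result["pages"]:
        for widget in page["widgets"]:
            if widget["confidence"] >= min_conf:
                counts[widget["class_name"]] += 1
    return counts


result = json.loads(SAMPLE)
counts = count_widgets(result, min_conf=0.5)
```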

Run Tests

uv run pytest tests/ -v

Training Config Highlights (12 GB GPU)

| Parameter      | Value  | Reason                                       |
|----------------|--------|----------------------------------------------|
| imgsz          | 1024   | Small widget detection needs high resolution |
| batch          | 4      | Safe for 12 GB VRAM at 1024px                |
| amp            | true   | Mixed precision; reduces VRAM ~40%           |
| epochs         | 100    | With early stopping (patience=20)            |
| degrees        | 10.0   | Rotation for skewed scans                    |
| perspective    | 0.0005 | Real-world document distortion               |
| mosaic         | 1.0    | Key augmentation for small widgets           |
| albumentations | auto   | Blur + noise when installed                  |
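
For orientation, the values in the table correspond to standard Ultralytics training arguments. The following is a minimal sketch of how they could be passed to `YOLO.train`; the helper names are hypothetical, and the project's own `trainer.py` wrapper and the layout of `train_config.yaml` are not shown in this README. The smoke-test override changes only the epoch count here, which is an assumption based on the "3 epochs" description.

```python
def build_train_args(data_yaml: str, smoke_test: bool = False) -> dict:
    """Collect the highlighted hyperparameters as Ultralytics train kwargs."""
    args = dict(
        data=data_yaml,
        imgsz=1024,          # high resolution for small widgets
        batch=4,             # sized for 12 GB VRAM at 1024px
        amp=True,            # mixed precision
        epochs=100,
        patience=20,         # early stopping
        degrees=10.0,        # rotation augmentation
        perspective=0.0005,  # perspective augmentation
        mosaic=1.0,          # mosaic augmentation
    )
    if smoke_test:
        args["epochs"] = 3   # assumption: smoke test only shortens training
    return args


def run_training(data_yaml: str, smoke_test: bool = False):
    """Train YOLO11m; requires the ultralytics package and a CUDA GPU."""
    from ultralytics import YOLO  # imported lazily so the helper above stays light

    model = YOLO("yolo11m.pt")    # pretrained YOLO11m weights
    return model.train(**build_train_args(data_yaml, smoke_test))
```

Calling `run_training("data/yolo/data.yaml")` would start a full run under these assumptions; `build_train_args` can be inspected on its own without a GPU.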

Project Structure

Widget_detection1/
├── widget_detector/          # Core library
│   ├── config.py             # Paths, class maps, defaults
│   ├── dataset.py            # HF download + YOLO conversion
│   ├── detector.py           # WidgetDetector inference class
│   ├── output.py             # Pydantic result models
│   ├── pdf_utils.py          # PDF → PIL images (PyMuPDF)
│   └── trainer.py            # Training wrapper
├── scripts/
│   ├── download_dataset.py   # Step 1: Download
│   ├── convert_to_yolo.py    # Step 2: Convert
│   └── verify_dataset.py     # Step 3: Verify
├── configs/
│   └── train_config.yaml     # YOLO11m hyperparameters
├── train.py                  # Training entry point
├── inference.py              # Inference entry point
├── tests/                    # Unit tests
└── pyproject.toml            # uv project manifest

Notes

  • CommonForms choice_button includes both checkboxes and radio buttons as one class (the dataset does not distinguish them). If you need to split them, a heuristic post-processor can be added based on bbox aspect ratio.
  • Training is set to 3 classes (text_input, choice_button, signature) matching CommonForms exactly.
  • The data/ and runs/ directories are gitignored — do not commit them.
