End-to-end document widget detection pipeline using YOLO11 on CommonForms dataset
Widget Detection Pipeline
End-to-end document form widget detection using YOLO11m trained on the CommonForms dataset.
Detects 3 classes of form fields from scanned PDFs and document images:
| Class ID | Name | Description |
|---|---|---|
| 0 | `text_input` | Text boxes, input lines |
| 1 | `choice_button` | Checkboxes and radio buttons |
| 2 | `signature` | Signature fields |
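For downstream code it can help to have the class table available as a lookup. The snippet below is a minimal sketch of such a map; the names `CLASS_NAMES`, `NAME_TO_ID`, and `class_name` are illustrative and are not taken from `widget_detector/config.py`, whose contents are not shown here.

```python
# Class IDs and names as listed in the table above.
CLASS_NAMES = {0: "text_input", 1: "choice_button", 2: "signature"}

# Reverse lookup, useful when filtering results by name.
NAME_TO_ID = {name: class_id for class_id, name in CLASS_NAMES.items()}


def class_name(class_id: int) -> str:
    """Return the class name for an ID, rejecting IDs outside the table."""
    try:
        return CLASS_NAMES[class_id]
    except KeyError:
        raise ValueError(f"unknown class id: {class_id}") from None
```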
Requirements
- Python 3.11+
- uv (`pip install uv`)
- CUDA GPU with ≥ 12 GB VRAM (RTX 3080 Ti, A2000 12GB, etc.) for training at 1024px with batch=4
Setup
# 1. Install uv if not already installed
pip install uv
# 2. Create venv and install all dependencies
uv sync
# 3. (Optional) Install dev dependencies for testing/linting
uv sync --extra dev
Pipeline
Step 1 — Download Dataset (CommonForms subset)
Streams 50,000 images from HuggingFace (no full 163GB download needed):
uv run scripts/download_dataset.py --max-images 50000
Options:
- `--max-images N`: number of images (default: 50,000)
- `--token HF_TOKEN`: HuggingFace token, if needed
- `--seed 42`: reproducibility seed
Output: data/raw/images/ + data/raw/annotations/
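Streaming is what avoids the full download: records are pulled one at a time and the script stops after `--max-images`. The sketch below shows that idea under stated assumptions. `stream_subset` and `open_stream` are hypothetical helpers, not the functions in `scripts/download_dataset.py`; the Hub dataset ID is left as a parameter because it is not given here, and the `"train"` split name and shuffle buffer size are guesses.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def stream_subset(records: Iterable[dict], max_images: int) -> Iterator[dict]:
    """Yield at most max_images records without consuming the rest of the stream."""
    return islice(records, max_images)


def open_stream(repo_id: str, seed: int = 42, token: Optional[str] = None):
    """Open a shuffled streaming view of a Hub dataset (hypothetical helper).

    Assumptions: repo_id is whatever dataset ID the project uses, and the
    data lives in a split called "train".
    """
    from datasets import load_dataset  # imported lazily; needs the `datasets` package

    ds = load_dataset(repo_id, split="train", streaming=True, token=token)
    # Streaming datasets shuffle through a bounded buffer, so the seed makes
    # the sampled subset reproducible without reading the whole dataset.
    return ds.shuffle(seed=seed, buffer_size=1_000)
```

With these helpers, `stream_subset(open_stream(repo_id), 50_000)` would iterate over the first 50,000 shuffled records only.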
Step 2 — Convert to YOLO Format
uv run scripts/convert_to_yolo.py
Options:
- `--val-ratio 0.1`: validation split (default: 10%)
- `--seed 42`
Output: data/yolo/ with images/, labels/, data.yaml
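YOLO label files hold one line per box: a class ID followed by the box center and size, all normalized by the image dimensions. The sketch below shows that conversion, assuming the raw annotations provide pixel-space `x1, y1, x2, y2` corners (the raw annotation layout is not documented here). `xyxy_to_yolo` and `yolo_label_line` are illustrative names, not the project's API.

```python
from typing import Tuple


def xyxy_to_yolo(
    x1: float, y1: float, x2: float, y2: float, img_w: int, img_h: int
) -> Tuple[float, float, float, float]:
    """Convert pixel corner coordinates to normalized (cx, cy, w, h)."""
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h


def yolo_label_line(class_id: int, box_xyxy, img_w: int, img_h: int) -> str:
    """Format one line of a YOLO .txt label file."""
    cx, cy, w, h = xyxy_to_yolo(*box_xyxy, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```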
Step 3 — Verify Dataset
# Check integrity
uv run scripts/verify_dataset.py
# Visual inspection (draws 20 sample images with bboxes)
uv run scripts/verify_dataset.py --draw-samples 20
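The exact checks performed by `verify_dataset.py` are not listed here. As a rough illustration of what an integrity check on YOLO labels usually covers, the sketch below validates a single label line; `check_label_line` is a hypothetical helper, and the default of 3 classes matches the class table in this README.

```python
def check_label_line(line: str, num_classes: int = 3) -> list[str]:
    """Return a list of problems found in one YOLO label line (empty if valid)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        cx, cy, w, h = (float(p) for p in parts[1:])
    except ValueError:
        return ["non-numeric field"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} out of range")
    if any(not 0.0 <= v <= 1.0 for v in (cx, cy, w, h)):
        problems.append("coordinate outside [0, 1]")
    if w <= 0 or h <= 0:
        problems.append("non-positive width or height")
    return problems
```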
Step 4 — Train
# Full training (100 epochs, batch=4, 1024px)
uv run train.py --config configs/train_config.yaml
# Smoke test (3 epochs, quick sanity check)
uv run train.py --config configs/train_config.yaml --smoke-test
# Resume from last checkpoint
uv run train.py --config configs/train_config.yaml --resume
Training output: runs/detect/widget_yolo11m/
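How `train.py` handles `--smoke-test` and `--resume` internally is not shown here. One plausible mapping, sketched below, picks the starting weights and a small set of overrides per mode. `plan_training` is a hypothetical function; the `last.pt` location follows the Ultralytics convention of saving `last.pt` alongside `best.pt` under `weights/`, and the pretrained name `yolo11m.pt` is assumed from the model named above.

```python
from pathlib import Path


def plan_training(run_dir: str, smoke_test: bool = False, resume: bool = False) -> dict:
    """Choose starting weights and overrides for a training run (illustrative only)."""
    if resume:
        # Resume continues from the last saved checkpoint of the existing run.
        last = Path(run_dir) / "weights" / "last.pt"
        return {"weights": last.as_posix(), "overrides": {"resume": True}}
    # A smoke test only shortens the run; everything else stays as configured.
    overrides = {"epochs": 3} if smoke_test else {}
    return {"weights": "yolo11m.pt", "overrides": overrides}
```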
Step 5 — Run Inference
# PDF input → JSON output
uv run inference.py \
--input form.pdf \
--model runs/detect/widget_yolo11m/weights/best.pt
# Image input with lower confidence threshold
uv run inference.py \
--input scan.jpg \
--model best.pt \
--conf 0.2
# Batch of PDFs with visual overlay
uv run inference.py \
--input "forms/*.pdf" \
--model best.pt \
--visualize \
--output-dir outputs/
# High DPI for dense forms
uv run inference.py --input form.pdf --model best.pt --dpi 300
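The `--dpi` value determines how many pixels each PDF page is rendered to, which in turn determines how large small widgets appear to the detector. PDF page sizes are expressed in points (72 per inch), so the pixel size scales as `dpi / 72`. The sketch below does that arithmetic for an A4 page; `rendered_size` is an illustrative helper, and the renderer's actual rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72.0

# A4 page size in PDF points (210 mm x 297 mm).
A4_PT = (595.276, 841.890)


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


print(rendered_size(*A4_PT, 200))  # (1654, 2339)
print(rendered_size(*A4_PT, 300))  # (2480, 3508)
```

Going from 200 to 300 DPI multiplies the pixel count by 2.25, so expect processing time and memory use to rise accordingly.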
Output Format
{
"source": "form.pdf",
"total_pages": 3,
"total_widgets": 24,
"pages": [
{
"source": "form.pdf",
"page": 1,
"image_width": 1654,
"image_height": 2339,
"processing_time_ms": 142.3,
"widgets": [
{
"class_id": 0,
"class_name": "text_input",
"confidence": 0.913,
"bbox": {
"x1": 120.0, "y1": 340.0, "x2": 480.0, "y2": 380.0,
"x1_norm": 0.073, "y1_norm": 0.145,
"x2_norm": 0.290, "y2_norm": 0.162
},
"page": 1
}
]
}
]
}
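Because the result is plain JSON, it can be consumed without importing the library. The sketch below counts detected widgets per class with an optional confidence cutoff. It relies only on the `pages`, `widgets`, `class_name`, and `confidence` keys shown above; `count_by_class` is an illustrative helper, and the embedded sample is a trimmed copy of the example output.

```python
import json
from collections import Counter


def count_by_class(result: dict, min_conf: float = 0.0) -> Counter:
    """Count widgets per class name across all pages, skipping low-confidence ones."""
    counts: Counter = Counter()
    for page in result["pages"]:
        for widget in page["widgets"]:
            if widget["confidence"] >= min_conf:
                counts[widget["class_name"]] += 1
    return counts


sample = json.loads("""
{
  "source": "form.pdf",
  "pages": [
    {
      "page": 1,
      "widgets": [
        {"class_id": 0, "class_name": "text_input", "confidence": 0.913}
      ]
    }
  ]
}
""")

print(count_by_class(sample))        # Counter({'text_input': 1})
print(count_by_class(sample, 0.95))  # Counter()
```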
Run Tests
uv run pytest tests/ -v
Training Config Highlights (12 GB GPU)
| Parameter | Value | Reason |
|---|---|---|
| `imgsz` | 1024 | Small widget detection needs high resolution |
| `batch` | 4 | Safe for 12 GB VRAM at 1024px |
| `amp` | true | Mixed precision, reduces VRAM by ~40% |
| `epochs` | 100 | With early stopping (patience=20) |
| `degrees` | 10.0 | Rotation for skewed scans |
| `perspective` | 0.0005 | Real-world document distortion |
| `mosaic` | 1.0 | Key augmentation for small widgets |
| `albumentations` | auto | Blur + noise when installed |
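For reference, the parameters in the table map directly onto Ultralytics training arguments. The sketch below expresses them as keyword arguments; it is not a copy of `configs/train_config.yaml`, whose layout is not shown here. The `data`, `project`, and `name` values are inferred from the paths mentioned in this README, and `yolo11m.pt` is assumed as the pretrained starting point.

```python
# Values from the table above, plus paths inferred from the pipeline steps.
TRAIN_KWARGS = {
    "data": "data/yolo/data.yaml",
    "imgsz": 1024,
    "batch": 4,
    "amp": True,
    "epochs": 100,
    "patience": 20,         # early stopping
    "degrees": 10.0,        # rotation augmentation
    "perspective": 0.0005,  # perspective augmentation
    "mosaic": 1.0,
    "project": "runs/detect",
    "name": "widget_yolo11m",
}


def train(kwargs: dict = TRAIN_KWARGS):
    """Start training with the settings above (requires the `ultralytics` package)."""
    from ultralytics import YOLO  # imported lazily so the dict can be inspected without it

    model = YOLO("yolo11m.pt")  # assumed pretrained checkpoint name
    return model.train(**kwargs)
```

There is no entry for `albumentations` because Ultralytics applies those transforms automatically when the package is installed.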
Project Structure
Widget_detection1/
├── widget_detector/ # Core library
│ ├── config.py # Paths, class maps, defaults
│ ├── dataset.py # HF download + YOLO conversion
│ ├── detector.py # WidgetDetector inference class
│ ├── output.py # Pydantic result models
│ ├── pdf_utils.py # PDF → PIL images (PyMuPDF)
│ └── trainer.py # Training wrapper
├── scripts/
│ ├── download_dataset.py # Step 1: Download
│ ├── convert_to_yolo.py # Step 2: Convert
│ └── verify_dataset.py # Step 3: Verify
├── configs/
│ └── train_config.yaml # YOLO11m hyperparameters
├── train.py # Training entry point
├── inference.py # Inference entry point
├── tests/ # Unit tests
└── pyproject.toml # uv project manifest
Notes
- CommonForms `choice_button` includes both checkboxes and radio buttons as one class (the dataset does not distinguish them). If you need to split them, a heuristic post-processor can be added based on bbox aspect ratio.
- Training is set to 3 classes (`text_input`, `choice_button`, `signature`), matching CommonForms exactly.
- The `data/` and `runs/` directories are gitignored; do not commit them.