Signature detection and role attribution for PDFs
CaseWorks.Automation.CaseDocumentIntake
sigdetect
sigdetect is a small Python library + CLI that detects e-signature evidence in PDFs and infers the signer role (e.g., patient, attorney, representative).
It looks for:
- Real signature form fields (/Widget annotations with /FT /Sig)
- AcroForm signature fields present only at the document level
- Common vendor markers (e.g., DocuSign, “Signature Certificate”)
- Page labels (like “Signature of Patient” or “Signature of Parent/Guardian”)
It returns a structured summary per file (pages, counts, roles, hints, etc.) that can be used downstream.
Contents
- Quick start
- CLI usage
- Library usage
- Result schema
- Configuration & rules
- Smoke tests
- Dev workflow
- Troubleshooting
- License
Quick start
Requirements
- Python 3.9+ (developed & tested on 3.11)
- macOS / Linux / WSL
Setup
# 1) Create and activate a virtualenv (example uses Python 3.11)
python3.11 -m venv .venv
source .venv/bin/activate
# 2) Install in editable (dev) mode
python -m pip install --upgrade pip
pip install -e .
Sanity check
# Run unit & smoke tests
pytest -q
CLI usage
The project ships a Typer-based CLI (exposed either as sigdetect or runnable via python -m sigdetect.cli, depending on how it is installed).
sigdetect --help
# or
python -m sigdetect.cli --help
Detect (per-file summary)
# Execute detection according to the YAML configuration
sigdetect detect \
  --config ./sample_data/config.yml \
  --profile hipaa # or: retainer
Notes
- The config file controls pdf_root, out_dir, engine, pseudo_signatures, recurse_xobjects, etc.
- --engine accepts auto (default; prefers PyMuPDF when installed, falls back to PyPDF2), pypdf2, or pymupdf.
- --pseudo-signatures enables a vendor/Acro-only pseudo-signature when no actual /Widget is present (useful for DocuSign / Acrobat Sign receipts).
- --recurse-xobjects allows scanning Form XObjects for vendor markers and labels embedded in page resources.
- --profile selects tuned role logic:
  - hipaa → patient / representative / attorney
  - retainer → client / firm (prefers detecting two signatures)
- --recursive/--no-recursive toggles whether sigdetect detect descends into subdirectories when hunting for PDFs (recursive by default).
- Cropping (--crop-signatures) and wet detection (--detect-wet) are enabled by default for single-pass runs; disable them if you want a light, e-sign-only pass. PyMuPDF is required for crops; PyMuPDF + Tesseract are required for wet detection.
- If the executable is not on PATH, you can always fall back to python -m sigdetect.cli ....
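When the CLI is driven from another script, the PATH fallback described in the last note can be handled programmatically. The following is a minimal sketch that uses only the detect subcommand and the --config / --profile flags shown above; build_detect_command is a hypothetical helper and not part of sigdetect.

```python
import shutil
import sys
from typing import List


def build_detect_command(config_path: str, profile: str) -> List[str]:
    """Return an argv list for `sigdetect detect` (hypothetical helper)."""
    if shutil.which("sigdetect"):
        # The console script is on PATH, so call it directly
        prefix = ["sigdetect"]
    else:
        # Fall back to module execution with the current interpreter
        prefix = [sys.executable, "-m", "sigdetect.cli"]
    return prefix + ["detect", "--config", config_path, "--profile", profile]


command = build_detect_command("./sample_data/config.yml", "hipaa")
print(command)
```

The resulting list can be passed to subprocess.run(command, check=True) once sigdetect is installed in the active environment.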
EDA (quick aggregate stats)
sigdetect eda \
  --config ./sample_data/config.yml
Library usage
from pathlib import Path
from sigdetect.config import DetectConfiguration
from sigdetect.detector.pypdf2_engine import PyPDF2Detector
configuration = DetectConfiguration(
    PdfRoot=Path("/path/to/pdfs"),
    OutputDirectory=Path("./out"),
    Engine="pypdf2",
    PseudoSignatures=True,
    RecurseXObjects=True,
    Profile="retainer",  # or "hipaa"
)
detector = PyPDF2Detector(configuration)
result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
print(result.to_dict())
Detect(Path) returns a FileResult dataclass; call .to_dict() for the JSON-friendly representation (see Result schema). Each signature entry now exposes bounding_box coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, crop_path points at the generated image. Use Engine="auto" if you want the single-pass defaults that prefer PyMuPDF (for geometry) when available.
Library API (embed in another script)
Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping:
from pathlib import Path

from sigdetect.api import (
    CropSignatureImages,
    DetectMany,
    DetectPdf,
    ScanDirectory,
    ToCsvRow,
    Version,
    get_detector,
)

print("sigdetect", Version())

# 1) Single file → dict
result = DetectPdf(
    "/path/to/file.pdf",
    profileName="retainer",
    includePseudoSignatures=True,
    recurseXObjects=True,
)
print(
    result["file"],
    result["pages"],
    result["esign_found"],
    result["sig_count"],
    result["sig_pages"],
    result["roles"],
    result["hints"],
)

# 2) Directory walk (generator of dicts)
for res in ScanDirectory(
    "/path/to/pdfs",
    profileName="hipaa",
    includePseudoSignatures=True,
    recurseXObjects=True,
):
    # store in DB, print, etc.
    pass

# 3) Crop PNG snippets for FileResult objects (requires PyMuPDF)
detector = get_detector(pdfRoot="/path/to/pdfs", profileName="hipaa")
file_result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
CropSignatureImages(
    "/path/to/pdfs/example.pdf",
    file_result,
    outputDirectory="./signature_crops",
    dpi=200,
)
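Because the API returns plain dicts, the summary keys printed above can be written to CSV with the standard library alone. The sketch below is one possible approach, not the library's own exporter: write_summary_csv and SUMMARY_COLUMNS are hypothetical names, the sample dict stands in for real results from ScanDirectory, and ToCsvRow is deliberately not used because its signature is not documented here.

```python
import csv
import io
from typing import Dict, Iterable

# Top-level summary keys taken from the result dicts shown in this README
SUMMARY_COLUMNS = [
    "file", "pages", "esign_found", "sig_count", "sig_pages", "roles", "hints",
]


def write_summary_csv(results: Iterable[Dict], stream) -> int:
    """Write one CSV row per result dict; return the number of rows written."""
    writer = csv.DictWriter(stream, fieldnames=SUMMARY_COLUMNS)
    writer.writeheader()
    count = 0
    for result in results:
        # Keep only the summary columns; nested `signatures` are skipped
        writer.writerow({key: result.get(key, "") for key in SUMMARY_COLUMNS})
        count += 1
    return count


# Stand-in for dicts yielded by ScanDirectory(...)
sample = [{
    "file": "example.pdf", "pages": 3, "esign_found": True, "sig_count": 2,
    "sig_pages": "1,3", "roles": "patient;representative",
    "hints": "AcroSig:sig_patient", "signatures": [],
}]
buffer = io.StringIO()
rows_written = write_summary_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In a real run, the sample list would be replaced by the ScanDirectory generator and the buffer by an open file handle created with newline="".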
Result schema
High-level summary (per file):
{
  "file": "example.pdf",
  "size_kb": 123.4,
  "pages": 3,
  "esign_found": true,
  "scanned_pdf": false,
  "mixed": false,
  "sig_count": 2,
  "sig_pages": "1,3",
  "roles": "patient;representative",
  "hints": "AcroSig:sig_patient;VendorText:DocuSign\\s+Envelope\\s+ID",
  "signatures": [
    {
      "page": 1,
      "field_name": "sig_patient",
      "role": "patient",
      "score": 5,
      "scores": { "field": 3, "page_label": 2 },
      "evidence": ["field:patient", "page_label:patient"],
      "hint": "AcroSig:sig_patient",
      "render_type": "typed",
      "bounding_box": [10.0, 10.0, 150.0, 40.0],
      "crop_path": "signature_crops/example/sig_01_patient.png"
    },
    {
      "page": null,
      "field_name": "vendor_or_acro_detected",
      "role": "representative",
      "score": 6,
      "scores": { "page_label": 4, "general": 2 },
      "evidence": ["page_label:representative(parent/guardian)", "pseudo:true"],
      "hint": "VendorOrAcroOnly",
      "render_type": "typed",
      "bounding_box": null,
      "crop_path": null
    }
  ]
}
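The summary packs sig_pages, roles, and hints into delimited strings, so downstream code usually needs to split them. A minimal sketch, assuming the delimiters are exactly those in the example above (comma for pages, semicolon for roles and hints); split_summary_fields is a hypothetical helper, and a hint pattern that itself contains a semicolon would be split incorrectly.

```python
from typing import Dict, List


def split_summary_fields(result: Dict) -> Dict[str, List]:
    """Split the delimited summary strings into lists (hypothetical helper)."""
    pages_text = result.get("sig_pages") or ""
    roles_text = result.get("roles") or ""
    hints_text = result.get("hints") or ""
    return {
        # "1,3" -> [1, 3]; empty strings yield empty lists
        "sig_pages": [int(p) for p in pages_text.split(",") if p.strip()],
        "roles": [r for r in roles_text.split(";") if r],
        "hints": [h for h in hints_text.split(";") if h],
    }


parsed = split_summary_fields({
    "sig_pages": "1,3",
    "roles": "patient;representative",
    "hints": "AcroSig:sig_patient;VendorText:DocuSign\\s+Envelope\\s+ID",
})
print(parsed)
```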
Field notes
- esign_found is true if any signature widget, AcroForm /Sig field, or vendor marker is detected.
- scanned_pdf is a heuristic: pages with images only and no extractable text.
- mixed means both esign_found and scanned_pdf are true.
- roles summarizes unique non-unknown roles across signatures.
- In the retainer profile, the emitter prefers two signatures (client + firm), often on the same page.
- signatures[].bounding_box reports the widget rectangle in PDF points (origin bottom-left).
- signatures[].crop_path is populated when PNG crops are generated (via CLI --crop-signatures or CropSignatureImages).
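Since bounding_box uses PDF points with a bottom-left origin, overlaying it on a rendered page image requires scaling by DPI and flipping the y axis. The sketch below illustrates that conversion under two assumptions that this README does not state: the box is ordered [left, bottom, right, top], and the page height in points is known to the caller. bbox_to_pixels is a hypothetical helper.

```python
from typing import List, Tuple


def bbox_to_pixels(
    bounding_box: List[float], page_height_pt: float, dpi: int = 200
) -> Tuple[int, int, int, int]:
    """Convert a bottom-left-origin box in points to a top-left-origin pixel box."""
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    x0, y0, x1, y1 = bounding_box  # assumed order: left, bottom, right, top
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the y axis: image rows count down from the top of the page
    top = round((page_height_pt - y1) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return left, top, right, bottom


# US Letter page (792 pt tall) rendered at 200 DPI
pixel_box = bbox_to_pixels([10.0, 10.0, 150.0, 40.0], page_height_pt=792.0, dpi=200)
print(pixel_box)  # → (28, 2089, 417, 2172)
```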
Configuration & rules
Built-in rules live under src/sigdetect/data/:
- vendor_patterns.yml – vendor byte/text patterns (e.g., DocuSign, Acrobat Sign).
- role_rules.yml – signer-role logic:
  - labels – strong page labels (e.g., “Signature of Patient”, including Parent/Guardian cases)
  - general – weaker role hints in surrounding text
  - field_hints – field-name keywords (e.g., sig_patient)
  - doc_hard – strong document-level triggers (relationship to patient, “minor/unable to sign”, first-person consent)
  - weights – scoring weights for the above
- role_rules.retainer.yml – retainer-specific rules (labels for client/firm, general tokens, and field hints).
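To make the role of weights more concrete, the following sketch shows one way weighted evidence could accumulate into the score and scores fields seen in the result schema. It is an illustration only: the weight values, the (source, role) evidence format, and the function names are all assumptions, and the actual scoring in sigdetect may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical weights; the real values live in role_rules.yml under `weights`
WEIGHTS = {"field": 3, "page_label": 2, "general": 1}


def score_roles(evidence: List[Tuple[str, str]]) -> Dict[str, Dict]:
    """Accumulate per-role scores from (source, role) evidence pairs."""
    per_role: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for source, role in evidence:
        # Each hit adds its source weight; unknown sources contribute nothing
        per_role[role][source] += WEIGHTS.get(source, 0)
    return {
        role: {"score": sum(scores.values()), "scores": dict(scores)}
        for role, scores in per_role.items()
    }


def best_role(evidence: List[Tuple[str, str]]) -> str:
    """Return the highest-scoring role, or "unknown" when there is no evidence."""
    ranked = score_roles(evidence)
    if not ranked:
        return "unknown"
    return max(ranked, key=lambda role: ranked[role]["score"])


hits = [("field", "patient"), ("page_label", "patient"), ("general", "attorney")]
scored = score_roles(hits)
winner = best_role(hits)
print(winner, scored[winner])
```

With these assumed weights, a field hint plus a page label for the same role outweighs a single general hint for another role.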
You can keep one config YAML per dataset, e.g.:
# ./sample_data/config.yml (example)
pdf_root: ./pdfs
out_dir: ./sigdetect_out
engine: pypdf2
pseudo_signatures: true
recurse_xobjects: true
profile: retainer # or: hipaa
crop_signatures: false # enable to write PNG crops (requires pymupdf)
# crop_output_dir: ./signature_crops
crop_image_dpi: 200
detect_wet_signatures: false # opt-in OCR wet detection (PyMuPDF + Tesseract)
wet_ocr_dpi: 200
wet_ocr_languages: eng
wet_precision_threshold: 0.82
The YAML rule files can be customized or loaded at runtime (see the CLI --config option, if available, or import the patterns and pass them into the engine).
Key detection behaviors
- Widget-first in mixed docs: if a real /Widget exists, no pseudo “VendorOrAcroOnly” signature is emitted.
- Acro-only dedupe: multiple /Sig fields at the document level collapse to a single pseudo signature.
- Parent/Guardian label: “Signature of Parent/Guardian” maps to the representative role.
- Field-name fallbacks: role hints are pulled from /T, /TU, or /TM (in that order).
- Retainer heuristics:
  - Looks for client and firm labels/tokens; boosts pages with law-firm markers (LLP/LLC/PA/PC) and “By:” blocks.
  - Applies an anti-front-matter rule to reduce page-1 false positives (e.g., letterheads, firm mastheads).
  - When only vendor/Acro clues exist (no widgets), it will emit two pseudo signatures targeting likely pages.
- Wet detection (opt-in): With detect_wet_signatures: true, the CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection. It emits RenderType="wet" signatures for high-confidence label/stroke pairs in the lower page region. Missing OCR dependencies add a ManualReview:* hint instead of failing.
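Three of the behaviors above (widget-first, Acro-only dedupe, and the /T, /TU, /TM fallback order) can be sketched in a few lines. This is a simplified illustration over plain dicts, not the library's code: field_name_hint and plan_signatures are hypothetical names, and the sketch ignores the pseudo_signatures switch and the retainer profile's two-pseudo-signature behavior.

```python
from typing import Dict, List, Optional


def field_name_hint(field: Dict[str, str]) -> Optional[str]:
    """Return the first non-empty of /T, /TU, /TM, in that order."""
    for key in ("/T", "/TU", "/TM"):
        value = field.get(key)
        if value:
            return value
    return None


def plan_signatures(
    widgets: List[Dict[str, str]], acro_sig_count: int, vendor_hit: bool
) -> List[str]:
    """Decide which signature hints to emit for one file (simplified)."""
    if widgets:
        # Widget-first: real widgets suppress the pseudo signature
        return [f"AcroSig:{field_name_hint(w) or 'unnamed'}" for w in widgets]
    if acro_sig_count > 0 or vendor_hit:
        # Acro-only dedupe: any number of /Sig fields yields one pseudo entry
        return ["VendorOrAcroOnly"]
    return []


print(plan_signatures([{"/TU": "sig_patient"}], acro_sig_count=2, vendor_hit=True))
print(plan_signatures([], acro_sig_count=3, vendor_hit=False))
```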
Smoke tests
Drop-in smoke tests live under tests/ and cover:
- Vendor-only (multiple markers)
- Acro-only (single pseudo with multiple /Sig)
- Mixed (real widget + vendor markers → widget role, no pseudo)
- Field-name fallbacks (/TU, /TM)
- Parent/Guardian label → representative
- Encrypted PDFs (graceful handling)
Run a subset:
pytest -q -k smoke
# or specific files:
pytest -q tests/test_mixed_widget_vendor_smoke.py
Debugging
If you need to debug or inspect the detection logic, you can run the detector directly on a single PDF from Python:
from pathlib import Path
from sigdetect.config import DetectConfiguration
from sigdetect.detector.pypdf2_engine import PyPDF2Detector
pdf = Path("/path/to/one.pdf")
configuration = DetectConfiguration(
    PdfRoot=pdf.parent,
    OutputDirectory=Path("."),
    Engine="pypdf2",
    Profile="retainer",
    PseudoSignatures=True,
    RecurseXObjects=True,
)
print(PyPDF2Detector(configuration).Detect(pdf).to_dict())
Dev workflow
Project layout
src/
  sigdetect/
    detector/
      base.py
      pypdf2_engine.py
    data/
      role_rules.yml
      vendor_patterns.yml
    cli.py
tests/
pyproject.toml
.pre-commit-config.yaml
Formatting & linting (pre-commit)
# one-time
pip install pre-commit
pre-commit install
# run on all files
pre-commit run --all-files
Hooks: black, isort, ruff, plus pytest (optional).
Ensure your virtualenv folders are excluded in .pre-commit-config.yaml (e.g., ^\.venv).
Typical loop
# run tests
pytest -q
# run only smoke tests while iterating
pytest -q -k smoke
Troubleshooting
Using the wrong Python
which python
python -V
If you see 3.8 or system Python, recreate the venv with 3.11.
ModuleNotFoundError: typer / click / pytest
pip install typer click pytest
Pre-commit reformats files in .venv
exclude: |
  ^(\.venv|\.venv311|dist|build)/
Vendor markers not detected
Set --recurse-xobjects true and enable pseudo signatures. Many providers embed markers in Form XObjects or compressed streams.
Parent/Guardian not recognized
The rules already include a fallback for “Signature of Parent/Guardian”; if your variant differs, add it to role_rules.yml → labels.representative.
License
MIT