Signature detection and role attribution for PDFs

CaseWorks.Automation.CaseDocumentIntake

sigdetect

sigdetect is a small Python library + CLI that detects e-signature evidence in PDFs and infers the signer role (e.g., patient, attorney, representative).

It looks for:

Real signature form fields (/Widget annotations with /FT /Sig)
AcroForm signature fields present only at the document level
Common vendor markers (e.g., DocuSign, “Signature Certificate”)
Page labels (like “Signature of Patient” or “Signature of Parent/Guardian”)
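
To make the first check concrete, here is a minimal sketch of a signature-widget test. It works on plain Python dicts standing in for annotation dictionaries rather than on a real PDF parser. The helper names (inherited_value, is_signature_widget) and the dict layout are assumptions for illustration, not sigdetect's implementation. Because /FT is an inheritable field attribute in PDF forms, the sketch also follows /Parent links.

```python
from typing import Any, Mapping, Optional


def inherited_value(annotation: Mapping[str, Any], key: str) -> Optional[Any]:
    """Return the value for key, walking /Parent links because form-field
    attributes such as /FT may be defined on an ancestor field."""
    node: Optional[Mapping[str, Any]] = annotation
    seen = set()
    while node is not None and id(node) not in seen:
        seen.add(id(node))  # guard against malformed, cyclic /Parent chains
        if key in node:
            return node[key]
        node = node.get("/Parent")
    return None


def is_signature_widget(annotation: Mapping[str, Any]) -> bool:
    """True for a /Widget annotation whose (possibly inherited) type is /Sig."""
    return (
        annotation.get("/Subtype") == "/Widget"
        and inherited_value(annotation, "/FT") == "/Sig"
    )


# Hypothetical annotation dictionaries for illustration only
parent_field = {"/FT": "/Sig", "/T": "sig_patient"}
widget = {"/Subtype": "/Widget", "/Parent": parent_field}
text_box = {"/Subtype": "/Widget", "/FT": "/Tx"}

print(is_signature_widget(widget))    # True
print(is_signature_widget(text_box))  # False
```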

It returns a structured summary per file (pages, counts, roles, hints, etc.) that can be used downstream.

Quick start
CLI usage
Library usage
Result schema
Configuration & rules
Smoke tests
Dev workflow
Troubleshooting
License

Quick start

Requirements

Python 3.9+ (developed & tested on 3.11)
macOS / Linux / WSL

Setup

# 1) Create and activate a virtualenv (example uses Python 3.11)
python3.11 -m venv .venv
source .venv/bin/activate

# 2) Install in editable (dev) mode
python -m pip install --upgrade pip
pip install -e .

Sanity check

# Run unit & smoke tests
pytest -q

CLI usage

The project ships a Typer-based CLI (exposed either as sigdetect or runnable via python -m sigdetect.cli, depending on how it is installed).

sigdetect --help
# or
python -m sigdetect.cli --help

Detect (per-file summary)

# Execute detection according to the YAML configuration
sigdetect detect \
  --config ./sample_data/config.yml \
  --profile hipaa            # or: retainer

Notes

The config file controls pdf_root, out_dir, engine, pseudo_signatures, recurse_xobjects, etc.
Engine selection is forced to auto (prefers PyMuPDF for geometry, falls back to PyPDF2); any configured engine value is overridden.
--pseudo-signatures enables a vendor/Acro-only pseudo-signature when no actual /Widget is present (useful for DocuSign / Acrobat Sign receipts).
--recurse-xobjects allows scanning Form XObjects for vendor markers and labels embedded in page resources.
--profile selects tuned role logic:
- hipaa → patient / representative / attorney
- retainer → client / firm (prefers detecting two signatures)
--recursive/--no-recursive toggles whether sigdetect detect descends into subdirectories when hunting for PDFs (recursive by default).
Results output is disabled by default; set write_results: true or pass --write-results when you need results.json (for EDA).
Cropping (--crop-signatures) writes PNG crops to disk by default; enable --crop-docx to write DOCX files instead of PNGs. --crop-bytes embeds base64 PNG data in signatures[].crop_bytes and, when --crop-docx is enabled, embeds DOCX bytes in signatures[].crop_docx_bytes. PyMuPDF is required for crops, and python-docx is required for DOCX output.
Wet detection runs automatically for non-e-sign PDFs when dependencies are available; missing OCR dependencies add a ManualReview:* hint instead of failing. PyMuPDF + Tesseract are required for wet detection.
If the executable is not on PATH, you can always fall back to python -m sigdetect.cli ....
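
The --recursive/--no-recursive behavior can be pictured with a small directory walker. This is a sketch under assumptions: find_pdfs is a hypothetical helper, not part of sigdetect, and the case-insensitive suffix match is a choice made here, not documented behavior.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def find_pdfs(root: Path, recursive: bool = True) -> Iterator[Path]:
    """Yield PDF files under root, descending into subdirectories only when
    recursive is True. Comparing the lowercased suffix also catches .PDF."""
    candidates = root.rglob("*") if recursive else root.glob("*")
    for path in sorted(candidates):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


# Demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "top.pdf").write_bytes(b"")
    (root / "sub" / "nested.PDF").write_bytes(b"")
    (root / "notes.txt").write_text("not a pdf")

    recursive_names = [p.name for p in find_pdfs(root)]
    flat_names = [p.name for p in find_pdfs(root, recursive=False)]

print(recursive_names)  # both PDFs, including the one in sub/
print(flat_names)       # only the top-level PDF
```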

EDA (quick aggregate stats)

sigdetect eda \
  --config ./sample_data/config.yml

sigdetect eda expects results.json; enable write_results: true when running detect.
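
For a sense of what aggregate stats over results.json can look like, here is a minimal sketch. It assumes results.json holds a list of per-file summaries shaped like the Result schema. That layout, the summarize helper, and the chosen statistics are assumptions for illustration, not the actual output of sigdetect eda.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute simple aggregate counts over per-file summaries."""
    role_counts: Counter = Counter()
    for entry in results:
        # roles is a ';'-separated string such as "patient;representative"
        roles = entry.get("roles") or ""
        role_counts.update(role for role in roles.split(";") if role)
    return {
        "files": len(results),
        "with_esign": sum(1 for e in results if e.get("esign_found")),
        "scanned": sum(1 for e in results if e.get("scanned_pdf")),
        "total_signatures": sum(e.get("sig_count") or 0 for e in results),
        "roles": dict(role_counts),
    }


# Inline sample standing in for the parsed contents of results.json
sample = json.loads("""[
  {"file": "a.pdf", "esign_found": true, "scanned_pdf": false,
   "sig_count": 2, "roles": "patient;representative"},
  {"file": "b.pdf", "esign_found": false, "scanned_pdf": true,
   "sig_count": 0, "roles": ""}
]""")
stats = summarize(sample)
print(stats)
```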

Library usage

from pathlib import Path
from sigdetect.config import DetectConfiguration
from sigdetect.detector.pypdf2_engine import PyPDF2Detector

configuration = DetectConfiguration(
    PdfRoot=Path("/path/to/pdfs"),
    OutputDirectory=Path("./out"),
    Engine="pypdf2",
    PseudoSignatures=True,
    RecurseXObjects=True,
    Profile="retainer",   # or "hipaa"
)

detector = PyPDF2Detector(configuration)
result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
print(result.to_dict())

Detect(Path) returns a FileResult dataclass; call .to_dict() for the JSON-friendly representation (see Result schema). Each signature entry exposes bounding_box coordinates (PDF points, origin bottom-left). When PNG cropping is enabled, crop_path points at the generated image; when DOCX cropping is enabled, crop_docx_path points at the generated document. Use Engine="auto" for the single-pass defaults, which prefer PyMuPDF (for geometry) when available.
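
Because bounding_box uses PDF points with a bottom-left origin, overlaying it on a rendered page image needs a scale and a vertical flip. The sketch below assumes the box is ordered [x0, y0, x1, y1] (as the Result schema example suggests) and that the page height in points is known. The to_pixel_box helper is hypothetical, not part of sigdetect.

```python
from typing import Sequence, Tuple

POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


def to_pixel_box(
    bbox: Sequence[float], page_height_pt: float, dpi: int = 200
) -> Tuple[int, int, int, int]:
    """Convert a PDF-point box (origin bottom-left) into pixel coordinates
    (left, top, right, bottom) with the origin at the top-left corner."""
    x0, y0, x1, y1 = bbox
    scale = dpi / POINTS_PER_INCH
    left = round(min(x0, x1) * scale)
    right = round(max(x0, x1) * scale)
    # Flip the vertical axis: the larger PDF y is nearer the top of the image
    top = round((page_height_pt - max(y0, y1)) * scale)
    bottom = round((page_height_pt - min(y0, y1)) * scale)
    return left, top, right, bottom


# US Letter page (792 pt tall) rendered at 72 dpi, so one point is one pixel
print(to_pixel_box([10.0, 10.0, 150.0, 40.0], page_height_pt=792.0, dpi=72))
# (10, 752, 150, 782)
```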

Library API (embed in another script)

Minimal, plug-and-play API that returns plain dicts (JSON-ready) without side effects unless you opt into cropping. Engine selection is forced to auto (PyMuPDF preferred) to ensure geometry. Wet detection runs automatically for non-e-sign PDFs; pass runWetDetection=False to skip OCR.

from pathlib import Path

from sigdetect.api import (
    CropSignatureImages,
    DetectMany,
    DetectPdf,
    ScanDirectory,
    ToCsvRow,
    Version,
    get_detector,
)

print("sigdetect", Version())

# 1) Single file → dict
result = DetectPdf(
    "/path/to/file.pdf",
    profileName="retainer",
    includePseudoSignatures=True,
    recurseXObjects=True,
    # runWetDetection=False,  # disable OCR-backed wet detection if desired
)
print(
    result["file"],
    result["pages"],
    result["esign_found"],
    result["sig_count"],
    result["sig_pages"],
    result["roles"],
    result["hints"],
)


# 2) Directory walk (generator of dicts)
for res in ScanDirectory(
    "/path/to/pdfs",
    profileName="hipaa",
    includePseudoSignatures=True,
    recurseXObjects=True,
):
    # store in DB, print, etc.
    pass

# 3) Crop signature snippets for FileResult objects (requires PyMuPDF; DOCX needs python-docx)
detector = get_detector(pdfRoot="/path/to/pdfs", profileName="hipaa")
file_result = detector.Detect(Path("/path/to/pdfs/example.pdf"))
CropSignatureImages(
    "/path/to/pdfs/example.pdf",
    file_result,
    outputDirectory="./signature_crops",
    dpi=200,
)

Result schema

High-level summary (per file):

{
  "file": "example.pdf",
  "size_kb": 123.4,
  "pages": 3,
  "esign_found": true,
  "scanned_pdf": false,
  "mixed": false,
  "sig_count": 2,
  "sig_pages": "1,3",
  "roles": "patient;representative",
  "hints": "AcroSig:sig_patient;VendorText:DocuSign\\s+Envelope\\s+ID",
  "signatures": [
    {
      "page": 1,
      "field_name": "sig_patient",
      "role": "patient",
      "score": 5,
      "scores": { "field": 3, "page_label": 2 },
      "evidence": ["field:patient", "page_label:patient"],
      "hint": "AcroSig:sig_patient",
      "render_type": "typed",
      "bounding_box": [10.0, 10.0, 150.0, 40.0],
      "crop_path": "signature_crops/example/sig_01_patient.png",
      "crop_docx_path": null
    },
    {
      "page": null,
      "field_name": "vendor_or_acro_detected",
      "role": "representative",
      "score": 6,
      "scores": { "page_label": 4, "general": 2 },
      "evidence": ["page_label:representative(parent/guardian)", "pseudo:true"],
      "hint": "VendorOrAcroOnly",
      "render_type": "typed",
      "bounding_box": null,
      "crop_path": null
    }
  ]
}
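
In the summary above, sig_pages and roles are delimited strings rather than lists. A small normalizer can make them easier to consume downstream. This is a sketch only: split_list and normalize_summary are hypothetical helpers, and the separators are inferred from the example ("," for pages, ";" for roles).

```python
from typing import Any, Dict, List


def split_list(value: Any, separator: str) -> List[str]:
    """Split a delimited summary string, tolerating None and empty strings."""
    return [part.strip() for part in str(value or "").split(separator) if part.strip()]


def normalize_summary(summary: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with sig_pages as a list of ints and roles as a list of strings."""
    normalized = dict(summary)
    normalized["sig_pages"] = [int(p) for p in split_list(summary.get("sig_pages"), ",")]
    normalized["roles"] = split_list(summary.get("roles"), ";")
    return normalized


summary = {"file": "example.pdf", "sig_pages": "1,3", "roles": "patient;representative"}
normalized = normalize_summary(summary)
print(normalized["sig_pages"], normalized["roles"])
# [1, 3] ['patient', 'representative']
```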

Field notes

esign_found is true if any signature widget, AcroForm /Sig field, or vendor marker is detected.
scanned_pdf is a heuristic: pages with images only and no extractable text.
mixed means both esign_found and scanned_pdf are true.
roles summarizes unique non-unknown roles across signatures.
In the retainer profile, the emitter prefers two signatures (client + firm), often on the same page.
signatures[].bounding_box reports the widget rectangle in PDF points (origin bottom-left).
signatures[].crop_path is populated when PNG crops are generated (via CLI --crop-signatures or CropSignatureImages).
signatures[].crop_docx_path is populated when DOCX crops are generated (--crop-docx or docx=True).
signatures[].crop_bytes contains base64 PNG data when CLI --crop-bytes is enabled.
signatures[].crop_docx_bytes contains base64 DOCX data when --crop-docx and --crop-bytes are enabled together.
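
When crop_bytes is populated, the base64 payload can be turned back into image bytes. The sketch below is a minimal example under assumptions: decode_crops is a hypothetical helper, and the PNG signature check is an extra safety step added here, not something sigdetect is known to do.

```python
import base64
import binascii
from typing import Any, Dict

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte signature that starts every PNG


def decode_crops(summary: Dict[str, Any]) -> Dict[int, bytes]:
    """Map the index of each signature entry to its decoded PNG bytes,
    skipping entries without crop_bytes or with undecodable data."""
    decoded: Dict[int, bytes] = {}
    for index, signature in enumerate(summary.get("signatures") or []):
        encoded = signature.get("crop_bytes")
        if not encoded:
            continue
        try:
            data = base64.b64decode(encoded, validate=True)
        except (binascii.Error, ValueError):
            continue  # not valid base64
        if data.startswith(PNG_MAGIC):
            decoded[index] = data
    return decoded


# Placeholder payload: a PNG signature plus filler bytes, not a real image
fake_png = PNG_MAGIC + b"placeholder"
summary = {
    "signatures": [
        {"crop_bytes": base64.b64encode(fake_png).decode("ascii")},
        {"crop_bytes": None},
        {"crop_bytes": "not base64!"},
    ]
}
crops = decode_crops(summary)
print(sorted(crops))  # [0]
```

Each value can then be written to disk with pathlib.Path.write_bytes.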

Configuration & rules

Built-in rules live under src/sigdetect/data/:

vendor_patterns.yml – vendor byte/text patterns (e.g., DocuSign, Acrobat Sign).
role_rules.yml – signer-role logic:
- labels – strong page labels (e.g., “Signature of Patient”, including Parent/Guardian cases)
- general – weaker role hints in surrounding text
- field_hints – field-name keywords (e.g., sig_patient)
- doc_hard – strong document-level triggers (relationship to patient, “minor/unable to sign”, first-person consent)
- weights – scoring weights for the above
role_rules.retainer.yml – retainer-specific rules (labels for client/firm, general tokens, and field hints).
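
One way to picture how weights turn rule hits into a role is a weighted sum per role. The sketch below is illustrative only: the weight values, the category names (borrowed from the scores keys in the Result schema example), and the score_roles and pick_role helpers are assumptions, not the contents of role_rules.yml or sigdetect's scoring code.

```python
from typing import Dict, Tuple

# Hypothetical weights per evidence category
WEIGHTS: Dict[str, int] = {"field": 3, "page_label": 2, "general": 1}


def score_roles(
    hits: Dict[str, Dict[str, int]], weights: Dict[str, int]
) -> Dict[str, int]:
    """Sum weight * hit count over the evidence categories of each role."""
    return {
        role: sum(weights.get(category, 0) * count for category, count in categories.items())
        for role, categories in hits.items()
    }


def pick_role(totals: Dict[str, int]) -> Tuple[str, int]:
    """Return the best-scoring role, or ("unknown", 0) when nothing scored."""
    if not totals:
        return "unknown", 0
    role, score = max(totals.items(), key=lambda item: item[1])
    return (role, score) if score > 0 else ("unknown", 0)


# One field-name hit and one page-label hit for patient, one weak hint for attorney
hits = {
    "patient": {"field": 1, "page_label": 1},
    "attorney": {"general": 1},
}
totals = score_roles(hits, WEIGHTS)
print(totals)             # {'patient': 5, 'attorney': 1}
print(pick_role(totals))  # ('patient', 5)
```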

You can keep one config YAML per dataset, e.g.:

# ./sample_data/config.yml (example)
pdf_root: ./pdfs
out_dir: ./sigdetect_out
engine: auto
write_results: false
pseudo_signatures: true
recurse_xobjects: true
profile: retainer    # or: hipaa
crop_signatures: false   # enable to write PNG crops (requires pymupdf)
crop_docx: false         # enable to write DOCX crops instead of PNGs (requires python-docx)
# crop_output_dir: ./signature_crops
crop_image_dpi: 200
detect_wet_signatures: false   # kept for compatibility; non-e-sign PDFs still trigger OCR
wet_ocr_dpi: 200
wet_ocr_languages: eng
wet_precision_threshold: 0.82

YAML files can be customized or loaded at runtime (see CLI --config, if available, or import the patterns and pass them into the engine).

Key detection behaviors

Widget-first in mixed docs: if a real /Widget exists, no pseudo “VendorOrAcroOnly” signature is emitted.
Acro-only dedupe: multiple /Sig fields at the document level collapse to a single pseudo signature.
Parent/Guardian label: “Signature of Parent/Guardian” maps to the representative role.
Field-name fallbacks: role hints are pulled from /T, /TU, or /TM (in that order).
Retainer heuristics:
- Looks for client and firm labels/tokens; boosts pages with law-firm markers (LLP/LLC/PA/PC) and “By:” blocks.
- Applies an anti-front-matter rule to reduce page-1 false positives (e.g., letterheads, firm mastheads).
- When only vendor/Acro clues exist (no widgets), emits two pseudo signatures targeting likely pages.
Wet detection (non-e-sign): The CLI runs an OCR-backed pass (PyMuPDF + pytesseract/Tesseract) after e-sign detection whenever no e-sign evidence is found. It emits RenderType="wet" signatures for high-confidence label/stroke pairs in the lower page region. When an image-based signature is present on a page, label-only OCR candidates are suppressed unless a stroke is detected. Results are deduped to the top signature per role (dropping unknown). Missing OCR dependencies add a ManualReview:* hint instead of failing.
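
The widget-first and Acro-only dedupe rules above can be sketched as a small decision function. This is an illustration under assumptions: plan_signatures and its inputs are hypothetical, the output keys are simplified, and the retainer behavior of emitting two pseudo signatures is not modeled.

```python
from typing import Dict, List


def plan_signatures(
    widget_fields: List[str],
    acro_sig_fields: List[str],
    vendor_hits: List[str],
    pseudo_enabled: bool = True,
) -> List[Dict[str, object]]:
    """Decide which signature entries to emit for one document."""
    if widget_fields:
        # Widget-first: real widgets win, so no pseudo signature is added
        return [
            {"field_name": name, "hint": f"AcroSig:{name}", "pseudo": False}
            for name in widget_fields
        ]
    if pseudo_enabled and (acro_sig_fields or vendor_hits):
        # Acro-only dedupe: any number of document-level clues collapse to one entry
        return [
            {"field_name": "vendor_or_acro_detected", "hint": "VendorOrAcroOnly", "pseudo": True}
        ]
    return []


mixed = plan_signatures(["sig_patient"], [], ["DocuSign"])
acro_only = plan_signatures([], ["Signature1", "Signature2"], [])
disabled = plan_signatures([], [], ["DocuSign"], pseudo_enabled=False)

print(len(mixed), mixed[0]["pseudo"])          # 1 False
print(len(acro_only), acro_only[0]["hint"])    # 1 VendorOrAcroOnly
print(disabled)                                # []
```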

Smoke tests

Drop-in smoke tests live under tests/ and cover:

Vendor-only (multiple markers)
Acro-only (single pseudo with multiple /Sig)
Mixed (real widget + vendor markers → widget role, no pseudo)
Field-name fallbacks (/TU, /TM)
Parent/Guardian label → representative
Encrypted PDFs (graceful handling)

Run a subset:

pytest -q -k smoke
# or specific files:
pytest -q tests/test_mixed_widget_vendor_smoke.py

Debugging

If you need to debug or inspect the detection logic, you can run the CLI with --debug, or call the detector directly on a single file from Python:

from pathlib import Path
from sigdetect.config import DetectConfiguration
from sigdetect.detector.pypdf2_engine import PyPDF2Detector

pdf = Path("/path/to/one.pdf")
configuration = DetectConfiguration(
    PdfRoot=pdf.parent,
    OutputDirectory=Path("."),
    Engine="pypdf2",
    Profile="retainer",
    PseudoSignatures=True,
    RecurseXObjects=True,
)
print(PyPDF2Detector(configuration).Detect(pdf).to_dict())

Dev workflow

Project layout

src/
  sigdetect/
    detector/
      base.py
      pypdf2_engine.py
    data/
      role_rules.yml
      vendor_patterns.yml
    cli.py
tests/
pyproject.toml
.pre-commit-config.yaml

Formatting & linting (pre-commit)

# one-time
pip install pre-commit
pre-commit install

# run on all files
pre-commit run --all-files

Hooks: black, isort, ruff, plus pytest (optional).
Ensure your virtualenv folders are excluded in .pre-commit-config.yaml (e.g., ^\.venv).

Typical loop

# run tests
pytest -q

# run only smoke tests while iterating
pytest -q -k smoke

Troubleshooting

Using the wrong Python

which python
python -V

If you see 3.8 or system Python, recreate the venv with 3.11.

ModuleNotFoundError: typer / click / pytest

pip install typer click pytest

Pre-commit reformats files in .venv

# single-line scalar: a "|" block would append a newline to the regex, so it would never match
exclude: '^(\.venv|\.venv311|dist|build)/'

Vendor markers not detected
Pass --recurse-xobjects (or set recurse_xobjects: true in the config) and enable pseudo signatures. Many providers embed markers in Form XObjects or compressed streams.

Parent/Guardian not recognized
The rules already include a fallback for “Signature of Parent/Guardian”; if your variant differs, add it to role_rules.yml → labels.representative.

License

MIT

Project details

Release history

0.5.5 (Feb 25, 2026)
0.5.4 (Feb 19, 2026)
0.5.3 (Feb 4, 2026)
0.5.2 (Jan 29, 2026)
0.5.1 (Jan 28, 2026), this version
0.5.0 (Jan 28, 2026)
0.4.0 (Dec 17, 2025)
0.3.1 (Dec 9, 2025)
0.3.0 (Dec 9, 2025)
0.2.0 (Nov 14, 2025)
0.1.1 (Nov 3, 2025)
0.1.0 (Nov 3, 2025)

