USDM4 Protocol Import - M11, CPT and Legacy formats

usdm4_protocol

A unified Python package for importing clinical trial protocol documents into USDM v4 (Unified Study Definitions Model). Converts protocol documents from three industry formats into a structured, standards-compliant USDM JSON representation.

Supported Formats

ICH M11: The ICH M11 Clinical electronic Structured Harmonised Protocol (CeSHarP) template. Accepts .docx files that follow the M11 section structure. Optional AI-assisted extraction is available via the Anthropic Claude API.

TransCelerate CPT: The Common Protocol Template defined by TransCelerate BioPharma. Accepts .docx files in CPT format.

Legacy PDF: Freeform sponsor protocol documents in PDF format. Uses HTML-based extraction with a pluggable PDF converter backend (docling or PyMuPDF).

Installation

The package requires Python 3.12 or later.

# Core package (M11 and CPT DOCX import)
pip install usdm4_protocol

# Extras are quoted so that shells such as zsh do not treat [...] as a glob pattern

# With lightweight PDF support (~20 MB, suitable for Docker/Fly.io)
pip install "usdm4_protocol[pdf]"

# With high-accuracy PDF support via docling (large, includes ML models)
pip install "usdm4_protocol[pdf-docling]"

# With AI-assisted extraction (Anthropic Claude)
pip install "usdm4_protocol[ai]"

# With AI-assisted extraction (Google Gemini)
pip install "usdm4_protocol[ai-gemini]"

# Everything
pip install "usdm4_protocol[all]"

Quick Start

Unified Entry Point

The USDM4Protocol class provides a single interface across all formats:

from usdm4_protocol import USDM4Protocol

protocol = USDM4Protocol()

# Import an M11 DOCX
wrapper = protocol.from_m11("path/to/protocol_m11.docx")

# Import a CPT DOCX
wrapper = protocol.from_cpt("path/to/protocol_cpt.docx")

# Import a legacy PDF
wrapper = protocol.from_pdf("path/to/protocol.pdf")

# Auto-detect format from file extension
wrapper = protocol.from_file("path/to/protocol.docx")

# Access the USDM JSON
json_str = wrapper.to_json()

# Check for errors
print(protocol.errors.dump(0))
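
To keep the result, the JSON string can be written to disk. The helper below is a minimal sketch and not part of the package: save_usdm_json is a hypothetical name, and it assumes only that the wrapper exposes to_json() returning a string, as in the example above.

```python
from pathlib import Path


def save_usdm_json(wrapper, target: str) -> Path:
    """Write wrapper.to_json() to `target` as UTF-8 and return the path.

    Hypothetical helper: it assumes only that `wrapper` has a to_json()
    method that returns a JSON string.
    """
    path = Path(target)
    # Create the output directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(wrapper.to_json(), encoding="utf-8")
    return path
```

Called as save_usdm_json(wrapper, "out/usdm_output.json"), it produces a file that can then be passed to the export methods described under Exporting.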

Format-Specific Handlers

Each format also has its own handler class for direct use:

from usdm4_protocol.m11 import USDM4M11
from usdm4_protocol.cpt import USDM4CPT
from usdm4_protocol.legacy import USDM4Legacy

# M11 with AI-assisted extraction
m11 = USDM4M11()
wrapper = m11.from_docx("protocol.docx", use_ai=True)

# CPT
cpt = USDM4CPT()
wrapper = cpt.from_docx("protocol.docx")

# Legacy PDF with explicit converter choice
legacy = USDM4Legacy()
wrapper = legacy.from_pdf("protocol.pdf", pdf_converter="pymupdf")

Exporting

Convert USDM JSON back to an HTML document view:

from usdm4_protocol import USDM4Protocol

protocol = USDM4Protocol()

# Export as M11-structured HTML
html = protocol.to_html("usdm_output.json", template="M11")

# Export as CPT-structured HTML
html = protocol.to_html("usdm_output.json", template="CPT")

# Generate data views (title page, etc.)
views = protocol.data_views("usdm_output.json")

PDF Converter Options

The legacy PDF handler supports two backends for converting PDF to HTML. The converter is selected via the pdf_converter parameter.

Converter	Install Extra	Size	Best For
PyMuPDF	`pdf`	~20 MB	Docker deployments, Fly.io, lightweight environments
docling	`pdf-docling`	~2 GB+	Maximum accuracy, complex table extraction, GPU-accelerated environments

In "auto" mode (the default), docling is preferred when available, falling back to PyMuPDF. Both converters produce HTML output that feeds into the same downstream extraction pipeline.

# Explicit selection
wrapper = protocol.from_pdf("protocol.pdf", pdf_converter="pymupdf")
wrapper = protocol.from_pdf("protocol.pdf", pdf_converter="docling")

# Auto-select best available (default)
wrapper = protocol.from_pdf("protocol.pdf")

Processing Pipeline

All three handlers follow the same internal pipeline, implemented in a shared BaseImport base class:

Load → Extract → Assemble → Wrapper

Load — Reads the source document and converts it to a normalised HTML representation. For DOCX formats this uses raw_docx; for PDF it uses the selected converter (docling or PyMuPDF). The legacy PDF path also cleans (removes table of contents) and splits the HTML into logical sections.
Extract — Parses the HTML to identify structured data: title page fields, sponsor information, study design, amendments, inclusion/exclusion criteria, Schedule of Activities tables, and section content. Extraction uses a two-layer architecture:
- Common layer (common/extract/): Format-agnostic extractors including AI-assisted extraction (ContentExtractor, IEExtractor), combined AI + heuristic row classification (CombinedRowClassifier), and section discovery utilities.
- Format layer (e.g., m11/import_/extract/): Format-specific code that locates sections and delegates to common extractors.
Assemble — Maps the extracted data into USDM v4 model objects via usdm4, producing a Wrapper instance containing the complete study definition.
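
The shared pipeline has the shape of a template method: the base class fixes the order of the steps and each format supplies its own load and extract. The sketch below illustrates that shape only. PipelineSketch and ToyImport are hypothetical names, the dictionaries stand in for USDM v4 objects, and nothing here reflects the actual BaseImport signatures.

```python
from abc import ABC, abstractmethod


class PipelineSketch(ABC):
    """Illustrative stand-in for a shared import base class (hypothetical)."""

    def run(self, path: str) -> dict:
        html = self.load(path)      # Load: source document -> normalised HTML
        data = self.extract(html)   # Extract: HTML -> structured fields
        return self.assemble(data)  # Assemble: fields -> study definition

    @abstractmethod
    def load(self, path: str) -> str: ...

    @abstractmethod
    def extract(self, html: str) -> dict: ...

    def assemble(self, data: dict) -> dict:
        # Shared step: a real assembler would build USDM v4 objects here.
        return {"study": data}


class ToyImport(PipelineSketch):
    """Format-specific subclass: only load and extract differ per format."""

    def load(self, path: str) -> str:
        return f"<h1>Protocol from {path}</h1>"

    def extract(self, html: str) -> dict:
        # Strip the heading tags to recover the title text.
        return {"title": html.removeprefix("<h1>").removesuffix("</h1>")}
```

With this layout, adding a new format means writing one subclass, while the step order and the assembly logic stay in a single place.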

Schedule of Activities (SoA) Extraction

The SoA extraction pipeline converts protocol schedule tables into USDM v4 timeline, epoch, encounter, and activity objects. It runs eight feature extractors in sequence on each SoA table: ActivityRow, Notes, Timepoints, Epochs, Visits, Windows, Activities, and Conditions.

A CombinedRowClassifier runs both an AI classifier (single Claude API call) and a heuristic classifier against the table's header rows, then merges their results using consensus logic. Where both agree, the result is used directly. Where they conflict, the AI classification is preferred. The merged classification provides row-type hints to each feature extractor, so they can target the correct row without re-scanning.
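
The consensus rule can be sketched with two dictionaries that map a header-row index to a row type. This is an assumption-laden illustration rather than the CombinedRowClassifier implementation: merge_row_labels is a hypothetical name, the label strings are invented, and keeping a label that only one classifier produced is a choice made here, not something the text above specifies.

```python
def merge_row_labels(
    ai: dict[int, str], heuristic: dict[int, str]
) -> tuple[dict[int, str], list[int]]:
    """Merge two row classifications, preferring the AI label on conflict.

    Returns the merged mapping and the row indices where the two disagreed.
    """
    merged: dict[int, str] = {}
    conflicts: list[int] = []
    for row in sorted(set(ai) | set(heuristic)):
        ai_label, heuristic_label = ai.get(row), heuristic.get(row)
        if ai_label is not None and heuristic_label is not None and ai_label != heuristic_label:
            conflicts.append(row)
        # Agreement: both labels are equal. Conflict: the AI label wins.
        # Only one present (assumption): use whichever label exists.
        merged[row] = ai_label if ai_label is not None else heuristic_label
    return merged, conflicts
```

Returning the conflicting rows alongside the merged result makes it easy to log or review disagreements between the two classifiers.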

SoA section discovery (for legacy PDFs) uses a multi-pass strategy with 10 search terms, exhaustive matching across all candidate sections, and OCR-tolerant fallback. Sections containing <table> elements are preferred over those without. When a matched section has no table, child subsections are checked automatically.

When no epoch row exists in the SoA table but timepoints are present, a default "Study Period" epoch is synthesised automatically, allowing the assembler to build a schedule timeline even for simple tables.
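
A minimal sketch of that fallback, assuming epochs and timepoints are held as plain lists of dictionaries. The function name and the dictionary keys are hypothetical and chosen for illustration; only the "Study Period" label comes from the description above.

```python
def ensure_epochs(epochs: list[dict], timepoints: list[dict]) -> list[dict]:
    """Return `epochs` unchanged, or one default epoch spanning all timepoints.

    The default is synthesised only when no epochs were extracted and at
    least one timepoint exists.
    """
    if epochs or not timepoints:
        return epochs
    return [
        {
            "label": "Study Period",
            "first_timepoint": 0,
            "last_timepoint": len(timepoints) - 1,
        }
    ]
```

For a table with two timepoints and no epoch row, this yields a single epoch covering indices 0 to 1, which gives an assembler something to attach every timepoint to.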

Package Structure

src/usdm4_protocol/
├── __init__.py              # USDM4Protocol unified entry point
├── common/                  # Shared utilities
│   ├── ai/                  # AI providers (Claude, fallback)
│   ├── assemble/            # USDM assembly logic
│   ├── base_import.py       # BaseImport — shared Load→Extract→Assemble pipeline
│   ├── extract/             # Common extractors (IE criteria, row classifiers, section finder)
│   ├── html/                # HTML table expansion, BeautifulSoup helpers
│   └── load/                # Shared load utilities
├── m11/                     # ICH M11 handler
│   ├── import_/             # Load DOCX → Extract → Assemble
│   ├── export/              # USDM → HTML export
│   ├── specification/       # M11 section definitions (YAML)
│   ├── elements/            # M11 element definitions
│   └── views/               # Document and data views
├── cpt/                     # TransCelerate CPT handler
│   ├── import_/             # Load DOCX → Extract → Assemble
│   └── views/               # Document views
├── legacy/                  # Legacy PDF handler
│   └── import_/
│       ├── load/            # PDF → HTML conversion
│       │   ├── to_html.py          # Factory selecting converter backend
│       │   ├── to_html_base.py     # Abstract base class
│       │   ├── to_html_docling.py  # Docling implementation
│       │   ├── to_html_pymupdf.py  # PyMuPDF implementation
│       │   ├── clean_html.py       # HTML normalisation (TOC removal)
│       │   └── split_html.py       # Section splitting by numbered headings
│       └── extract/         # HTML → structured data
└── soa/                     # Schedule of Activities extractor
    ├── soa_extractor.py     # Main SoA extraction orchestrator
    ├── soa_model.py         # SoA data → HTML table generation
    └── features/            # Feature extractors (epochs, visits, timepoints,
                             # activities, windows, conditions, notes) and
                             # heuristic row classifier

Development

# Clone and install with dev dependencies
git clone https://github.com/data4knowledge/usdm4_protocol.git
cd usdm4_protocol
pip install -e ".[dev,ai,pdf]"

# Run tests
pytest

# Lint
ruff check src/ tests/

Test Structure

Tests mirror the source layout under tests/:

tests/m11/ — Extraction, assembly, export, specifications, views
tests/soa/ — Schedule of Activities feature extraction, model generation, row classification
tests/common/ — Shared extractors, AI providers, HTML utilities, combined row classifier
tests/legacy/ — PDF loading, HTML cleaning/splitting, IE extraction, USDM integration
tests/cpt/ — CPT import, title page extraction, document views

Integration tests (in test_integration.py and test_*_ai_integration.py files) require real test files and/or API keys. They are marked with @pytest.mark.integration and @pytest.mark.ai respectively. A .test_env file is loaded at test collection time via conftest.py to provide environment variables such as ANTHROPIC_API_KEY and GEMINI_API_KEY. Integration tests use session-scoped caching to avoid redundant protocol loading across test methods.
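
Loading a .test_env file at collection time might look like the sketch below. It is not the project's conftest.py: load_env_file is a hypothetical helper, and the KEY=VALUE line format with # comments is an assumption about the file's layout.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".test_env", environ=None) -> list[str]:
    """Read KEY=VALUE lines from `path` into `environ` (os.environ by default).

    Values that are already set are left untouched. Returns the keys found
    in the file; a missing file yields an empty list.
    """
    environ = os.environ if environ is None else environ
    file = Path(path)
    if not file.is_file():
        return []
    seen: list[str] = []
    for raw in file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        environ.setdefault(key, value)  # existing values win
        seen.append(key)
    return seen
```

Calling load_env_file() at module level in conftest.py would make ANTHROPIC_API_KEY and GEMINI_API_KEY available before any test module is imported, while still letting values exported in the shell take precedence.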

Dependencies

Core: usdm4, raw_docx, simple_error_log, beautifulsoup4, python-dateutil

Optional: pymupdf (lightweight PDF), docling (high-accuracy PDF), anthropic + d4k_ms_base (Claude AI extraction), google-genai + d4k_ms_base (Gemini AI extraction)

Testing

Commands

The default addopts in pyproject.toml excludes integration and ai markers and enables coverage, so plain pytest runs unit tests with coverage reporting.

Command	What runs	Speed
`pytest`	Unit tests only (default excludes integration + ai)	Fast (seconds)
`pytest -m "not ai"`	Unit + integration tests, no API calls	Medium
`pytest -m ""`	Full suite including AI integration tests	Slow (requires API keys)

Coverage

Coverage is enabled by default via addopts in pyproject.toml. To run without coverage:

pytest -o "addopts=-m 'not integration and not ai'"

Building the Package

python3 -m build --sdist --wheel

Publishing

twine upload dist/*

License

AGPL-3.0

Release history

Version	Release date
0.6.1	May 4, 2026
0.6.0	May 4, 2026
0.5.0	Apr 30, 2026
0.4.0	Apr 26, 2026
0.3.0	Apr 10, 2026
0.2.0	Apr 5, 2026
0.1.0	Apr 1, 2026

Download files

Source Distribution

usdm4_protocol-0.6.0.tar.gz (50.1 MB)

Uploaded May 4, 2026 Source

Built Distribution

usdm4_protocol-0.6.0-py3-none-any.whl (250.0 kB)

Uploaded May 4, 2026 Python 3

File details

Details for the file usdm4_protocol-0.6.0.tar.gz.

File metadata

Download URL: usdm4_protocol-0.6.0.tar.gz
Upload date: May 4, 2026
Size: 50.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for usdm4_protocol-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`9bc334e3e33719e2bb136ce9ed60f2937675f295d87a96e2a0c143628c8ef894`
MD5	`d9d1640ef4d49dcabf7cbcaf00236d32`
BLAKE2b-256	`dae5550e88ecc96300e758768cf3970b7cdc6e28b32ec30d1bda0ed63f31795e`

File details

Details for the file usdm4_protocol-0.6.0-py3-none-any.whl.

File metadata

Download URL: usdm4_protocol-0.6.0-py3-none-any.whl
Upload date: May 4, 2026
Size: 250.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for usdm4_protocol-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ba3d53f8aab311a43cc399744e86af2a4a2fcf6e6d5f71f8d1bd9c279a4917b8`
MD5	`877bd6e481a87637baf6ef9621c31a54`
BLAKE2b-256	`eae3693be3c4b13083d60e0f39f52fefef8be6d4a84266baa6b5cdf8bf6dce75`
