A Python package for using the CDISC TransCelerate USDM, version 4

USDM4

A Python library for the CDISC TransCelerate Unified Study Data Model (USDM) Version 4.

Overview

USDM4 provides tools for building, assembling, validating, converting, and expanding clinical study definitions using the USDM Version 4 specification. It enables programmatic creation and manipulation of machine-readable study definitions that conform to CDISC standards.

Features

  • Build - Create USDM4 study structures programmatically with a fluent builder interface
  • Assemble - Orchestrate complete study assembly from structured input data
  • Validate - Validate USDM4 JSON files via two complementary engines (the bundled d4k Python rule library and the CDISC CORE engine)
  • Load - Load USDM4 data from JSON files or dictionaries
  • Convert - Transform USDM data structures between formats
  • Expand - Expand schedule timelines for study designs

Installation

pip install usdm4

Requirements

  • Python 3.12 or higher (required by the cdisc-rules-engine dependency)

Quick Start

from usdm4 import USDM4
from simple_error_log.errors import Errors

# Initialize
usdm = USDM4()
errors = Errors()

# Create a minimal study
wrapper = usdm.minimum("My Study", "SPONSOR-001", "1.0", errors)

# Access the study
print(wrapper.study.id)

Usage

Loading Studies

Load a study from a JSON file:

errors = Errors()
wrapper = USDM4().load("study.json", errors)

Load from a dictionary:

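data = {...}  # placeholder: replace with a dictionary holding the USDM4 study
wrapper = USDM4().loadd(data, errors)  # reuses the Errors instance created above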

Validating Studies

USDM4 provides two complementary validation engines, both invoked from the USDM4 facade.

The d4k engine — usdm4's own Python rule library — runs the V4 DDF rule catalogue. It is fast enough for tight feedback loops and has no external dependencies:

result = USDM4().validate("study.json")

if result.passed_or_not_implemented():
    print("Validation passed")
else:
    print("Validation failed")

The CDISC CORE engine — wrapping the cdisc-rules-engine package — runs the same catalogue in JSONata against authoritative CDISC sources. Use it as an independent cross-check; it needs a CDISC Library API key and downloads rule definitions on first use:

import os
os.environ["CDISC_LIBRARY_API_KEY"] = "your-api-key-here"

result = USDM4().validate_core("study.json")

if result.is_valid:
    print("CORE validation passed")
else:
    print(result.format_text())

To pre-populate the CORE cache (useful at server startup or before running in offline environments):

USDM4().prepare_core()

For running both engines and aligning their per-rule results from the command line, see validate/README.md.

CORE validation API

For more control, use CoreValidator directly without the USDM4 facade:

from usdm4.core import CoreValidator

validator = CoreValidator(
    cache_dir="/path/to/my/cache",
    api_key="my-api-key",
)
result = validator.validate("study.json", version="4-0")

validate_core(file_path, version="4-0", cache_dir=None, api_key=None) parameters:

  • file_path — Path to the USDM JSON file.
  • version — "3-0" or "4-0" (default "4-0").
  • cache_dir — Optional path to the cache directory. Defaults to a platform-appropriate location via platformdirs (see "CORE validation cache" below).
  • api_key — Optional CDISC Library API key. Falls back to CDISC_LIBRARY_API_KEY or CDISC_API_KEY environment variables.
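
The api_key fallback described above can be pictured with a small standalone sketch. It does not call usdm4; resolve_api_key is a hypothetical helper, and the order in which the two environment variables are checked is an assumption, since the text only says that both are consulted.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the explicit key if given, else the first non-empty environment value."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    # Assumed lookup order; the library may check these in a different order.
    for name in ("CDISC_LIBRARY_API_KEY", "CDISC_API_KEY"):
        value = env.get(name)
        if value:
            return value
    return None


# An explicit argument wins over anything found in the environment mapping.
chosen = resolve_api_key("explicit-key", {"CDISC_API_KEY": "env-key"})
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.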

CoreValidationResult properties:

  • is_valid — True if no validation findings were reported.
  • finding_count — Total number of individual validation errors across all findings.
  • execution_error_count — Number of rule execution errors (rules that don't apply to this file).
  • rules_executed — Total rules that were run.
  • rules_skipped — Rules skipped due to known engine bugs (see docs/cre_issues.md).
  • ct_packages_available — Number of CT packages known to CDISC Library.
  • ct_packages_loaded — List of CT package names loaded for this file.
  • findings — List of CoreRuleFinding objects.

CoreValidationResult methods:

  • format_text() — Human-readable text report.
  • to_dict() — JSON-serialisable dictionary.

CoreRuleFinding — one rule that reported errors:

  • rule_id — The CORE rule identifier (e.g. "CORE-000996").
  • description — Human-readable description of what the rule checks.
  • message — Error message template from the rule.
  • errors — List of error detail dicts.
  • error_count — Number of errors for this rule.
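
To show how the per-rule attributes could roll up into the result-level counts, here is a sketch built on a stand-in class. FindingStub and summarise are hypothetical names that only mirror the attributes listed above; deriving error_count from the length of errors is an assumption, not confirmed library behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FindingStub:
    """Hypothetical stand-in carrying the CoreRuleFinding attributes listed above."""
    rule_id: str
    description: str
    message: str
    errors: List[dict] = field(default_factory=list)

    @property
    def error_count(self) -> int:
        # Assumption: one entry in errors per reported error.
        return len(self.errors)


def summarise(findings: List[FindingStub]) -> Dict[str, object]:
    """Roll per-rule findings up into result-level figures."""
    return {
        "is_valid": not findings,
        "finding_count": sum(f.error_count for f in findings),
        "by_rule": {f.rule_id: f.error_count for f in findings},
    }


summary = summarise([
    FindingStub("CORE-000996", "example description", "example message",
                errors=[{"row": 1}, {"row": 2}]),
])
```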

CoreCacheManager — accessed via validator.cache_manager:

  • cache_dir — The root cache directory path.
  • clear() — Remove all cached resources; they will re-download on next use.
  • ensure_resources() — Download JSONata and XSD schema files if not already cached.

CORE validation cache

The module uses a three-level caching strategy: persistent disk cache, an in-memory cache used by the engine within a single process, and remote download from the CDISC Library on cache miss.
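
As a rough illustration of such a three-level lookup, the sketch below checks memory, then disk, then falls back to a fetch function. ThreeLevelCache and fake_fetch are hypothetical names; the lookup order and the one-JSON-file-per-key layout are assumptions made for the example, not the library's implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class ThreeLevelCache:
    """Look up a resource in memory, then on disk, then via a remote fetch."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], dict]) -> None:
        self.cache_dir = cache_dir
        self.fetch = fetch  # hypothetical stand-in for a CDISC Library download
        self.memory: Dict[str, dict] = {}

    def get(self, key: str) -> dict:
        if key in self.memory:  # level 1: in-process memory
            return self.memory[key]
        path = self.cache_dir / f"{key}.json"
        if path.exists():  # level 2: persistent disk cache
            data = json.loads(path.read_text(encoding="utf-8"))
        else:  # level 3: remote download on a cache miss
            data = self.fetch(key)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(data), encoding="utf-8")
        self.memory[key] = data
        return data


calls: List[str] = []


def fake_fetch(key: str) -> dict:
    """Record each simulated download so the cache behaviour can be observed."""
    calls.append(key)
    return {"resource": key}


with tempfile.TemporaryDirectory() as tmp:
    first = ThreeLevelCache(Path(tmp), fake_fetch)
    first.get("rules/usdm/4-0")   # miss everywhere: fetched, then stored on disk
    first.get("rules/usdm/4-0")   # served from memory
    second = ThreeLevelCache(Path(tmp), fake_fetch)  # simulates a new process
    from_disk = second.get("rules/usdm/4-0")         # served from disk
```

Only the first call reaches fake_fetch; the second instance finds the file written by the first.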

Resource            Location                                  Source on first run
Validation rules    {cache_dir}/rules/usdm/4-0.json           CDISC Library API
CT package list     {cache_dir}/ct/published_packages.json    CDISC Library API
CT codelist data    {cache_dir}/ct/data/{package}.json        CDISC Library API
JSONata functions   {cache_dir}/resources/jsonata/            GitHub (cdisc-rules-engine repo)
XSD schemas         {cache_dir}/resources/schema/xml/         GitHub (cdisc-rules-engine repo)

The default cache_dir is platform-appropriate, resolved via platformdirs:

  • macOS: ~/Library/Caches/usdm4/core/
  • Windows: %LOCALAPPDATA%/usdm4/Cache/core/
  • Linux: ~/.cache/usdm4/core/

For web-server deployments, pass an explicit cache_dir to USDM4(cache_dir=...) or CoreValidator(cache_dir=...). To force a fresh download:

from usdm4.core import CoreValidator
CoreValidator().cache_manager.clear()

Troubleshooting CORE validation

  • "No CDISC API key" — Set CDISC_API_KEY or CDISC_LIBRARY_API_KEY in the environment.
  • Slow first run — The first validation downloads rules, CT packages, and schema files. Subsequent runs use the cache.
  • CT validation failures — Check that codeSystemVersion values in your USDM JSON correspond to published CT packages. result.ct_packages_loaded shows which packages were loaded.
  • Stale cache — If rules or CT packages have been updated upstream, clear the cache with validator.cache_manager.clear().

Engine bugs and workarounds are catalogued in docs/cre_issues.md.

Building Studies

Use the builder for programmatic study creation with access to controlled terminology:

errors = Errors()
builder = USDM4().builder(errors)

# Get CDISC codes
code = builder.cdisc_code("C207616", "Official Study Title")

# Get ISO codes
country = builder.iso3166_code("USA")
language = builder.iso639_code("en")

# Create organizations
sponsor = builder.sponsor("My Pharma Corp")

# Create any USDM4 class
study_version = builder.create("StudyVersion", {"versionNumber": "1.0"})

Assembling Studies

For structured assembly of complete studies from domain-organized input:

errors = Errors()
assembler = USDM4().assembler(errors)

assembler.execute({
    "identification": {...},
    "document": {...},
    "population": {...},
    "study_design": {...},
    "amendments": {...},
    "study": {...}
})

wrapper = assembler.wrapper("MySystem", "1.0")

Assembler JSON Input Structure

The assembler accepts a single dictionary with the following top-level keys, each processed by a dedicated sub-assembler:

{
  "identification": { ... },
  "document": { ... },
  "population": { ... },
  "amendments": { ... },
  "study_design": { ... },
  "soa": { ... },
  "study": { ... }
}

All top-level keys are required except soa, which is optional.
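
Before calling the assembler, it may help to check the input dictionary against that rule. The following is a minimal sketch; check_top_level is a hypothetical helper, not part of usdm4, and flagging unrecognised keys is an extra assumption.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("identification", "document", "population",
                 "amendments", "study_design", "study")
OPTIONAL_KEYS = ("soa",)


def check_top_level(data: Dict[str, Any]) -> Dict[str, List[str]]:
    """Report required keys that are absent and keys that are not recognised."""
    known = set(REQUIRED_KEYS) | set(OPTIONAL_KEYS)
    return {
        "missing": [key for key in REQUIRED_KEYS if key not in data],
        "unknown": sorted(key for key in data if key not in known),
    }


# The amendments key is present even though its value is None.
report = check_top_level({
    "identification": {}, "document": {}, "population": {},
    "amendments": None, "study_design": {}, "study": {},
})
```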


identification

Study identification, titles, identifiers, organizations, and roles.

{
  "titles": {
    "brief": "string",
    "official": "string",
    "public": "string",
    "scientific": "string",
    "acronym": "string"
  },
  "identifiers": [
    {
      "identifier": "string",
      "scope": {
        "standard": "string",
        "non_standard": {
          "type": "string",
          "role": "string | null",
          "name": "string",
          "description": "string",
          "label": "string",
          "identifier": "string",
          "identifierScheme": "string",
          "legalAddress": {
            "lines": ["string"],
            "city": "string",
            "district": "string",
            "state": "string",
            "postalCode": "string",
            "country": "string"
          }
        }
      }
    }
  ],
  "roles": {
    "co_sponsor": {
      "name": "string",
      "address": {
        "lines": ["string"],
        "city": "string",
        "district": "string",
        "state": "string",
        "postalCode": "string",
        "country": "string"
      }
    },
    "local_sponsor": { },
    "device_manufacturer": { }
  },
  "other": {
    "sponsor_signatory": "string | null",
    "medical_expert": "string | null",
    "compound_names": "string | null",
    "compound_codes": "string | null"
  }
}

Notes:

  • titles is optional (defaults to empty). Valid title types: brief, official, public, scientific, acronym.
  • identifiers is optional (defaults to empty list). Each identifier scope must contain either standard or non_standard, not both.
  • Valid standard keys: ct.gov, ema, fda. These resolve to predefined organizations with complete address information.
  • Valid non_standard type values: registry, regulator, healthcare, pharma, lab, cro, gov, academic, medical_device.
  • Valid role values: co-sponsor, manufacturer, investigator, pharmacovigilance, project manager, local sponsor, laboratory, study subject, medical expert, statistician, idmc, care provider, principal investigator, outcomes assessor, dec, clinical trial physician, sponsor, adjudication committee, study site, dsmb, regulatory agency, contract research.
  • roles is optional (defaults to empty). Each role key (co_sponsor, local_sponsor, device_manufacturer) can be null to skip. The address field within each role is optional.
  • other is optional. When present, all four sub-fields are read directly.
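
The scope rules in these notes can be checked before assembly with a short sketch like the one below. check_identifier is a hypothetical helper written for illustration; how usdm4 itself reports such problems is not covered here.

```python
from typing import Any, Dict, List

STANDARD_KEYS = {"ct.gov", "ema", "fda"}
NON_STANDARD_TYPES = {"registry", "regulator", "healthcare", "pharma", "lab",
                      "cro", "gov", "academic", "medical_device"}


def check_identifier(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one identifiers entry (empty if none)."""
    problems: List[str] = []
    scope = entry.get("scope") or {}
    has_standard = "standard" in scope
    has_non_standard = "non_standard" in scope
    if has_standard == has_non_standard:  # both present, or both absent
        problems.append("scope must contain exactly one of 'standard' or 'non_standard'")
    if has_standard and scope["standard"] not in STANDARD_KEYS:
        problems.append(f"unknown standard key: {scope['standard']!r}")
    if has_non_standard:
        org_type = (scope["non_standard"] or {}).get("type")
        if org_type not in NON_STANDARD_TYPES:
            problems.append(f"unknown non_standard type: {org_type!r}")
    return problems


ok = check_identifier({"identifier": "NCT00000000", "scope": {"standard": "ct.gov"}})
```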

document

Protocol document metadata and hierarchical content sections.

{
  "document": {
    "label": "string",
    "version": "string",
    "status": "string",
    "template": "string",
    "version_date": "string"
  },
  "sections": [
    {
      "section_number": "string",
      "section_title": "string",
      "text": "string"
    }
  ]
}

Notes:

  • All fields in document are required.
  • Valid status values: APPROVED, DRAFT, DFT, FINAL, OBSOLETE, PENDING, PENDING REVIEW (case-insensitive).
  • version_date should be in ISO format (e.g. 2024-01-15).
  • Section hierarchy is determined by section_number depth: "1" = level 1, "1.1" = level 2, "1.1.1" = level 3.
  • text content may contain HTML.
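
The depth rule and the status check can be sketched in a few lines. section_level and is_valid_status are hypothetical helpers; tolerating a trailing dot in a section number is an assumption added for the example.

```python
VALID_STATUSES = {"APPROVED", "DRAFT", "DFT", "FINAL", "OBSOLETE",
                  "PENDING", "PENDING REVIEW"}


def section_level(section_number: str) -> int:
    """Count the dot-separated parts: "1" -> 1, "1.1" -> 2, "1.1.1" -> 3."""
    parts = [part for part in section_number.strip().split(".") if part]
    return len(parts)


def is_valid_status(status: str) -> bool:
    """Compare case-insensitively against the documented status values."""
    return status.strip().upper() in VALID_STATUSES


levels = [section_level(number) for number in ("1", "1.1", "1.1.1")]
```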

population

Population definitions and eligibility criteria.

{
  "label": "string",
  "inclusion_exclusion": {
    "inclusion": ["string"],
    "exclusion": ["string"]
  }
}

Notes:

  • All fields are required.
  • Each inclusion and exclusion item is a text string describing the criterion.
  • The label is used to generate the internal name (uppercased, spaces replaced with hyphens).
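
The name derivation from the label amounts to two string operations. A minimal sketch, assuming nothing beyond what the note states (population_name is a hypothetical helper name):

```python
def population_name(label: str) -> str:
    """Uppercase the label and replace spaces with hyphens."""
    return label.upper().replace(" ", "-")


name = population_name("Adult Patients")
```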

amendments

Study amendment information. Can be null or empty to skip amendment processing entirely.

{
  "identifier": "string",
  "summary": "string",
  "reasons": {
    "primary": "string",
    "secondary": "string"
  },
  "impact": {
    "safety_and_rights": {
      "safety": { "substantial": boolean, "reason": "string" },
      "rights": { "substantial": boolean, "reason": "string" }
    },
    "reliability_and_robustness": {
      "reliability": { "substantial": boolean, "reason": "string" },
      "robustness": { "substantial": boolean, "reason": "string" }
    }
  },
  "enrollment": {
    "value": "integer | string",
    "unit": "string"
  },
  "scope": {
    "global": boolean,
    "countries": ["string"],
    "regions": ["string"],
    "sites": ["string"],
    "unknown": ["string"]
  },
  "changes": [
    {
      "section": "string",
      "description": "string",
      "rationale": "string"
    }
  ]
}

Notes:

  • reasons values use CODE:DECODE format (e.g. "C207609:New Safety Information Available").
  • Valid reason codes: C207612 (Regulatory Agency Request), C207608 (New Regulatory Guidance), C207605 (IRB/IEC Feedback), C207609 (New Safety Information), C207606 (Manufacturing Change), C207602 (IMP Addition), C207601 (Change In Strategy), C207600 (Change In Standard Of Care), C207607 (New Data Available), C207604 (Investigator/Site Feedback), C207611 (Recruitment Difficulty), C207603 (Inconsistency/Error In Protocol), C207610 (Protocol Design Error), C17649 (Other), C48660 (Not Applicable).
  • enrollment is optional. The value is converted to integer internally.
  • scope is optional. Items in unknown are resolved to country or region codes via ISO 3166 lookup. Empty strings in unknown are skipped.
  • Each section value in changes uses "NUMBER, TITLE" format (e.g. "1.5, Safety Considerations") and is matched against the document sections.
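
The two string formats used in this section can be split as sketched below. split_reason and split_section_reference are hypothetical helpers; splitting on the first colon or comma is an assumption about how such values might be parsed.

```python
from typing import Tuple


def split_reason(reason: str) -> Tuple[str, str]:
    """Split "CODE:DECODE" at the first colon."""
    code, _, decode = reason.partition(":")
    return code.strip(), decode.strip()


def split_section_reference(reference: str) -> Tuple[str, str]:
    """Split "NUMBER, TITLE" at the first comma."""
    number, _, title = reference.partition(",")
    return number.strip(), title.strip()


reason = split_reason("C207609:New Safety Information Available")
section = split_section_reference("1.5, Safety Considerations")
```

Splitting at the first separator only means a title that itself contains a comma stays intact.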

study_design

Study design structure and trial phase.

{
  "label": "string",
  "rationale": "string",
  "trial_phase": "string"
}

Notes:

  • All fields are required.
  • Valid trial_phase values: 0, PRE-CLINICAL, 1, I, 1-2, 1/2, 1/2/3, 1/3, 1A, IA, 1B, IB, 2, II, 2-3, II-III, 2A, IIA, 2B, IIB, 3, III, 3A, IIIA, 3B, IIIB, 4, IV, 5, V, 2/3/4. Prefixes PHASE or TRIAL are automatically stripped.
  • Default intervention model is Parallel Study (CDISC code C82639).
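
The prefix stripping can be illustrated as follows. normalise_phase is a hypothetical helper; upper-casing the input before matching is an assumption, since the notes do not say whether phase values are case-sensitive.

```python
import re
from typing import Optional

VALID_PHASES = {
    "0", "PRE-CLINICAL", "1", "I", "1-2", "1/2", "1/2/3", "1/3", "1A", "IA",
    "1B", "IB", "2", "II", "2-3", "II-III", "2A", "IIA", "2B", "IIB", "3",
    "III", "3A", "IIIA", "3B", "IIIB", "4", "IV", "5", "V", "2/3/4",
}


def normalise_phase(text: str) -> Optional[str]:
    """Strip a leading PHASE or TRIAL and return the value if it is a listed phase."""
    cleaned = text.strip().upper()  # assumption: matching ignores case
    cleaned = re.sub(r"^(PHASE|TRIAL)\s*", "", cleaned).strip()
    return cleaned if cleaned in VALID_PHASES else None


phase = normalise_phase("Phase II")
```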

soa (Schedule of Activities)

Timeline data including epochs, visits, timepoints, activities, and conditions. This entire section is optional.

{
  "epochs": {
    "items": [
      { "text": "string" }
    ]
  },
  "visits": {
    "items": [
      {
        "text": "string",
        "references": ["string"]
      }
    ]
  },
  "timepoints": {
    "items": [
      {
        "index": "string | integer",
        "text": "string",
        "value": "string | integer",
        "unit": "string"
      }
    ]
  },
  "windows": {
    "items": [
      {
        "before": integer,
        "after": integer,
        "unit": "string"
      }
    ]
  },
  "activities": {
    "items": [
      {
        "name": "string",
        "visits": [
          {
            "index": integer,
            "references": ["string"]
          }
        ],
        "children": [
          {
            "name": "string",
            "visits": [
              {
                "index": integer,
                "references": ["string"]
              }
            ],
            "actions": {
              "bcs": ["string"]
            }
          }
        ],
        "actions": {
          "bcs": ["string"]
        }
      }
    ]
  },
  "conditions": {
    "items": [
      {
        "reference": "string",
        "text": "string"
      }
    ]
  }
}

Notes:

  • Epochs, visits, and timepoints arrays must be parallel (same length, aligned by index).
  • windows must also be parallel with timepoints.
  • A negative timepoint value places the timepoint before the reference anchor. The first non-negative value determines the anchor point.
  • references on visits and activities are condition keys that link to entries in the conditions array.
  • children are sub-activities nested under a parent activity.
  • actions.bcs lists Biomedical Concept names. Known concepts are resolved from the CDISC BC library; unknown names create surrogate BiomedicalConcept objects.
  • Supported time units: years/yrs/yr, months/mths/mth, weeks/wks/wk, days/dys/dy, hours/hrs/hr, minutes/mins/min, seconds/secs/sec (case-insensitive).
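
Three of the rules above (parallel arrays, the anchor point, and unit aliases) can be sketched with plain Python. The helper names are hypothetical, and using the plural form as the canonical unit is a choice made for the example only.

```python
from typing import Any, Dict, List, Optional

UNIT_ALIASES = {
    "years": "years", "yrs": "years", "yr": "years",
    "months": "months", "mths": "months", "mth": "months",
    "weeks": "weeks", "wks": "weeks", "wk": "weeks",
    "days": "days", "dys": "days", "dy": "days",
    "hours": "hours", "hrs": "hours", "hr": "hours",
    "minutes": "minutes", "mins": "minutes", "min": "minutes",
    "seconds": "seconds", "secs": "seconds", "sec": "seconds",
}


def canonical_unit(unit: str) -> Optional[str]:
    """Map a supported alias to one spelling, ignoring case; None if unsupported."""
    return UNIT_ALIASES.get(unit.strip().lower())


def anchor_index(values: List[int]) -> Optional[int]:
    """Return the position of the first non-negative timepoint value."""
    for index, value in enumerate(values):
        if value >= 0:
            return index
    return None


def arrays_parallel(soa: Dict[str, Any]) -> bool:
    """Check that the epochs, visits, timepoints and windows item lists share a length."""
    keys = ("epochs", "visits", "timepoints", "windows")
    lengths = {len(soa[key]["items"]) for key in keys if key in soa}
    return len(lengths) <= 1


anchor = anchor_index([-14, -7, 0, 7])
```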

study

Core study information and metadata.

{
  "name": {
    "identifier": "string",
    "acronym": "string",
    "compound": "string"
  },
  "label": "string",
  "version": "string",
  "rationale": "string",
  "description": "string",
  "sponsor_approval_date": "string",
  "confidentiality": "string",
  "original_protocol": "string | boolean"
}

Notes:

  • name is required. At least one of identifier, acronym, or compound must be non-empty. Priority order: identifier > acronym > compound. The name is auto-generated (uppercased, non-alphanumeric characters removed).
  • version and rationale are required.
  • label is optional; used as fallback if name generation produces an empty string.
  • description, sponsor_approval_date, confidentiality, and original_protocol are all optional.
  • original_protocol is converted to boolean: "true", "1", "yes", "y" map to true (case-insensitive).
  • sponsor_approval_date should be in ISO format (e.g. 2024-01-15).
  • When present, confidentiality, original_protocol, compound_codes, compound_names, sponsor_signatory, and medical_expert are stored as extension attributes on the study version.
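
The name generation and the boolean conversion described in these notes might look like the sketch below. study_name and to_bool are hypothetical helpers. Restricting the name to ASCII letters and digits, falling back to label rather than to the next name field, and mapping every unlisted value to false are all assumptions.

```python
import re
from typing import Any, Dict


def study_name(name: Dict[str, str], label: str = "") -> str:
    """Build a name from the first non-empty field in priority order."""
    for key in ("identifier", "acronym", "compound"):
        value = name.get(key) or ""
        if value:
            generated = re.sub(r"[^A-Z0-9]", "", value.upper())
            return generated or label  # assumed fallback when nothing survives
    return label


def to_bool(value: Any) -> bool:
    """Treat "true", "1", "yes" and "y" (any case) as true; assume false otherwise."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "1", "yes", "y"}


generated_name = study_name({"identifier": "abc-123", "acronym": "XYZ"})
```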

Converting Studies

converter = USDM4().convert()
# Transform data structures as needed

Expanding Timelines

expander = USDM4().expander(wrapper)
# Process schedule timeline expansion

API Classes

Domain model classes are organised by area:

Domain            Classes
Study Structure   Study, StudyVersion, StudyDesign, StudyArm, StudyEpoch, StudyElement
Interventions     StudyIntervention, Activity, Administration, Procedure, Encounter
Population        StudyDesignPopulation, AnalysisPopulation, EligibilityCriterion, SubjectEnrollment
Documents         StudyDefinitionDocument, StudyDefinitionDocumentVersion, Amendment
Coding            Code, AliasCode, BiomedicalConcept, Objective, Endpoint
Timelines         ScheduleTimeline, ScheduledActivityInstance, ScheduledDecisionInstance
Organization      StudyIdentifier, Organization, StudySite

Development

Running Tests

pytest

Tests require 100% code coverage.

Code Formatting

ruff format
ruff check

Building the Package

python3 -m build --sdist --wheel

Publishing

twine upload dist/*

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
