
Python port of TrialEmulation R package for causal analysis of observational time-to-event data

Project description

TrialEmulation (Python)


Python port of the R TrialEmulation package for causal inference in observational time-to-event data.

Overview

The target trial emulation framework provides a principled approach to causal inference from observational data by explicitly specifying the hypothetical randomized trial (the "target trial") that would answer the research question of interest. This package implements methods to emulate such target trials using observational data in the person-time format, commonly found in electronic health records and administrative databases.

The core methodology involves expanding longitudinal data into a sequence of "nested" trials: each person contributes to one emulated trial for every period in which they meet the eligibility criteria, with follow-up counted from that period onward. Marginal structural models (MSMs) are then used to estimate treatment effects while accounting for time-varying confounding through inverse probability weighting.
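To make the expansion step concrete, here is a minimal pandas sketch of the sequence-of-trials idea. It is an illustration under simplifying assumptions, not the package's implementation: the function name `expand_trials` and the output columns `trial_period` and `followup_time` are hypothetical, and it ignores censoring, weights, and follow-up limits.

```python
import pandas as pd


def expand_trials(data: pd.DataFrame) -> pd.DataFrame:
    """Simplified sequence-of-trials expansion (illustrative only)."""
    cols = ["id", "trial_period", "followup_time",
            "assigned_treatment", "treatment", "outcome"]
    trials = []
    # Every eligible person-period is the baseline of one emulated trial.
    baselines = data.loc[data["eligible"] == 1, ["id", "period", "treatment"]]
    for row in baselines.itertuples(index=False):
        # Follow-up consists of that person's rows from the baseline period onward.
        follow = data[(data["id"] == row.id) & (data["period"] >= row.period)].copy()
        follow["trial_period"] = row.period
        follow["followup_time"] = follow["period"] - row.period
        # Treatment observed at baseline defines the assigned arm.
        follow["assigned_treatment"] = row.treatment
        trials.append(follow)
    if not trials:
        return pd.DataFrame(columns=cols)
    return pd.concat(trials, ignore_index=True)[cols]


# Toy data: person 1 is eligible in periods 0 and 1, person 2 only in period 0.
toy = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "period":    [0, 1, 2, 0, 1],
    "treatment": [0, 1, 1, 0, 0],
    "eligible":  [1, 1, 0, 1, 0],
    "outcome":   [0, 0, 0, 0, 1],
})
expanded = expand_trials(toy)
print(expanded)
```

Person 1 appears in two emulated trials (baselines at periods 0 and 1), which is why the expanded data can be much larger than the input.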

This Python implementation provides the same functionality as the R TrialEmulation package, with a focus on performance through C++ extensions for computationally intensive operations, comprehensive type hints for better IDE support, and integration with the scientific Python ecosystem (pandas, NumPy, statsmodels).

Key Features

  • Target Trial Emulation: Expand observational person-time data into a sequence of emulated trials
  • Multiple Estimands: Support for intention-to-treat (ITT), per-protocol (PP), and as-treated analyses
  • Inverse Probability Weighting (IPW):
    • Treatment switching weights for per-protocol and as-treated estimands
    • Censoring weights to account for informative censoring
    • Flexible covariate specification for numerator and denominator models
  • Marginal Structural Models: Fit weighted pooled logistic regression for time-to-event outcomes
  • Robust Variance Estimation: Cluster-robust standard errors using sandwich estimators
  • Case-Control Sampling: Efficient sampling of controls for large datasets
  • Performance Optimized: C++ extensions (pybind11) for computationally intensive operations
  • Type Safety: Comprehensive type hints throughout the codebase
  • Flexible Interface: Support for R-style formulas and Python-style specifications
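As a rough illustration of how numerator and denominator models combine into a weight, the sketch below computes stabilized inverse probability weights as a cumulative product within each person. It assumes the per-period probabilities of the observed treatment have already been predicted by the two models; the column names `p_num` and `p_den` and the function `stabilized_weights` are hypothetical and are not part of the package API.

```python
import pandas as pd


def stabilized_weights(df: pd.DataFrame,
                       num_col: str = "p_num",
                       den_col: str = "p_den") -> pd.Series:
    """Cumulative product of numerator/denominator probabilities per person."""
    ordered = df.sort_values(["id", "period"])
    ratio = ordered[num_col] / ordered[den_col]
    # The weight at period t is the product of the ratios up to and including t.
    return ratio.groupby(ordered["id"]).cumprod()


# Hypothetical predicted probabilities of the treatment actually received.
probs = pd.DataFrame({
    "id":     [1, 1, 2],
    "period": [0, 1, 0],
    "p_num":  [0.5, 0.5, 0.5],   # numerator model (e.g. baseline covariates only)
    "p_den":  [0.25, 0.5, 1.0],  # denominator model (adds time-varying covariates)
})
weights = stabilized_weights(probs)
print(weights.tolist())
```

Censoring weights follow the same pattern, with the probability of remaining uncensored in place of the probability of the observed treatment.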

Installation

Note: This package is not yet published to PyPI. For now, install it from source.

From Source (Current Method)

git clone https://github.com/csainsbury/Trial-Emulation-R2P.git
cd Trial-Emulation-R2P
pip install -e .

Development Installation

To install with development dependencies (testing, linting):

pip install -e ".[dev]"

Requirements

  • Python 3.10 or higher
  • NumPy >= 1.20.0
  • pandas >= 1.3.0
  • statsmodels >= 0.13.0
  • scipy >= 1.7.0
  • pybind11 >= 2.10.0 (for C++ extensions)

Quick Start

Here's a minimal example demonstrating the core workflow:

import trial_emulation as te
import pandas as pd

# Load your data in long (person-time) format
# Required columns: id, period, treatment, outcome, eligible
# This example also uses: age, sex, comorbidities, censored
data = pd.read_csv("your_data.csv")

# Step 1: Prepare data for trial emulation
# This expands the data into a sequence of emulated trials and calculates weights
prep = te.data_preparation(
    data=data,
    id="id",                    # Patient identifier
    period="period",            # Time period
    treatment="treatment",      # Treatment indicator (0/1)
    outcome="outcome",          # Outcome event indicator (0/1)
    eligible="eligible",        # Eligibility indicator (0/1)
    estimand_type="ITT",        # Intention-to-treat analysis
    outcome_cov=["age", "sex", "comorbidities"],  # Covariates for the outcome model
    use_censor_weights=True,    # Use censoring weights
    cense="censored",           # Censoring indicator
)

# Step 2: Fit marginal structural model
# This fits a weighted pooled logistic regression with robust standard errors
msm = te.trial_msm(
    data=prep,                   # Prepared data from step 1
    outcome_cov=["age", "sex"],  # Covariates to adjust for
    estimand_type="ITT",
)

# Step 3: View results with robust standard errors
print(msm.robust["summary"])

# Extract treatment effect estimate
treatment_effect = msm.model.params["assigned_treatment"]
robust_se = msm.robust["bse"]["assigned_treatment"]
print(f"Treatment effect: {treatment_effect:.3f} (SE: {robust_se:.3f})")

Understanding the Results

The MSM provides:

  • Coefficients: Log odds ratios for the effect of treatment on the outcome
  • Robust Standard Errors: Account for clustering by patient ID
  • Confidence Intervals: Based on robust standard errors
  • Model Summary: Full regression output with all covariates

For ITT analyses, the coefficient for assigned_treatment represents the effect of being assigned to treatment at trial baseline, regardless of subsequent adherence.
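Because the coefficients are on the log-odds scale, they are usually exponentiated for reporting. The sketch below converts a coefficient and its robust standard error into an odds ratio with a Wald-type 95% confidence interval. The helper `odds_ratio_ci` is hypothetical, and the input numbers are placeholders rather than output from this package.

```python
import math


def odds_ratio_ci(log_or: float, robust_se: float, z: float = 1.96):
    """Return (odds ratio, lower bound, upper bound) from a log odds ratio."""
    estimate = math.exp(log_or)
    lower = math.exp(log_or - z * robust_se)
    upper = math.exp(log_or + z * robust_se)
    return estimate, lower, upper


# Placeholder values standing in for the coefficient and robust SE
# of assigned_treatment.
or_est, or_low, or_high = odds_ratio_ci(log_or=-0.5, robust_se=0.2)
print(f"OR: {or_est:.2f} (95% CI: {or_low:.2f} to {or_high:.2f})")
```

When the outcome is rare within each period, the odds ratio from a pooled logistic model is commonly interpreted as an approximation to the hazard ratio.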

Data Format

Your input data should be in long (person-time) format with one row per person-period:

id  period  treatment  outcome  eligible  age  ...
1   0       0          0        1         45   ...
1   1       1          0        1         45   ...
1   2       1          0        1         45   ...
2   0       0          0        1         52   ...
2   1       0          1        1         52   ...

Key requirements:

  • Unique identifier (id): Patient or unit identifier
  • Time period (period): Sequential integer starting at 0
  • Treatment (treatment): Binary indicator (0 = untreated, 1 = treated)
  • Outcome (outcome): Binary indicator for the event of interest
  • Eligibility (eligible): Indicator for whether the person is eligible at that time
  • Covariates: Time-varying or baseline characteristics
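The requirements above can be checked before running an analysis. The following is a minimal sketch of such a check using pandas; the function `check_person_time` is hypothetical (it is not part of the package), and it assumes the default column names used in this README.

```python
import pandas as pd

REQUIRED = ["id", "period", "treatment", "outcome", "eligible"]


def check_person_time(data: pd.DataFrame) -> list:
    """Return a list of format problems; an empty list means none were found."""
    missing = [c for c in REQUIRED if c not in data.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    # One row per person-period.
    if data.duplicated(["id", "period"]).any():
        problems.append("duplicate (id, period) rows")
    # Indicator columns must be coded 0/1.
    for col in ["treatment", "outcome", "eligible"]:
        if not data[col].isin([0, 1]).all():
            problems.append(f"{col} is not binary (0/1)")
    # Periods should increase by exactly 1 within each person.
    ordered = data.sort_values(["id", "period"])
    gaps = ordered.groupby("id")["period"].diff().dropna()
    if (gaps != 1).any():
        problems.append("periods are not consecutive within id")
    return problems


# The example rows shown in the table above.
example = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "period":    [0, 1, 2, 0, 1],
    "treatment": [0, 1, 1, 0, 0],
    "outcome":   [0, 0, 0, 0, 1],
    "eligible":  [1, 1, 1, 1, 1],
    "age":       [45, 45, 45, 52, 52],
})
problems = check_person_time(example)
print(problems)
```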

Documentation

Full documentation is available at https://trial-emulation.readthedocs.io

Additional Resources

Examples

See the examples/ directory for complete working examples:

  • basic_usage.py - Simple ITT analysis workflow
  • itt_analysis.py - Detailed intention-to-treat example
  • per_protocol_analysis.py - Per-protocol analysis with artificial censoring

Citation

If you use this package in your research, please cite:

@software{trial_emulation_python,
  title = {TrialEmulation: Target Trial Emulation for Causal Inference},
  author = {Vermeulen, Xander and Sainsbury, Chris},
  year = {2026},
  url = {https://github.com/csainsbury/Trial-Emulation-R2P},
  version = {0.0.4.0}
}

For the methodology, please also cite the original R package and key papers:

  • Danaei G, García Rodríguez LA, Cantero OF, Logan RW, Hernán MA. Observational data for comparative effectiveness research: An emulation of randomised trials of statins and primary prevention of coronary heart disease. Statistical Methods in Medical Research. 2013;22(1):70-96.

  • Hernán MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. American Journal of Epidemiology. 2016;183(8):758-764.

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to:

  • Report bugs
  • Suggest features
  • Submit pull requests
  • Set up a development environment

For major changes, please open an issue first to discuss what you would like to change.

Development Status

This package is in alpha (v0.0.x series). The API may change as we gather user feedback. The package is functional and tested internally, but it should be considered experimental and is not yet recommended for production use.

Current priorities:

  • Expanding test coverage
  • Adding more examples and tutorials
  • Performance optimization
  • Validation against R package results

Testing & Validation

Test Data Sources

The package test suite uses validated data from the original R TrialEmulation package to ensure compatibility and correctness. Specifically, we use the trial_example dataset which contains:

  • 48,400 observations across 503 patients
  • Longitudinal structure with realistic treatment patterns
  • Known to work correctly with target trial emulation methods
  • Same data used in the R package documentation and vignettes

This approach ensures that our Python implementation produces results consistent with the established R implementation.

Test Coverage

Current test status (as of v0.0.4.0):

  • 39 tests passing + 1 skipped (Python 3.14 compatibility) - all core functionality covered
  • 27% code coverage across main modules (focused on core workflow validation)
  • Integration tests pass with real R package data
  • Core workflow validated: data_preparation() → trial_msm() → results
  • Proven to work on real epidemiological data (1.9M+ observations after expansion)
  • Multi-platform testing via GitHub Actions (Linux, macOS, Windows)

Key tested functionality:

  • Data preparation and trial expansion
  • Multiple estimand types (ITT, PP, As-Treated)
  • Inverse probability weighting (treatment and censoring)
  • Marginal structural model fitting
  • Robust variance estimation
  • Various analysis options (weight truncation, period filtering, etc.)
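Weight truncation, one of the analysis options listed above, limits the influence of extreme inverse probability weights. The sketch below shows the general technique with NumPy by clipping weights at chosen quantiles. The function `truncate_weights` and its parameter names are hypothetical and may not match how the package exposes this option.

```python
import numpy as np


def truncate_weights(weights, lower_q: float = 0.01, upper_q: float = 0.99):
    """Clip weights to the given lower and upper quantiles."""
    w = np.asarray(weights, dtype=float)
    low, high = np.quantile(w, [lower_q, upper_q])
    return np.clip(w, low, high)


# Toy weights with one very small and one very large value.
raw = [0.1, 1, 1, 1, 1, 1, 1, 1, 1, 50]
trimmed = truncate_weights(raw)
print(trimmed.min(), trimmed.max())
```

Truncation trades a small amount of bias for lower variance, so the chosen quantiles are worth reporting alongside the results.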

Running Tests

To run the test suite:

# Install development dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/

# Run with coverage report
pytest tests/ --cov=trial_emulation --cov-report=html

# Run only integration tests
pytest tests/ -m integration

The test suite includes:

  • Unit tests: Individual function and module testing
  • Integration tests: End-to-end workflow validation with real data
  • Edge case tests: Handling of unusual inputs and boundary conditions

Validation Against R Package

The Python implementation has been validated against the R package using:

  1. Same example datasets (trial_example)
  2. Comparison of key outputs (expanded data structure, weight calculations)
  3. Integration tests that verify the complete workflow produces reasonable results

While minor numerical differences may exist due to differences in optimization algorithms and random number generation, the overall methodology and results are consistent with the R implementation.
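One way to make "minor numerical differences" precise is to compare estimates from the two implementations within a tolerance. The sketch below does this with the standard library. The function `compare_estimates`, the tolerances, and the numbers are all illustrative assumptions; they are not results from either package.

```python
import math


def compare_estimates(python_est: dict, r_est: dict,
                      rel_tol: float = 1e-4, abs_tol: float = 1e-6) -> dict:
    """Return the coefficients that differ beyond tolerance or are missing."""
    mismatches = {}
    for name in sorted(set(python_est) | set(r_est)):
        p, r = python_est.get(name), r_est.get(name)
        if p is None or r is None or not math.isclose(
                p, r, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches[name] = (p, r)
    return mismatches


# Made-up coefficient values used only to demonstrate the comparison.
py_coefs = {"assigned_treatment": -0.50001, "age": 0.0200}
r_coefs = {"assigned_treatment": -0.50000, "age": 0.0250}
diff = compare_estimates(py_coefs, r_coefs)
print(diff)
```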

Acknowledgments

This Python implementation is based on the R TrialEmulation package developed by the Causal-LDA team. We thank the original developers for their methodological contributions and open-source implementation.

The target trial emulation framework is built on seminal work by:

  • Miguel Hernán and James Robins (Harvard T.H. Chan School of Public Health)
  • The CAUSALab team

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
