Python port of TrialEmulation R package for causal analysis of observational time-to-event data

TrialEmulation (Python)

Python port of the R TrialEmulation package for causal inference in observational time-to-event data.

Overview

The target trial emulation framework provides a principled approach to causal inference from observational data by explicitly specifying the hypothetical randomized trial (the "target trial") that would answer the research question of interest. This package implements methods to emulate such target trials using observational data in the person-time format, commonly found in electronic health records and administrative databases.

The core methodology involves expanding longitudinal data into a sequence of "nested" trials, where each person contributes to one or more emulated trials depending on the periods in which they are eligible to start treatment. Marginal structural models (MSMs) are then used to estimate treatment effects while accounting for time-varying confounding through inverse probability weighting.
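As a rough illustration of the expansion step, the sketch below builds one emulated trial per eligible person-period using plain pandas. The function name `expand_trials` and the output columns `trial_period` and `followup_time` are assumptions made for this illustration. It is not the package's internal implementation, and it leaves out weights and artificial censoring.

```python
import pandas as pd


def expand_trials(data: pd.DataFrame) -> pd.DataFrame:
    """Expand person-time rows into a sequence of emulated trials (illustrative only)."""
    # Each eligible person-period is the baseline (time zero) of one emulated trial.
    baselines = data.loc[data["eligible"] == 1, ["id", "period", "treatment"]].rename(
        columns={"period": "trial_period", "treatment": "assigned_treatment"}
    )
    # Pair every baseline with all rows of the same person, then keep follow-up rows only.
    expanded = baselines.merge(data[["id", "period", "outcome"]], on="id")
    expanded = expanded[expanded["period"] >= expanded["trial_period"]].copy()
    expanded["followup_time"] = expanded["period"] - expanded["trial_period"]
    cols = ["id", "trial_period", "followup_time", "assigned_treatment", "outcome"]
    return expanded[cols].sort_values(cols[:3]).reset_index(drop=True)


# Toy data invented for this sketch: person 1 is eligible once, person 2 twice.
toy = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "period": [0, 1, 2, 0, 1],
        "treatment": [0, 1, 1, 0, 0],
        "outcome": [0, 0, 0, 0, 1],
        "eligible": [1, 0, 0, 1, 1],
    }
)
trials = expand_trials(toy)
print(trials)
```

Person 2 is eligible in two periods, so their follow-up appears in two emulated trials; this repeated use of the same person is why cluster-robust standard errors are needed later.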

This Python implementation provides the same functionality as the R TrialEmulation package, with a focus on performance through C++ extensions for computationally intensive operations, comprehensive type hints for better IDE support, and integration with the scientific Python ecosystem (pandas, NumPy, statsmodels).

Key Features

Target Trial Emulation: Expand observational person-time data into a sequence of emulated trials
Multiple Estimands: Support for intention-to-treat (ITT), per-protocol (PP), and as-treated analyses
Inverse Probability Weighting (IPW):
- Treatment switching weights for per-protocol and as-treated estimands
- Censoring weights to account for informative censoring
- Flexible covariate specification for numerator and denominator models
Marginal Structural Models: Fit weighted pooled logistic regression for time-to-event outcomes
Robust Variance Estimation: Cluster-robust standard errors using sandwich estimators
Case-Control Sampling: Efficient sampling of controls for large datasets
Performance Optimized: C++ extensions (pybind11) for computationally intensive operations
Type Safety: Comprehensive type hints throughout the codebase
Flexible Interface: Support for R-style formulas and Python-style specifications
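To make the inverse probability weighting feature more concrete, here is a minimal sketch of stabilized weights built from predicted treatment probabilities. It assumes the numerator and denominator models have already been fitted; the columns `p_num` and `p_denom` and the function name `stabilized_weights` are hypothetical, and the package's own weight calculation may differ in detail.

```python
import pandas as pd


def stabilized_weights(df: pd.DataFrame) -> pd.Series:
    """Cumulative stabilized weights per person (illustrative only).

    Expects hypothetical columns `p_num` and `p_denom`: predicted probabilities
    of receiving treatment from the numerator and denominator models.
    """
    df = df.sort_values(["id", "period"])
    treated = df["treatment"] == 1
    # Probability of the treatment actually received in each period, under each model.
    num = df["p_num"].where(treated, 1 - df["p_num"])
    den = df["p_denom"].where(treated, 1 - df["p_denom"])
    # The weight at period t is the product of the per-period ratios up to t.
    return (num / den).groupby(df["id"]).cumprod().rename("weight")


# Toy probabilities invented for this sketch.
toy = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "period": [0, 1, 0, 1],
        "treatment": [1, 1, 0, 1],
        "p_num": [0.5, 0.5, 0.5, 0.5],
        "p_denom": [0.8, 0.5, 0.25, 0.4],
    }
)
toy["weight"] = stabilized_weights(toy)
print(toy)
```

Using a numerator model keeps the weights closer to 1 than unstabilized weights (1 / denominator) would be, which usually reduces variance.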

Installation

Note: This package is not yet published to PyPI. For now, install it from source.

From Source (Current Method)

git clone https://github.com/csainsbury/Trial-Emulation-R2P.git
cd Trial-Emulation-R2P
pip install -e .

Development Installation

To install with development dependencies (testing, linting):

pip install -e ".[dev]"

Requirements

Python 3.10 or higher
NumPy >= 1.20.0
pandas >= 1.3.0
statsmodels >= 0.13.0
scipy >= 1.7.0
pybind11 >= 2.10.0 (for C++ extensions)

Quick Start

Here's a minimal example demonstrating the core workflow:

import trial_emulation as te
import pandas as pd

# Load your data in long (person-time) format
# Required columns: id, period, treatment, outcome, eligible
# This example also uses: age, sex, comorbidities, censored
data = pd.read_csv("your_data.csv")

# Step 1: Prepare data for trial emulation
# This expands the data into a sequence of emulated trials and calculates weights
prep = te.data_preparation(
    data=data,
    id="id",                    # Patient identifier
    period="period",            # Time period
    treatment="treatment",      # Treatment indicator (0/1)
    outcome="outcome",          # Outcome event indicator (0/1)
    eligible="eligible",        # Eligibility indicator (0/1)
    estimand_type="ITT",        # Intention-to-treat analysis
    outcome_cov=["age", "sex", "comorbidities"],  # Covariates for the outcome model
    use_censor_weights=True,    # Use censoring weights
    cense="censored",           # Censoring indicator
)

# Step 2: Fit marginal structural model
# This fits a weighted pooled logistic regression with robust standard errors
msm = te.trial_msm(
    data=prep,                   # Prepared data from step 1
    outcome_cov=["age", "sex"],  # Covariates to adjust for
    estimand_type="ITT",
)

# Step 3: View results with robust standard errors
print(msm.robust["summary"])

# Extract treatment effect estimate
treatment_effect = msm.model.params["assigned_treatment"]
robust_se = msm.robust["bse"]["assigned_treatment"]
print(f"Treatment effect: {treatment_effect:.3f} (SE: {robust_se:.3f})")

Understanding the Results

The MSM provides:

Coefficients: Log odds ratios for the effect of treatment on the outcome
Robust Standard Errors: Account for clustering by patient ID
Confidence Intervals: Based on robust standard errors
Model Summary: Full regression output with all covariates
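For readers who want to see what a weighted pooled logistic regression with cluster-robust standard errors involves, the following is a self-contained NumPy sketch. It is a hand-written illustration under simplifying assumptions (Newton-Raphson fitting, no small-sample correction, synthetic data) and is not the code the package runs; the function name `weighted_logit_cluster_se` is hypothetical.

```python
import numpy as np


def weighted_logit_cluster_se(X, y, w, groups, n_iter=25):
    """Weighted logistic regression with a cluster-robust sandwich variance (illustrative)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                     # weighted score
        hess = X.T @ (X * (w * p * (1 - p))[:, None])  # weighted information
        beta = beta + np.linalg.solve(hess, grad)      # Newton-Raphson step
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hess = X.T @ (X * (w * p * (1 - p))[:, None])
    bread = np.linalg.inv(hess)
    # Sum per-row scores within each cluster (person), then form the "meat".
    scores = X * (w * (y - p))[:, None]
    _, idx = np.unique(groups, return_inverse=True)
    cluster_scores = np.zeros((idx.max() + 1, X.shape[1]))
    np.add.at(cluster_scores, idx, scores)
    meat = cluster_scores.T @ cluster_scores
    cov = bread @ meat @ bread                         # no small-sample correction
    return beta, np.sqrt(np.diag(cov))


# Purely synthetic data: 200 people, 5 follow-up rows each.
rng = np.random.default_rng(0)
ids = np.repeat(np.arange(200), 5)
followup = np.tile(np.arange(5), 200)
treat = np.repeat(rng.integers(0, 2, 200), 5)
X = np.column_stack([np.ones(ids.size), treat, followup]).astype(float)
true_logit = -2.0 - 0.5 * treat + 0.1 * followup
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit))).astype(float)
w = rng.uniform(0.5, 2.0, ids.size)

beta, se = weighted_logit_cluster_se(X, y, w, ids)
print("coefficients (intercept, treatment, follow-up):", beta)
print("cluster-robust SE:", se)
```

Clustering on the person identifier matters because each person contributes many correlated rows, and the weights add further variability that model-based standard errors ignore.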

For ITT analyses, the coefficient for assigned_treatment represents the effect of being assigned to treatment at trial baseline, regardless of subsequent adherence.
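Because the coefficients are on the log-odds scale, it is often easier to report them as odds ratios. The helper below is a small sketch of that conversion with a Wald-type 95% confidence interval; the function name and the input numbers are hypothetical and do not come from any real analysis.

```python
import math


def odds_ratio_ci(log_or: float, se: float, z: float = 1.96) -> tuple[float, float, float]:
    """Convert a log odds ratio and its standard error to an odds ratio with a Wald CI."""
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical numbers, not results from any real analysis.
or_, lower, upper = odds_ratio_ci(log_or=-0.25, se=0.10)
print(f"OR = {or_:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

In the Quick Start example, `treatment_effect` and `robust_se` could be passed as `log_or` and `se`.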

Data Format

Your input data should be in long (person-time) format with one row per person-period:

id	period	treatment	outcome	eligible	age	...
1	0	0	0	1	45	...
1	1	1	0	1	45	...
1	2	1	0	1	45	...
2	0	0	0	1	52	...
2	1	0	1	1	52	...

Key requirements:

Unique identifier (id): Patient or unit identifier
Time period (period): Sequential integer starting at 0
Treatment (treatment): Binary indicator (0 = untreated, 1 = treated)
Outcome (outcome): Binary indicator for the event of interest
Eligibility (eligible): Indicator for whether the person is eligible at that time
Covariates: Time-varying or baseline characteristics
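Before running an analysis, it can help to check these requirements programmatically. The sketch below is one possible pre-flight check written for this README, not a function provided by the package; the name `check_person_time` is hypothetical, and it reads "sequential" as "consecutive within each person", which is an interpretation.

```python
import pandas as pd

REQUIRED = ["id", "period", "treatment", "outcome", "eligible"]


def check_person_time(data: pd.DataFrame) -> list[str]:
    """Return a list of problems found in a long-format data set (empty if none)."""
    missing = [c for c in REQUIRED if c not in data.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    # One row per person-period.
    if data.duplicated(["id", "period"]).any():
        problems.append("duplicate (id, period) rows")
    # Indicator columns must be coded 0/1.
    for col in ["treatment", "outcome", "eligible"]:
        if not data[col].isin([0, 1]).all():
            problems.append(f"{col} is not coded 0/1")
    # Periods should increase by one within each person (no gaps).
    ordered = data.sort_values(["id", "period"])
    steps = ordered.groupby("id")["period"].diff().dropna()
    if not (steps == 1).all():
        problems.append("periods are not consecutive within id")
    return problems


# The example table from this section.
ok = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "period": [0, 1, 2, 0, 1],
        "treatment": [0, 1, 1, 0, 0],
        "outcome": [0, 0, 0, 0, 1],
        "eligible": [1, 1, 1, 1, 1],
        "age": [45, 45, 45, 52, 52],
    }
)
print(check_person_time(ok))

# A deliberately broken copy: an invalid treatment code and a gap in periods.
bad = ok.copy()
bad.loc[0, "treatment"] = 2
bad.loc[1, "period"] = 3
print(check_person_time(bad))
```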

Documentation

Full documentation is available at https://trial-emulation.readthedocs.io

Additional Resources

Examples

See the examples/ directory for complete working examples:

basic_usage.py - Simple ITT analysis workflow
itt_analysis.py - Detailed intention-to-treat example
per_protocol_analysis.py - Per-protocol analysis with artificial censoring

Citation

If you use this package in your research, please cite:

@software{trial_emulation_python,
  title = {TrialEmulation: Target Trial Emulation for Causal Inference},
  author = {Vermeulen, Xander and Sainsbury, Chris},
  year = {2026},
  url = {https://github.com/csainsbury/Trial-Emulation-R2P},
  version = {0.0.4.0}
}

For the methodology, please also cite the original R package and key papers:

Danaei G, García Rodríguez LA, Cantero OF, Logan RW, Hernán MA. Observational data for comparative effectiveness research: An emulation of randomised trials of statins and primary prevention of coronary heart disease. Statistical Methods in Medical Research. 2013;22(1):70-96.
Hernán MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. American Journal of Epidemiology. 2016;183(8):758-764.

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to:

Report bugs
Suggest features
Submit pull requests
Set up a development environment

For major changes, please open an issue first to discuss what you would like to change.

Development Status

This package is in alpha (v0.0.x series), and the API may change as we gather user feedback. The package is functional and tested internally, but it should be considered experimental rather than production-ready.

Current priorities:

Expanding test coverage
Adding more examples and tutorials
Performance optimization
Validation against R package results

Testing & Validation

Test Data Sources

The package test suite uses validated data from the original R TrialEmulation package to ensure compatibility and correctness. Specifically, we use the trial_example dataset which contains:

48,400 observations across 503 patients
Longitudinal structure with realistic treatment patterns
Known to work correctly with target trial emulation methods
Same data used in the R package documentation and vignettes

This approach ensures that our Python implementation produces results consistent with the established R implementation.

Test Coverage

Current test status (as of v0.0.4.0):

39 tests passing and 1 skipped (Python 3.14 compatibility), covering all core functionality
27% code coverage across main modules (focused on core workflow validation)
✓ Integration tests pass with real R package data
✓ Core workflow validated: data_preparation() → trial_msm() → results
✓ Proven to work on real epidemiological data (1.9M+ observations after expansion)
✓ Multi-platform testing via GitHub Actions (Linux, macOS, Windows)

Key tested functionality:

Data preparation and trial expansion
Multiple estimand types (ITT, PP, As-Treated)
Inverse probability weighting (treatment and censoring)
Marginal structural model fitting
Robust variance estimation
Various analysis options (weight truncation, period filtering, etc.)
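Weight truncation, mentioned in the last item, is commonly done by clipping weights at chosen percentiles so that a few extreme values do not dominate the fit. A minimal NumPy sketch follows; the function and parameter names are hypothetical, and the package's own option may be specified differently.

```python
import numpy as np


def truncate_weights(weights, lower_pct=1.0, upper_pct=99.0):
    """Clip weights to the given lower and upper percentiles (illustrative only)."""
    weights = np.asarray(weights, dtype=float)
    lo, hi = np.percentile(weights, [lower_pct, upper_pct])
    return np.clip(weights, lo, hi)


# Toy weights with one very small and one very large value.
w = np.array([0.1, 0.8, 0.9, 1.0, 1.1, 1.2, 25.0])
w_trunc = truncate_weights(w, lower_pct=10, upper_pct=90)
print(w_trunc)
```

Only the two extreme weights change; values already inside the percentile range are left as they are.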

Running Tests

To run the test suite:

# Install development dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/

# Run with coverage report
pytest tests/ --cov=trial_emulation --cov-report=html

# Run only integration tests
pytest tests/ -m integration

The test suite includes:

Unit tests: Individual function and module testing
Integration tests: End-to-end workflow validation with real data
Edge case tests: Handling of unusual inputs and boundary conditions

Validation Against R Package

The Python implementation has been validated against the R package using:

Same example datasets (trial_example)
Comparison of key outputs (expanded data structure, weight calculations)
Integration tests that verify the complete workflow produces reasonable results

While minor numerical differences may exist due to differences in optimization algorithms and random number generation, the overall methodology and results are consistent with the R implementation.
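One simple way to express "consistent up to minor numerical differences" is a tolerance-based comparison of coefficients. The sketch below uses made-up coefficient vectors and tolerances chosen only for illustration; neither the numbers nor the tolerances are taken from the package's test suite.

```python
import numpy as np

# Hypothetical coefficient vectors; in practice these would come from the two implementations.
coef_python = np.array([-3.2104, -0.2713, 0.0451])
coef_r = np.array([-3.2101, -0.2715, 0.0450])

# Absolute and relative tolerances are judgment calls, not values from the package's tests.
abs_diff = np.abs(coef_python - coef_r)
print("largest absolute difference:", abs_diff.max())
assert np.allclose(coef_python, coef_r, rtol=1e-2, atol=1e-3)
```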

Acknowledgments

This Python implementation is based on the R TrialEmulation package developed by the Causal-LDA team. We thank the original developers for their methodological contributions and open-source implementation.

The target trial emulation framework is built on seminal work by:

Miguel Hernán and James Robins (Harvard T.H. Chan School of Public Health)
The CAUSALab team

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Support

Issues: Report bugs or request features via GitHub Issues
Discussions: Ask questions or share ideas in GitHub Discussions

This version

0.0.4.0

May 14, 2026

Download files

Source Distribution

trial_emulation-0.0.4.0.tar.gz (46.2 kB view details)

Uploaded May 14, 2026 Source

File details

Details for the file trial_emulation-0.0.4.0.tar.gz.

File metadata

Download URL: trial_emulation-0.0.4.0.tar.gz
Upload date: May 14, 2026
Size: 46.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.14

File hashes

Hashes for trial_emulation-0.0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`39656c79763f61df4b557ca4f264c27f5797d272bd631b4e6f7a3e639b8be978`
MD5	`4eb962eed45838c02ec1743cc8353d46`
BLAKE2b-256	`d1e63b57bd9170b9cc436f9f8ade7f319e5419b73cfddd3fe716d205d40a3d3e`
