Python toolkit for competing risks: forest (RSF) + (penalized) Fine-Gray subdistribution regression + Aalen-Johansen cumulative incidence + Gray's K-sample test + cause-specific Cox. Scales to n=10⁶ in ~1 min, 10–22× faster than randomForestSRC on real EHR data, scikit-learn-compatible.

comprisk

comprisk is a Python toolkit for competing risks. It ships a scalable, scikit-learn-compatible competing-risks random survival forest plus the canonical regression and non-parametric methods clinical researchers actually need: Fine-Gray subdistribution-hazard regression, a stand-alone Aalen-Johansen cumulative-incidence estimator with cmprsk-parity variance, Gray's K-sample test, and cause-specific Cox PH (see Roadmap). It is designed to remove the Python → R workflow split that applied researchers currently endure for competing-risks survival analysis.

Status: alpha. API and internals may change before v1.0. Renamed from crforest in 0.3.1: install with pip install comprisk and import with from comprisk import CompetingRiskForest.

Highlights

  • The four canonical CR methods, native Python. FineGrayRegression matches R cmprsk::crr() β̂ to floating-point noise (max |Δβ| = 1.4e-15 on three reference datasets); robust_se=True returns the Geskus cluster sandwich agreeing with cmprsk's IPCW-corrected SE to ~3 digits. CumulativeIncidence reproduces cmprsk::cuminc() to 1e-9 across CIF and variance. gray_test reproduces cmprsk::cuminc()$Tests to 1e-14. CauseSpecificCox matches survival::coxph(method="breslow") to 1e-9.
  • Only native-Python competing-risks RSF. Cause-specific log-rank splitting + composite CR log-rank, Aalen-Johansen CIF, Nelson-Aalen CHF, Wolbers + Uno IPCW concordance, OOB Breiman VIMP, Ishwaran minimal-depth variable selection, exact TreeSHAP.
  • CR-aware model evaluation. score_cr reports IPCW time-dependent AUC and Brier score under competing risks, plus integrated AUC / Brier (iAUC, IBS) with bootstrap CIs; calibration_cr returns tidy quantile-decile calibration data with per-bin Wilson intervals. Both take a dict of named candidate models and are one-call replacements for the CR-mode riskRegression::Score() / plotCalibration() blocks.
  • 10–22× faster than randomForestSRC on real EHR data (CHF 14–22×, SEER 11.6×; full tables in docs/benchmarks.md), with C ≈ 0.85 on both libraries. ~95× faster than rfSRC built without OpenMP (default R-on-macOS).
  • Order-of-magnitude faster than scikit-survival (16.6× at n = 5k, 544× at n = 50k), without disabling CIF/CHF outputs.
  • Bit-identical to randomForestSRC with equivalence="rfsrc" — reproduces the per-tree mtry/nsplit RNG stream for paper-grade reproducibility, sensitivity checks, and rfSRC-baseline migrations.

comprisk vs alternatives

comprisk randomForestSRC scikit-survival
Language Python R Python
Native competing risks ✗ (single-event only)
Aalen–Johansen CIF output n/a
Cumulative hazard at scale ✗¹
OOB permutation VIMP
Bit-identical reproducibility mode ✓ (equivalence="rfsrc") n/a
Scales to n = 10⁶ ✓ (63 s on i7) memory-bound past n ≈ 500 000 on consumer hardware ✗¹ / OOM²
Default parallelism ✓ (n_jobs=-1) OpenMP (build-dependent; macOS Apple clang lacks it)
GPU preview ✓ (CUDA 12)

¹ sksurv RandomSurvivalForest(low_memory=True) is the only mode that scales beyond ~10k samples, but it disables predict_cumulative_hazard_function and predict_survival_function (raises NotImplementedError).

² sksurv low_memory=False exposes CHF / survival outputs but stores per-leaf full CHF arrays; peak RSS reaches 16.8 GB at n = 5k on synthetic, OOMs (> 21.5 GB) at n = 10k on a 24 GB host.

Install

pip install comprisk          # or:  uv add comprisk
pip install "comprisk[gpu]"   # or:  uv add 'comprisk[gpu]'

Requires Python ≥ 3.10. Core dependencies: numpy, scipy, pandas, joblib, numba, scikit-learn. GPU extra adds cupy + CUDA 12 runtime libs (preview; faster only at low feature count today, full rewrite scheduled for v1.1).

Quickstart

import numpy as np
from comprisk import CompetingRiskForest

# Toy competing-risks data: 500 subjects, 6 features, 2 causes (+ censoring).
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 6))
time = rng.exponential(2.0, size=n) + 0.1
event = rng.choice([0, 1, 2], size=n, p=[0.4, 0.4, 0.2])  # 0 = censored

# Fit. Defaults: n_estimators=100, max_features="sqrt", logrankCR, n_jobs=-1.
forest = CompetingRiskForest(n_estimators=100, random_state=42).fit(X, time, event)

# Aalen-Johansen cumulative incidence over the forest's chosen time grid.
cif = forest.predict_cif(X[:5])                       # (5, n_causes, n_times)

# Cause-specific Wolbers concordance.
print("C-index, cause 1:", forest.score(X, time, event, cause=1))
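To make the cause-specific concordance above concrete, here is a minimal sketch of an unweighted Wolbers-style C-index. It is an illustration under stated assumptions, not comprisk's code: it omits the IPCW weighting the toolkit also offers, and wolbers_concordance is a hypothetical name. A subject with the event of interest is compared against everyone still event-free at that time and against everyone who already had a competing event.

```python
import numpy as np


def wolbers_concordance(risk, time, event, cause=1):
    """Unweighted concordance for one cause under competing risks.

    risk  : higher value means higher predicted risk of `cause`
    event : 0 = censored, otherwise the cause label
    """
    risk = np.asarray(risk, float)
    time = np.asarray(time, float)
    event = np.asarray(event, int)

    concordant, comparable = 0.0, 0
    for i in np.flatnonzero(event == cause):
        # Partners: still event-free after time[i] ...
        later = time > time[i]
        # ... or already removed by a competing event (never able to have `cause`).
        competing = (time <= time[i]) & (event != cause) & (event != 0)
        partners = later | competing
        comparable += int(partners.sum())
        concordant += np.sum(risk[i] > risk[partners])
        concordant += 0.5 * np.sum(risk[i] == risk[partners])  # ties count half
    return concordant / comparable if comparable else float("nan")
```

A value of 1.0 means every comparable pair is ranked correctly, and 0.5 corresponds to random ranking.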

Explainability and feature selection

# OOB permutation importance (Uno IPCW-scored).
vimp = forest.compute_importance(random_state=42)

# Ishwaran minimal-depth variable selection.
selected = forest.minimal_depth().query("selected")["feature"].tolist()

# Exact TreeSHAP attributions (Lundberg 2018, Algorithm 2).
shap, base = forest.shap_values(X[:10])               # (n, p, n_times, n_causes)

examples/shap_explain.py is an interactive marimo notebook (a plain .py file) that walks through SHAP additivity, per-cause global importance, and per-subject attribution over the time grid, with sliders for the forest size and the subject under inspection. Run it with uv run --extra examples marimo edit examples/shap_explain.py (or uvx marimo edit --sandbox examples/shap_explain.py to use the notebook's own PEP 723 dependency header).

Fine-Gray, Aalen-Johansen, Gray's test, and cause-specific Cox

from comprisk import (
    FineGrayRegression, CumulativeIncidence, CauseSpecificCox, gray_test,
)

# Fine-Gray subdistribution-hazard regression — matches R cmprsk::crr()
# β̂ to floating-point noise. robust_se=True gives the Geskus cluster
# sandwich (matches cmprsk's IPCW-corrected SE to ~3 digits).
fg = FineGrayRegression(cause=1, robust_se=True).fit(X, time=time, event=event)
print(fg.coef_, fg.se_)
F = fg.predict_cumulative_incidence(X[:5])            # (5, n_event_times)

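# Non-parametric Aalen-Johansen CIF (cmprsk::cuminc parity, optional groups).
# group_var is not defined in the Quickstart; a two-arm label is assumed here.
group_var = rng.integers(0, 2, size=n)
ci = CumulativeIncidence().fit(time=time, event=event, group=group_var)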
est, var = ci.timepoints([1.0, 5.0, 10.0])            # (n_curves, 3)

# Gray's K-sample test for CIFs — matches cmprsk::cuminc()$Tests to 1e-14.
result = gray_test(time, event, group_var, cause=1)
print(result.stat, result.pvalue, result.df)

# Cause-specific Cox PH — competing events censored at t_j.
# Matches survival::coxph(method="breslow") to 1e-9.
cs = CauseSpecificCox(cause=1).fit(X, time=time, event=event)
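For readers new to the Aalen-Johansen estimator, the following is a minimal NumPy sketch of the ungrouped cumulative incidence function. It is a textbook illustration rather than the CumulativeIncidence implementation: it returns no variance, handles no groups, and aalen_johansen_cif is a hypothetical name.

```python
import numpy as np


def aalen_johansen_cif(time, event, cause=1):
    """Cumulative incidence of `cause` at each distinct event time.

    event : 0 = censored, otherwise the cause label.
    """
    time = np.asarray(time, float)
    event = np.asarray(event, int)

    event_times = np.unique(time[event > 0])
    surv = 1.0  # all-cause Kaplan-Meier survival just before the current time
    cif = 0.0
    curve = []
    for t in event_times:
        at_risk = np.sum(time >= t)
        d_cause = np.sum((time == t) & (event == cause))
        d_any = np.sum((time == t) & (event > 0))
        cif += surv * d_cause / at_risk  # event-free so far, then fails from `cause`
        surv *= 1.0 - d_any / at_risk    # update all-cause survival
        curve.append(cif)
    return event_times, np.array(curve)
```

At any time point, the CIFs of all causes plus the all-cause survival sum to one, which is a convenient sanity check on a fitted estimate.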

Penalized variable selection for the Fine-Gray model (LASSO / ridge / elastic-net / MCP / SCAD) — no equivalent elsewhere in Python:

from comprisk import PenalizedFineGrayRegression

# Cyclic coordinate descent on the IPCW-weighted partial likelihood,
# warm-started along a 100-point lambda path. cv=K picks lambda by the
# cross-validated partial-likelihood deviance; coefficients + sandwich SEs
# match R crrp::crrp() (Fu et al. 2017) along the whole path to ~1e-6.
pen = PenalizedFineGrayRegression(penalty="lasso", cv=5).fit(X, time=time, event=event)
print(pen.coef_, pen.lambda_min_, pen.lambda_1se_)
pen.coef_path_                                        # (p, n_lambda)
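The comments above describe cyclic coordinate descent warm-started along a lambda path. The sketch below shows that mechanism on a plain least-squares loss, which is a deliberate simplification: the IPCW-weighted partial likelihood used by PenalizedFineGrayRegression is replaced by squared error so the example stays short. soft_threshold and lasso_path are hypothetical names.

```python
import numpy as np


def soft_threshold(z, lam):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def lasso_path(X, y, lambdas, n_sweeps=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 for each lam in `lambdas`.

    `lambdas` should be decreasing; each fit starts from the previous
    solution (warm start). Returns an array of shape (p, n_lambda).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    path = []
    for lam in lambdas:
        for _ in range(n_sweeps):      # fixed number of full cycles
            for j in range(p):         # update one coordinate at a time
                partial_resid = y - X @ beta + X[:, j] * beta[j]
                z = X[:, j] @ partial_resid / n
                beta[j] = soft_threshold(z, lam) / col_sq[j]
        path.append(beta.copy())
    return np.column_stack(path)
```

Starting the path at max|Xᵀy| / n gives an all-zero first solution, and each later fit reuses the previous coefficients as its starting point.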

Detailed walkthroughs (additivity checks, global SHAP importance, sklearn-compatible slicing, performance caveats, rfSRC threshold compatibility) are in docs/quickstart.md, which also covers data format, prediction shapes, cross-validation, GPU, and rfSRC migration.

scikit-learn drop-in. CompetingRiskForest is a real sklearn estimator (BaseEstimator, clone()-friendly, picklable). cross_val_score, KFold, Pipeline work without a wrapper — pass Surv.from_arrays(event, time) as the y argument, or use the legacy 3-arg fit(X, time, event) form. Full example in docs/quickstart.md § Cross-validation.

Roadmap

comprisk is intentionally CR-focused. For non-CR survival methods (general Cox PH, AFT, parametric, deep-survival, Kaplan-Meier as a standalone API), use lifelines or scikit-survival.

| Version | Module | Status |
| --- | --- | --- |
| v0.3 | CompetingRiskForest (CR-RSF) | Shipped |
| v0.4 | FineGrayRegression (subdistribution hazard) | Shipped |
| v0.4 | CumulativeIncidence (stand-alone Aalen-Johansen) | Shipped |
| v0.4 | gray_test (Gray's K-sample log-rank) | Shipped |
| v0.4 | CauseSpecificCox (CR-aware censoring) | Shipped |
| v0.4 | score_cr / calibration_cr (CR-aware evaluation) | Shipped |
| v0.5 | PenalizedFineGrayRegression (LASSO/ridge/elastic-net/MCP/SCAD) | Shipped |
| v1.0 | API freeze + JMLR MLOSS submission | Planned |
| v1.1 | Full GPU rewrite | Planned |

Benchmarks

Headline numbers — full tables, methodology, and reproducibility scripts in docs/benchmarks.md.

vs randomForestSRC, matched-pair on real EHR data:

| Cohort | n × p | Hardware | comprisk | rfSRC OMP-on | Speedup |
| --- | --- | --- | --- | --- | --- |
| CHF (cardio) | 75k × 58 | Apple M4 / i7-14700K / HPC | 5.6–9.4 s | 84.8–207.3 s | 14–22× |
| SEER breast (oncology) | 238k × 17 | HPC Xeon Gold 6148 | 7.0 s | 81.6 s | 11.6× |

Both libraries fit similarly well at every tested workload (HF / cancer-specific C ≈ 0.85). The 10–22× cross-dataset band tracks feature count: rfSRC's per-split exhaustive scan scales with p, so the gap narrows on lower-p cohorts. ~95× speedup vs rfSRC built without OpenMP (default R-on-macOS install).

vs scikit-survival, paired on i7-14700K — synthetic 2-cause Weibull, p = 58, both libraries at their best config:

| n | sksurv low_memory=True | comprisk | Speedup |
| --- | --- | --- | --- |
| 5 000 | 18.2 s | 1.10 s | 16.6× |
| 50 000 | 2935 s (49 min) | 5.40 s | 544× |

The gap widens super-linearly (sksurv ≈ n^2.2; comprisk ≈ n^0.7). comprisk also provides Aalen-Johansen CIF + Nelson-Aalen CHF that sksurv low_memory=True raises NotImplementedError for.
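The quoted scaling exponents can be recovered from the two rows of the table with a short calculation: the exponent is the slope of log(time) against log(n). The sketch below uses only the rounded timings printed above, so ratios may differ from the quoted speedups in the last digit. The variable and function names are illustrative.

```python
import math

# Timings in seconds, copied from the scikit-survival comparison table.
n_small, n_large = 5_000, 50_000
t_sksurv = {n_small: 18.2, n_large: 2935.0}
t_comprisk = {n_small: 1.10, n_large: 5.40}


def scaling_exponent(timings, n_lo, n_hi):
    """Slope of log(time) vs log(n) between two sample sizes."""
    return math.log(timings[n_hi] / timings[n_lo]) / math.log(n_hi / n_lo)


exp_sksurv = scaling_exponent(t_sksurv, n_small, n_large)
exp_comprisk = scaling_exponent(t_comprisk, n_small, n_large)
speedups = {n: t_sksurv[n] / t_comprisk[n] for n in (n_small, n_large)}

print(f"sksurv exponent:   {exp_sksurv:.2f}")
print(f"comprisk exponent: {exp_comprisk:.2f}")
print({n: round(s, 1) for n, s in speedups.items()})
```

With only two sample sizes per library, these exponents are a two-point estimate rather than a fitted scaling law.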

Scaling on a consumer desktop: n = 10⁶ in 63 s on i7-14700K, 14.5 GB RSS. Reproducible via validation/spikes/lambda/exp5_paper_scale_bench.py.

API

Full parameter list in src/comprisk/forest.py; usage by task in docs/quickstart.md. Two splitrules are available: logrankCR (composite competing-risks log-rank, default) and logrank (cause-specific).
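As background for the logrank splitrule, here is a minimal sketch of a cause-specific two-sample log-rank statistic for one candidate split, with competing events treated as censored. It is a generic textbook version under that assumption, not the code in src/comprisk/forest.py, and it does not cover the composite logrankCR rule. cause_specific_logrank and best_threshold are hypothetical names.

```python
import numpy as np


def cause_specific_logrank(time, event, left, cause=1):
    """Standardised log-rank statistic comparing left vs right daughter node."""
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    left = np.asarray(left, bool)
    is_event = event == cause  # competing events and censoring both count as censored

    num, var = 0.0, 0.0
    for t in np.unique(time[is_event]):
        at_risk = time >= t
        y, y_left = at_risk.sum(), (at_risk & left).sum()
        failed = (time == t) & is_event
        d, d_left = failed.sum(), (failed & left).sum()
        num += d_left - y_left * d / y  # observed minus expected in the left node
        if y > 1:
            var += (y_left / y) * (1 - y_left / y) * ((y - d) / (y - 1)) * d
    return num / np.sqrt(var) if var > 0 else 0.0


def best_threshold(x, time, event, cause=1):
    """Scan midpoints of one feature and keep the largest |statistic|."""
    values = np.unique(x)  # assumes at least two distinct values
    candidates = (values[:-1] + values[1:]) / 2
    stats = [abs(cause_specific_logrank(time, event, x <= c, cause))
             for c in candidates]
    return candidates[int(np.argmax(stats))]
```

A forest would typically evaluate only a random subset of features and thresholds at each node; the exhaustive scan here is kept for clarity.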

Documentation

  • Quickstart — common tasks with runnable code
  • PRD — what comprisk aims to be at v1.0
  • Equivalence vs rfSRC — cross-library validation methodology
  • References — algorithmic provenance (Park-Miller, Bays-Durham, Wolbers 2009, Uno 2011, Cole & Hernán 2008, Breiman 2001, Ishwaran 2008/2014, etc.)

Development

Requires uv.

uv venv
uv pip install -e ".[dev]"
uv run pre-commit install
uv run pytest
uv run ruff check .
uv run ruff format --check .

License

Apache-2.0. See LICENSE and NOTICE.

Citation

@software{yang_comprisk_2026,
  author    = {Yang, Sunny and Zhao, Wanqi},
  title     = {{comprisk: a Python toolkit for competing risks}},
  year      = {2026},
  publisher = {Zenodo},
  version   = {0.3.1},
  doi       = {10.5281/zenodo.19876282},
  url       = {https://doi.org/10.5281/zenodo.19876282},
}

DOI is concept-level (always resolves to the latest version). GitHub's "Cite this repository" button generates a version-specific record from CITATION.cff. Algorithmic references in docs/REFERENCES.md.
