Skip to main content

Exact out-of-core Cox proportional hazards regression via streaming Newton-Raphson

Project description

coxstream

Exact out-of-core Cox proportional hazards regression via streaming Newton-Raphson.

PyPI

Standard CoxPH solvers (lifelines, scikit-survival, R survival) load the full cohort into memory before fitting, so on registry-scale data they exhaust RAM long before the computation is hard. coxstream computes the exact Efron partial-likelihood estimate by streaming a single time-sorted pass over the data per Newton-Raphson iteration, holding only O(p^2) state for p covariates. Working memory is therefore independent of the number of observations n: the model fits on a workstation even when the cohort is far larger than RAM.

The streamed estimate is the in-memory maximum-likelihood estimate, and the Efron tie correction is carried across chunk boundaries, so heavily tied data are handled exactly.

coxstream holds peak RAM flat as the cohort grows, while in-memory solvers (lifelines, R survival::coxph) scale with n; coefficients agree to machine precision.

Memory vs. speed against lifelines and R survival::coxph: coxstream's peak RAM stays flat in the number of rows while in-memory solvers grow with the cohort, at matching coefficients. See the accompanying paper for the full methodology.

Install

pip install coxstream             # core (numpy only)
pip install coxstream[parquet]    # + out-of-core fit_parquet (pyarrow)

The package builds a small Cython kernel, so a C compiler is required.

Usage

In memory:

import numpy as np
from coxstream import CoxStream

model = CoxStream().fit(durations, events, X, feature_names=names)
print(model.coef_, model.n_iter_)

Out of core, from a Parquet file pre-sorted by descending event time (never materialises the cohort):

from coxstream import CoxStream

# The file must already be sorted by duration DESC. `fit_parquet` verifies this
# from the Parquet footer statistics alone (no full pass) and rejects a file
# that is out of order; pass assume_sorted=True to skip the check.
#
# Sort it once with an out-of-core sorter -- both spill to disk, so they handle
# a cohort larger than RAM (a sort-engine benchmark found these the fastest):
#   duckdb:  COPY (SELECT * FROM 'cohort.parquet' ORDER BY duration DESC)
#            TO 'cohort_desc.parquet' (FORMAT PARQUET);
#   polars:  (pl.scan_parquet("cohort.parquet")
#              .sort("duration", descending=True)
#              .sink_parquet("cohort_desc.parquet"))
#   R:       duckdb via its R client runs the same COPY ... ORDER BY DESC.
# If the cohort fits in RAM, skip the file and call .fit, which sorts for you.

model = CoxStream().fit_parquet(
    "cohort_desc.parquet",
    duration_col="duration",
    event_col="event",
    covariate_cols=["age_std", "sex", "treatment"],
)
print(model.coef_)

To validate a file's order ahead of time -- a dry run, e.g. a CI or pipeline gate right after you sort and before a long fit -- call check_sorted, which runs the same footer-only check without fitting and raises on a file that is provably out of order:

from coxstream import check_sorted

check_sorted("cohort_desc.parquet", duration_col="duration")  # raises if unsorted

It doubles as a shell gate -- it exits non-zero on an out-of-order file, so a pipeline step can fail fast without a bespoke CLI:

python -c "import coxstream; coxstream.check_sorted('cohort_desc.parquet', 'duration')"

Validation

coxstream is verified against lifelines and R survival::coxph:

  • It reproduces the in-memory maximum-likelihood estimate to machine precision on synthetic data.
  • On the heavily tied Synthea 100K cohort (51 % of event times tied) it matches lifelines to ~1e-6.
  • Peak resident memory is flat in n while in-memory solvers grow with the cohort and eventually exhaust RAM.

The package's own test suite is dependency-free: it checks exactness against a self-contained plain-numpy Cox Newton-Raphson reference. The cross-checks against lifelines and R survival::coxph above live in the accompanying benchmark and paper.

The methodology and full results are in the accompanying paper (see Citation).

Scope

coxstream implements the exact Efron partial likelihood for large-n, modest-p tabular survival data. It is a focused estimator, not a full survival suite: it does not provide baseline-hazard estimation, time-varying covariates, or proportional-hazards diagnostics.

Testing

pip install -e '.[test]'           # core suite (numpy only)
pip install -e '.[test,parquet]'   # + the out-of-core fit_parquet test
pytest

Citation

If you use coxstream, please cite it via the metadata in CITATION.cff.

License

MIT. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

coxstream-0.1.0.tar.gz (178.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

coxstream-0.1.0-cp313-cp313-macosx_10_13_universal2.whl (346.1 kB view details)

Uploaded CPython 3.13macOS 10.13+ universal2 (ARM64, x86-64)

File details

Details for the file coxstream-0.1.0.tar.gz.

File metadata

  • Download URL: coxstream-0.1.0.tar.gz
  • Upload date:
  • Size: 178.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for coxstream-0.1.0.tar.gz
Algorithm Hash digest
SHA256 c5af06e29b67510383627a24d6908090d2e5103cf615259d27afb93096d0c2bd
MD5 d869d3e24b45cb028c2d895301fa2652
BLAKE2b-256 1ab4aae39ed97e2363f389629017b53448d4666a3f5db649cbf99d30d8051c0a

See more details on using hashes here.

File details

Details for the file coxstream-0.1.0-cp313-cp313-macosx_10_13_universal2.whl.

File metadata

File hashes

Hashes for coxstream-0.1.0-cp313-cp313-macosx_10_13_universal2.whl
Algorithm Hash digest
SHA256 9501f0aef76bcb52551b027fd0ca385a99405966b30399313e3f2f3abfae1171
MD5 6275f4fcb1264d38a4a57d095e49c6a1
BLAKE2b-256 db434b06a314f3884e4e988c85a1b09f61b213512d5fe98f2a1644b911d93c20

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page