coxstream

Exact out-of-core Cox proportional hazards regression via streaming Newton-Raphson.

Standard CoxPH solvers (lifelines, scikit-survival, R survival) load the full cohort into memory before fitting, so on registry-scale data they run out of RAM long before the computation itself becomes expensive. coxstream computes the exact Efron partial-likelihood estimate by streaming a single time-sorted pass over the data per Newton-Raphson iteration, holding only O(p^2) state for p covariates. Working memory is therefore independent of the number of observations n: the model fits on a workstation even when the cohort is far larger than RAM.
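
For reference, the standard Efron log partial likelihood can be written in terms of three running sums, which is where the O(p^2) state comes from. The notation below (t_j, D_j, R_j, S^{(r)}) is introduced here for illustration and is not taken from the package or the paper.

```latex
% Distinct event times t_1 > ... > t_J; D_j = events at t_j, d_j = |D_j|;
% R_j = { i : T_i >= t_j } is the risk set at t_j.
\[
S^{(0)}_A = \sum_{i \in A} e^{x_i^\top \beta}, \qquad
S^{(1)}_A = \sum_{i \in A} e^{x_i^\top \beta}\, x_i, \qquad
S^{(2)}_A = \sum_{i \in A} e^{x_i^\top \beta}\, x_i x_i^\top
\]
\[
\ell(\beta) = \sum_{j=1}^{J} \left[ \sum_{i \in D_j} x_i^\top \beta
  - \sum_{k=0}^{d_j - 1} \log\!\left( S^{(0)}_{R_j} - \tfrac{k}{d_j}\, S^{(0)}_{D_j} \right) \right]
\]
\[
\nabla \ell(\beta) = \sum_{j=1}^{J} \left[ \sum_{i \in D_j} x_i
  - \sum_{k=0}^{d_j - 1}
    \frac{S^{(1)}_{R_j} - \tfrac{k}{d_j}\, S^{(1)}_{D_j}}
         {S^{(0)}_{R_j} - \tfrac{k}{d_j}\, S^{(0)}_{D_j}} \right]
\]
```

Because R_j only grows as t_j decreases, a pass in descending time updates the risk-set sums by adding rows and never revisits earlier ones. The Hessian needs S^{(2)} in the same way, so the largest object held is a p x p matrix.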

The streamed estimate is identical to the in-memory maximum-likelihood estimate, not an approximation of it, and the Efron tie correction is carried across chunk boundaries, so heavily tied data are handled exactly.
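
A minimal numpy sketch of the idea, not the package's Cython kernel: the class and function names (EfronPass, fit_streaming) are hypothetical, and the Newton-Raphson loop has no step-halving or other safeguards. It shows how a tie group can stay open across a chunk boundary and is only closed once a strictly smaller duration arrives.

```python
import numpy as np


class EfronPass:
    """State for one streamed pass at a fixed beta: O(p^2) numbers, whatever n is."""

    def __init__(self, beta):
        self.beta = np.asarray(beta, dtype=float)
        p = self.beta.size
        # Risk-set sums over every row seen so far (all have time >= current time).
        self.s0, self.s1, self.s2 = 0.0, np.zeros(p), np.zeros((p, p))
        # Sums over the events of the tie group that is still open.
        self.t, self.d = None, 0
        self.e0, self.e1, self.e2 = 0.0, np.zeros(p), np.zeros((p, p))
        self.ex = np.zeros(p)
        # Accumulated log partial likelihood, gradient, and Hessian.
        self.loglik, self.grad, self.hess = 0.0, np.zeros(p), np.zeros((p, p))

    def _close_group(self):
        """Add the Efron terms of the open tie group, then reset its event sums."""
        for k in range(self.d):
            f = k / self.d
            den = self.s0 - f * self.e0
            num = self.s1 - f * self.e1
            self.loglik -= np.log(den)
            self.grad -= num / den
            self.hess -= (self.s2 - f * self.e2) / den - np.outer(num, num) / den**2
        self.loglik += self.ex @ self.beta
        self.grad += self.ex
        self.d, self.e0 = 0, 0.0
        self.e1, self.e2 = np.zeros_like(self.e1), np.zeros_like(self.e2)
        self.ex = np.zeros_like(self.ex)

    def update(self, durations, events, X):
        """Consume one chunk that is sorted by descending duration."""
        if len(durations) == 0:
            return
        w = np.exp(X @ self.beta)
        cuts = np.flatnonzero(np.diff(durations) != 0) + 1
        for idx in np.split(np.arange(len(durations)), cuts):
            t = durations[idx[0]]
            if self.t is not None and t != self.t:
                if t > self.t:
                    raise ValueError("rows are not sorted by descending duration")
                self._close_group()  # a strictly smaller time closes the open group
            self.t = t
            xi, wi = X[idx], w[idx]
            self.s0 += wi.sum()
            self.s1 += wi @ xi
            self.s2 += (xi * wi[:, None]).T @ xi
            ev = events[idx].astype(bool)
            xe, we = xi[ev], wi[ev]
            self.d += int(ev.sum())
            self.e0 += we.sum()
            self.e1 += we @ xe
            self.e2 += (xe * we[:, None]).T @ xe
            self.ex += xe.sum(axis=0)

    def finish(self):
        self._close_group()  # the last group closes at end of stream
        return self.loglik, self.grad, self.hess


def fit_streaming(make_chunks, p, max_iter=25, tol=1e-10):
    """Newton-Raphson with one full pass over the chunks per iteration."""
    beta = np.zeros(p)
    for _ in range(max_iter):
        state = EfronPass(beta)
        for durations, events, X in make_chunks():
            state.update(durations, events, X)
        _, grad, hess = state.finish()
        step = np.linalg.solve(-hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Demo on synthetic, heavily tied data (durations rounded to one decimal).
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
durations = np.round(rng.exponential(1.0 / np.exp(X @ np.array([0.5, -0.3, 0.2]))), 1)
events = (rng.random(n) < 0.7).astype(int)
order = np.argsort(-durations, kind="stable")
durations, events, X = durations[order], events[order], X[order]


def chunks(size):
    return lambda: ((durations[i:i + size], events[i:i + size], X[i:i + size])
                    for i in range(0, n, size))


beta_one_chunk = fit_streaming(chunks(n), p)      # whole cohort as a single chunk
beta_small_chunks = fit_streaming(chunks(7), p)   # boundaries cut through tie groups
```

The two fits should differ only by floating-point rounding, because a chunk boundary never forces a tie group to close early.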

Figure: memory vs. speed against lifelines and R survival::coxph. coxstream's peak RAM stays flat in the number of rows while the in-memory solvers scale with n; coefficients agree to machine precision. See the accompanying paper for the full methodology.

Install

pip install coxstream               # core (numpy only)
pip install 'coxstream[parquet]'    # + out-of-core fit_parquet (pyarrow)

The package builds a small Cython kernel, so a C compiler is required.

Usage

In memory:

import numpy as np
from coxstream import CoxStream

# durations, events: length-n arrays; X: (n, p) covariate matrix;
# names: the p covariate labels.
model = CoxStream().fit(durations, events, X, feature_names=names)
print(model.coef_, model.n_iter_)

Out of core, from a Parquet file pre-sorted by descending event time (never materialises the cohort):

from coxstream import CoxStream

# The file must already be sorted by duration DESC. `fit_parquet` verifies this
# from the Parquet footer statistics alone (no full pass) and rejects a file
# that is out of order; pass assume_sorted=True to skip the check.
#
# Sort it once with an out-of-core sorter. duckdb and polars both spill to disk,
# so they handle a cohort larger than RAM (a sort-engine benchmark found these
# the fastest):
#   duckdb:  COPY (SELECT * FROM 'cohort.parquet' ORDER BY duration DESC)
#            TO 'cohort_desc.parquet' (FORMAT PARQUET);
#   polars:  (pl.scan_parquet("cohort.parquet")
#              .sort("duration", descending=True)
#              .sink_parquet("cohort_desc.parquet"))
#   R:       duckdb via its R client runs the same COPY ... ORDER BY DESC.
# If the cohort fits in RAM, skip the file and call .fit, which sorts for you.

model = CoxStream().fit_parquet(
    "cohort_desc.parquet",
    duration_col="duration",
    event_col="event",
    covariate_cols=["age_std", "sex", "treatment"],
)
print(model.coef_)

To validate a file's order ahead of time (a dry run, for example as a CI or pipeline gate right after sorting and before a long fit), call check_sorted. It runs the same footer-only check without fitting and raises on a file that is provably out of order:

from coxstream import check_sorted

check_sorted("cohort_desc.parquet", duration_col="duration")  # raises if unsorted

It doubles as a shell gate: the uncaught exception makes the Python process exit non-zero on an out-of-order file, so a pipeline step can fail fast without a bespoke CLI:

python -c "import coxstream; coxstream.check_sorted('cohort_desc.parquet', 'duration')"
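
To illustrate what a footer-only check can and cannot establish, here is a rough sketch built on pyarrow's row-group statistics. It is not coxstream's implementation: first_violation and row_group_ranges are hypothetical helpers, and the sketch assumes a flat schema. Footer statistics give only a min and max per row group, so they can prove that a file is out of order but cannot confirm the order of rows inside a row group.

```python
def first_violation(ranges):
    """Return the index of the first row group that proves the file is not
    sorted descending, or None if the footer statistics show no violation.

    ranges: one (min, max) tuple per row group in file order, or None where
    the writer stored no statistics for that row group.
    """
    floor = None  # smallest minimum seen in any earlier row group
    for g, stats in enumerate(ranges):
        if stats is None:
            continue  # no statistics: this row group can prove nothing
        lo, hi = stats
        if floor is not None and hi > floor:
            return g  # a later row is larger than an earlier one
        floor = lo if floor is None else min(floor, lo)
    return None


def row_group_ranges(path, column):
    """Read per-row-group (min, max) for one column from the Parquet footer."""
    import pyarrow.parquet as pq  # imported lazily: only needed for real files

    meta = pq.ParquetFile(path).metadata
    j = meta.schema.names.index(column)  # assumes a flat (non-nested) schema
    ranges = []
    for g in range(meta.num_row_groups):
        stats = meta.row_group(g).column(j).statistics
        if stats is not None and stats.has_min_max:
            ranges.append((stats.min, stats.max))
        else:
            ranges.append(None)
    return ranges
```

For a real file this would be called as first_violation(row_group_ranges("cohort_desc.parquet", "duration")). Only the footer is read, so the cost does not depend on the number of rows.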

Validation

coxstream is verified against lifelines and R survival::coxph:

It reproduces the in-memory maximum-likelihood estimate to machine precision on synthetic data.
On the heavily tied Synthea 100K cohort (51% of event times tied) it matches lifelines to ~1e-6.
Peak resident memory is flat in n while in-memory solvers grow with the cohort and eventually exhaust RAM.

The package's own test suite is dependency-free: it checks exactness against a self-contained plain-numpy Cox Newton-Raphson reference. The cross-checks against lifelines and R survival::coxph above live in the accompanying benchmark and paper.
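
For readers who want a feel for what such a reference looks like, here is a small in-memory Newton-Raphson on the Efron partial likelihood in plain numpy. It is a sketch written for this description, not the reference shipped in the test suite; the function name efron_reference and the synthetic data are assumptions, and there is no step-halving.

```python
import numpy as np


def efron_reference(durations, events, X, max_iter=25, tol=1e-10):
    """In-memory Newton-Raphson for the Efron partial likelihood.

    Rebuilds every risk set with a boolean mask, so it is simple to audit
    but needs the whole cohort in memory.
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events).astype(bool)
    X = np.asarray(X, dtype=float)
    beta = np.zeros(X.shape[1])
    event_times = np.unique(durations[events])
    for _ in range(max_iter):
        w = np.exp(X @ beta)
        grad = np.zeros_like(beta)
        hess = np.zeros((beta.size, beta.size))
        for t in event_times:
            at_risk = durations >= t
            dead = events & (durations == t)
            d = int(dead.sum())
            xr, wr = X[at_risk], w[at_risk]
            xd, wd = X[dead], w[dead]
            s0, s1, s2 = wr.sum(), wr @ xr, (xr * wr[:, None]).T @ xr
            e0, e1, e2 = wd.sum(), wd @ xd, (xd * wd[:, None]).T @ xd
            grad += xd.sum(axis=0)
            for k in range(d):  # Efron: remove k/d of the tied events' weight
                f = k / d
                den = s0 - f * e0
                num = s1 - f * e1
                grad -= num / den
                hess -= (s2 - f * e2) / den - np.outer(num, num) / den**2
        step = np.linalg.solve(-hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Synthetic tied data; no sorting is needed because risk sets come from masks.
rng = np.random.default_rng(1)
n, p = 200, 2
X = rng.normal(size=(n, p))
durations = np.round(rng.exponential(1.0 / np.exp(X @ np.array([0.4, -0.6]))), 1)
events = (rng.random(n) < 0.8).astype(int)
beta_hat = efron_reference(durations, events, X)
```

An exactness test would then compare a streamed fit against beta_hat on the same data, with a tolerance near floating-point rounding.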

The methodology and full results are in the accompanying paper (see Citation).

Scope

coxstream implements the exact Efron partial likelihood for large-n, modest-p tabular survival data. It is a focused estimator, not a full survival suite: it does not provide baseline-hazard estimation, time-varying covariates, or proportional-hazards diagnostics.

Testing

pip install -e '.[test]'           # core suite (numpy only)
pip install -e '.[test,parquet]'   # + the out-of-core fit_parquet test
pytest

Citation

If you use coxstream, please cite it via the metadata in CITATION.cff.

License

MIT. See LICENSE.

This version

0.1.0

Jun 14, 2026

File details

Details for the file coxstream-0.1.0.tar.gz.

File metadata

Download URL: coxstream-0.1.0.tar.gz
Upload date: Jun 14, 2026
Size: 178.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for coxstream-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c5af06e29b67510383627a24d6908090d2e5103cf615259d27afb93096d0c2bd`
MD5	`d869d3e24b45cb028c2d895301fa2652`
BLAKE2b-256	`1ab4aae39ed97e2363f389629017b53448d4666a3f5db649cbf99d30d8051c0a`

File details

Details for the file coxstream-0.1.0-cp313-cp313-macosx_10_13_universal2.whl.

File metadata

Download URL: coxstream-0.1.0-cp313-cp313-macosx_10_13_universal2.whl
Upload date: Jun 14, 2026
Size: 346.1 kB
Tags: CPython 3.13, macOS 10.13+ universal2 (ARM64, x86-64)
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for coxstream-0.1.0-cp313-cp313-macosx_10_13_universal2.whl
Algorithm	Hash digest
SHA256	`9501f0aef76bcb52551b027fd0ca385a99405966b30399313e3f2f3abfae1171`
MD5	`6275f4fcb1264d38a4a57d095e49c6a1`
BLAKE2b-256	`db434b06a314f3884e4e988c85a1b09f61b213512d5fe98f2a1644b911d93c20`
