Streaming generalized linear models with bounded memory.

Project description

renew-glm

Streaming generalized linear models with bounded memory -- a Python port of the renewable-estimation algorithm of Luo & Song (2020).

[Figure: cost trade-off at n = 128M rows, gaussian family]

from renew_glm import RenewGLM

# Streaming (recommended): one chunk at a time, never stores prior chunks.
# Peak RAM is O(p^2 + chunk_size * p), independent of n.
def chunk_fn():
    for X_chunk, y_chunk in source_iter():   # your generator
        yield X_chunk, y_chunk               # X must include intercept column

model = RenewGLM(family="binomial").fit_streaming(chunk_fn)
print(model.coef_, model.n_iter_)

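# Or chunk-buffered (matches the original R API):
model = RenewGLM(family="poisson")
for X_chunk, y_chunk in chunks:          # chunks: any iterable of (X, y) pairs
    model.partial_fit(X_chunk, y_chunk)  # buffers the chunk in memory
model.fit()                              # fits on everything buffered so far
print(model.coef_, model.se_, model.pvalue_)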

chunk_fn is a zero-argument callable returning an iterator of (X, y) tuples. fit_streaming consumes the iterator once and discards each chunk after use; partial_fit + fit buffers all chunks in memory so it can compute se_ and pvalue_ like the R reference.
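
As an illustration of the chunk_fn contract described above, the sketch below wraps two NumPy arrays in a zero-argument callable. make_chunk_fn and its chunk_size parameter are hypothetical names, not part of renew-glm; the helper assumes X holds raw features only and prepends the intercept column itself.

```python
import numpy as np

def make_chunk_fn(X, y, chunk_size=100_000):
    """Return a zero-argument callable that yields (X_chunk, y_chunk) tuples.

    Hypothetical helper, not part of renew-glm. X holds raw features only;
    a leading column of ones is prepended to every chunk as the intercept.
    """
    n = X.shape[0]

    def chunk_fn():
        # A fresh generator is created on every call, so the callable can be
        # handed to a consumer that iterates over the data exactly once.
        for start in range(0, n, chunk_size):
            stop = min(start + chunk_size, n)
            ones = np.ones((stop - start, 1))
            yield np.hstack([ones, X[start:stop]]), y[start:stop]

    return chunk_fn
```

Under these assumptions, RenewGLM(family="binomial").fit_streaming(make_chunk_fn(X, y)) would consume the chunks once. If X comes from np.load(..., mmap_mode="r"), each slice is read from disk only when its chunk is built.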

Supports gaussian, binomial, and poisson families. Coefficients converge to the maximum-likelihood point of the full data; agreement with statsmodels.GLM is verified to ~1e-3 on the test suite.

Why

Standard in-memory tools (statsmodels.GLM, R's glm()) load the entire design matrix before fitting. At n = 16 M rows, statsmodels.GLM allocates ~8 GB and OOMs above n ~ 20 M on a 16 GB laptop. This package fits the same model in bounded memory -- one chunk at a time, O(p^2) state regardless of n.
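
A back-of-the-envelope calculation makes the memory gap concrete. The column count p is not stated above; p = 64 is an assumption chosen because it reproduces the ~8 GB figure for a float64 design matrix at n = 16 M. The chunk size of 100,000 rows is likewise illustrative.

```python
def design_matrix_bytes(n, p, itemsize=8):
    """Bytes needed to hold the full n x p design matrix (float64 by default)."""
    return n * p * itemsize

def streaming_bytes(p, chunk_size, itemsize=8):
    """Bytes for the O(p^2) state plus one chunk, ignoring small temporaries."""
    return (p * p + chunk_size * p) * itemsize

p = 64  # assumed column count, for illustration only
full = design_matrix_bytes(16_000_000, p)
stream = streaming_bytes(p, chunk_size=100_000)
print(f"{full / 1e9:.1f} GB vs {stream / 1e6:.1f} MB")  # prints "8.2 GB vs 51.2 MB"
```

The first number grows linearly with n; the second does not depend on n at all.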

Other Python GLM options exist with different trade-offs:

  • glum (Quantco) -- fast in-memory; reaches the same MLE we get (to ~1e-10 in the cross-method comparison under Correctness), but caps out where the full design matrix fits in RAM, same as statsmodels. Recommended if your n fits.
  • dask-glm -- distributed; useful at multi-machine scale, has scheduler overhead at single-host scale.
  • pyglmnet -- gradient-based, regularization-focused; supports several families but the unregularized path is slower than IRLS.
  • scikit-learn SGD -- online SGD, approximate (not exact MLE).

This package targets the gap: exact-MLE streaming on a single host, no full-design-matrix RAM. Its chunk-buffered path carries the same Wald inference (se_, pvalue_) that the R references (biglm, RenewGLM_pkg) expose.

Install

pip install renew-glm

Requires only NumPy and SciPy. Pure Python -- no C extension, no compile step, no platform-specific wheels.

Correctness

Across seven independent implementations spanning Python (NumPy / Numba / JAX), an in-memory Python competitor (glum), a distributed Python competitor (dask-glm), and a cross-language reference (R's biglm), the coefficient estimates agree to between ~1e-10 and floating-point machine epsilon (~1e-15) on the largest workload we test.

[Figure: cross-method coefficient agreement at n = 128M]

Each cell shows log10(max|beta_i - beta_j|). The renew-glm / dask-glm corners hit FP epsilon at -15; glum and biglm sit four to five orders of magnitude looser at -10/-11 (a consequence of the algorithm-decomposition path, not of convergence). The JAX entry uses jax_enable_x64=True -- the silent-float32 default would land far looser, since float32 machine epsilon is only ~1.2e-7.
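
The per-cell metric can be reproduced in a few lines of NumPy. This is a minimal sketch, not the benchmark script itself; agreement_matrix is a hypothetical name and the input is assumed to be a dict mapping a method name to its coefficient vector.

```python
import numpy as np

def agreement_matrix(coefs):
    """Return (names, M) where M[i, j] = log10(max|beta_i - beta_j|).

    Identical vectors (including the diagonal) are reported as -inf.
    """
    names = list(coefs)
    k = len(names)
    out = np.full((k, k), -np.inf)
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            a = np.asarray(coefs[names[i]], dtype=float)
            b = np.asarray(coefs[names[j]], dtype=float)
            diff = np.max(np.abs(a - b))
            if diff > 0:
                out[i, j] = np.log10(diff)
    return names, out

# Floor of the metric for each precision: log10 of machine epsilon.
floor64 = np.log10(np.finfo(np.float64).eps)  # about -15.7
floor32 = np.log10(np.finfo(np.float32).eps)  # about -6.9
```

The two floor values show why a float32 run cannot reach the -15 corner regardless of how well the solver converges.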

Scaling

Wall time and peak RAM as n grows from 4M to 128M rows (cold runs; single 16 GB laptop). The in-memory baselines (statsmodels) terminate where they OOM; renew-glm and biglm keep going at bounded RAM.

[Figure: scaling of wall time and peak RAM vs n]

Algorithm

For each chunk:

  1. Compute Fisher information H_b = X' diag(W_b) X at the current coefficient.
  2. Inner Newton-Raphson against H_b + sum_of_prior_Fishers, with a penalty term that pulls toward the previous coefficient.
  3. Update sum_of_prior_Fishers += H_b and move to the next chunk.

One outer pass over the chunks suffices: the accumulated Fisher information and the previous coefficient summarize the earlier chunks, so the penalty term lets each new chunk update the estimate without revisiting prior data. The docstring in _irls.py matches the paper notation.
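
The three steps can be sketched in plain NumPy as follows. This is an illustration under stated assumptions, not the code in _irls.py: renewable_fit and _mean_and_weight are hypothetical names, only canonical links are handled, the chunk Fisher information is recomputed at every inner iteration (a full Newton step), and np.linalg.solve stands in for whatever factorization the package uses. The stopping rule is the |g' d_beta| < tol criterion.

```python
import numpy as np

def _mean_and_weight(eta, family):
    """Mean and IRLS weight under the canonical link of each family."""
    if family == "gaussian":
        return eta, np.ones_like(eta)
    if family == "binomial":
        mu = 1.0 / (1.0 + np.exp(-eta))
        return mu, mu * (1.0 - mu)
    if family == "poisson":
        mu = np.exp(eta)
        return mu, mu
    raise ValueError(f"unknown family: {family}")

def renewable_fit(chunks, family="gaussian", tol=1e-10, max_iter=50):
    """One pass over (X, y) chunks keeping only O(p^2) state."""
    beta = None
    for X, y in chunks:
        if beta is None:
            p = X.shape[1]
            beta = np.zeros(p)
            fisher_sum = np.zeros((p, p))  # sum of Fisher info of prior chunks
        beta_prev = beta.copy()
        for _ in range(max_iter):
            mu, w = _mean_and_weight(X @ beta, family)
            # Step 1: Fisher information of this chunk, X' diag(w) X.
            H_b = X.T @ (w[:, None] * X)
            # Step 2: chunk score plus the penalty pulling toward beta_prev.
            g = X.T @ (y - mu) + fisher_sum @ (beta_prev - beta)
            d_beta = np.linalg.solve(H_b + fisher_sum, g)
            beta = beta + d_beta
            if abs(g @ d_beta) < tol:
                break
        # Step 3: fold this chunk's Fisher information into the running sum.
        _, w = _mean_and_weight(X @ beta, family)
        fisher_sum += X.T @ (w[:, None] * X)
    return beta
```

For the gaussian family the penalized update is linear, so this sketch reproduces the full-data least-squares solution up to rounding; for binomial and poisson it only illustrates the structure of the update.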

Differences from the original R package

  • No SE in the streaming path. After fit_streaming(chunk_fn), only coef_ and n_iter_ are populated. The chunk-buffered partial_fit + fit() path computes se_ and pvalue_ like the R version.
  • Pure NumPy/SciPy. No C extension; portable and easy to install.
  • Convergence tolerance uses |g' d_beta| < tol (same as the R version's df_beta criterion).

Roadmap (post v0.1.0)

Deferred to keep the first release minimal; pull requests welcome.

  • Formula API (patsy / formulaic) -- a from_formula("y ~ x1 + C(category)", chunk_source=...) constructor so users coming from statsmodels.GLM.from_formula(...) or R's glm(y ~ ...) don't have to build the design matrix themselves. The streaming twist: category levels must be discovered before fitting, so the API will require either a first-pass discover_levels(source) step or an explicit levels={...} argument. First-chunk-lock-in (reject any chunk introducing a new level, with a clear error) is the most likely default. Workaround today: use patsy.dmatrices(...) per-chunk and feed the resulting arrays to partial_fit / fit_streaming.
  • Gamma + inverse-Gaussian families -- the algorithm generalises (any exponential-dispersion family with a known link works), but the test suite only covers gaussian / binomial / poisson. Adding a family is ~10 LOC of weight + mu functions in _irls.py plus a test case.
  • Optional Cholesky -> Givens-QR path -- the current code uses scipy.linalg.cho_factor on X' W X + sum_prior_Fisher. For ill-conditioned designs a Givens-QR path on [W^{1/2} X ; previous-R] would be more numerically stable; the bench's R reference (biglm) already does this and our coefficients agree to ~1e-6, suggesting the Cholesky path is sufficient for typical inputs. Track this if a real user hits a conditioning problem.
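
The first-chunk-lock-in behaviour mentioned in the Formula API item could look roughly like the sketch below. LevelLockedEncoder is a hypothetical helper, not a planned or existing renew-glm class; it assumes a single categorical column, treatment coding with the first sorted level as baseline, and levels taken from the first chunk unless passed explicitly.

```python
import numpy as np

class LevelLockedEncoder:
    """Dummy-code one categorical column with levels fixed up front.

    Hypothetical helper for illustration. Levels come from the `levels`
    argument or, if omitted, from the first chunk seen. Any later chunk
    containing an unseen level is rejected with a ValueError.
    """

    def __init__(self, levels=None):
        self.levels = list(levels) if levels is not None else None

    def transform(self, values):
        values = np.asarray(values)
        seen = set(values.tolist())
        if self.levels is None:
            self.levels = sorted(seen)  # lock in levels from the first chunk
        unseen = sorted(seen - set(self.levels))
        if unseen:
            raise ValueError(f"chunk introduces new levels: {unseen}")
        # Treatment coding: the first level is the baseline and gets no column.
        out = np.zeros((len(values), len(self.levels) - 1))
        for j, level in enumerate(self.levels[1:]):
            out[:, j] = values == level
        return out
```

Because every chunk is encoded against the same level list, the design matrix keeps the same column count and order from chunk to chunk, which the accumulated p x p state requires.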

Credit

License

GPL-2.0-or-later, matching the license of the original R package.

Download files

Source Distribution

renew_glm-0.1.0.tar.gz (316.4 kB view details)

Uploaded Source

Built Distribution

renew_glm-0.1.0-py3-none-any.whl (15.3 kB view details)

Uploaded Python 3

File details

Details for the file renew_glm-0.1.0.tar.gz.

File metadata

  • Download URL: renew_glm-0.1.0.tar.gz
  • Upload date:
  • Size: 316.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for renew_glm-0.1.0.tar.gz
Algorithm Hash digest
SHA256 0e6b8de9bec0cbb2e4f255885718e059900157a910dd7e165cc9c1aa20a54932
MD5 ef95b62eb49b373aa677f6a27e029bd3
BLAKE2b-256 fce26db93ae1dbfe10b5827ae7933ad5a25466ddac55800c83b321682355b08b

File details

Details for the file renew_glm-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: renew_glm-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 15.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for renew_glm-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b40869ecb637070da4315958e284ccfa7963cdce9085b5a0fbd5f224d7867ec4
MD5 b878547e4b72fb837ce0c8e3d10c946f
BLAKE2b-256 2e6e02e3f176948582920782bebfce223f838b060c4af9f4a1add8ae19f7bf85
