Skip to main content

Streaming generalized linear models with bounded memory.

Project description

renew-glm

Streaming generalized linear models with bounded memory -- a Python port of the renewable-estimation algorithm of Luo & Song (2020).

GLM benchmark: cost, accuracy, scaling

from renew_glm import RenewGLM

# Streaming (recommended): one chunk at a time, never stores prior chunks.
# Peak RAM is O(p^2 + chunk_size * p), independent of n.
def chunk_fn():
    for X_chunk, y_chunk in source_iter():   # your generator
        yield X_chunk, y_chunk               # X must include intercept column

model = RenewGLM(family="binomial").fit_streaming(chunk_fn)
print(model.coef_, model.n_iter_)

# Or chunk-buffered (matches the original R API):
model = RenewGLM(family="poisson")
for X_chunk, y_chunk in chunks:
    model.partial_fit(X_chunk, y_chunk)
model.fit()
print(model.coef_, model.se_, model.pvalue_)

chunk_fn is a zero-argument callable returning an iterator of (X, y) tuples. fit_streaming consumes the iterator once and discards each chunk after use; partial_fit + fit buffers all chunks in memory so it can compute se_ and pvalue_ like the R reference.

Supports gaussian, binomial, and poisson families. Coefficients converge to the maximum-likelihood point of the full data; agreement with statsmodels.GLM is verified to ~1e-3 on the test suite.

Why

Standard in-memory tools (statsmodels.GLM, R's glm()) load the entire design matrix before fitting. At n = 16 M rows, statsmodels.GLM allocates ~8 GB and OOMs above n ~ 20 M on a 16 GB laptop. This package fits the same model in bounded memory -- one chunk at a time, O(p^2) state regardless of n.

Other Python GLM options exist with different trade-offs:

  • glum (Quantco) -- fast in-memory; bit-identical to the closed-form MLE we get, but caps out where the full design matrix fits in RAM, same as statsmodels. Recommended if your n fits.
  • dask-glm -- distributed; useful at multi-machine scale, has scheduler overhead at single-host scale.
  • pyglmnet -- gradient-based, regularization-focused; supports several families but the unregularized path is slower than IRLS.
  • scikit-learn SGD -- online SGD, approximate (not exact MLE).

This package targets the gap: exact-MLE streaming on a single host, no full-design-matrix RAM. It carries the same Wald inference the R reference (biglm, RenewGLM_pkg) exposes.

Install

pip install renew-glm

Requires only NumPy and SciPy. Pure Python -- no C extension, no compile step, no platform-specific wheels.

Correctness

Across seven independent implementations spanning Python (NumPy / Numba / JAX), an in-memory Python competitor (glum), a distributed Python competitor (dask-glm), and a cross-language reference (R's biglm), the coefficient estimates agree to floating-point machine epsilon (~1e-15) on the largest workload we test. The heatmap inset in the benchmark figure above shows pairwise log10(max|beta_i - beta_j|): renew-glm hits FP epsilon (-15) against the streaming-Cholesky cluster; in-memory in-memory references (glum, statsmodels, glm (R)) sit one order out at -10/-11 (algorithm-decomposition path, not convergence). The JAX backend uses jax_enable_x64=True -- the silent-float32 default would land an order of magnitude looser.

Scaling

The right-side line panels in the benchmark figure show wall time and peak RAM as n grows from 1M to 128M rows (cold runs). The in-memory baselines (statsmodels, glm (R)) terminate at the OOM cliff (~9 GB); renew-glm stays flat at bounded RAM.

Algorithm

For each chunk:

  1. Compute Fisher information H_b = X' diag(W_b) X at the current coefficient.
  2. Inner Newton-Raphson against H_b + sum_of_prior_Fishers, with a penalty term that pulls toward the previous coefficient.
  3. Update sum_of_prior_Fishers += H_b and move to the next chunk.

One outer pass over the chunks suffices because the penalty term lets each chunk contribute meaningfully without revisiting prior data. The docstring in _irls.py matches the paper notation.

Differences from the original R package

  • No SE in the streaming path. fit_streaming(chunk_fn) returns only coef_ and n_iter_. The chunk-buffered partial_fit + fit() path computes se_ and pvalue_ like the R version.
  • Pure NumPy/SciPy. No C extension; portable and easy to install.
  • Convergence tolerance uses |g' d_beta| < tol (same as the R version's df_beta criterion).

Roadmap (post v0.1.0)

Deferred to keep the first release minimal; pull requests welcome.

  • Formula API (patsy / formulaic) -- a from_formula("y ~ x1 + C(category)", chunk_source=...) constructor so users coming from statsmodels.GLM.from_formula(...) or R's glm(y ~ ...) don't have to build the design matrix themselves. The streaming twist: category levels must be discovered before fitting, so the API will require either a first-pass discover_levels(source) step or an explicit levels={...} argument. First-chunk-lock-in (reject any chunk introducing a new level, with a clear error) is the most likely default. Workaround today: use patsy.dmatrices(...) per-chunk and feed the resulting arrays to partial_fit / fit_streaming.
  • Gamma + inverse-Gaussian families -- the algorithm generalises (any exponential-dispersion family with a known link works), but the test suite only covers gaussian / binomial / poisson. Adding a family is ~10 LOC of weight + mu functions in _irls.py plus a test case.
  • Optional Cholesky -> Givens-QR path -- the current code uses scipy.linalg.cho_factor on X' W X + sum_prior_Fisher. For ill-conditioned designs a Givens-QR path on [W^{1/2} X ; previous-R] would be more numerically stable; the bench's R reference (biglm) already does this and our coefficients agree to ~1e-6, suggesting the Cholesky path is sufficient for typical inputs. Track this if a real user hits a conditioning problem.

Credit

License

GPL-2.0-or-later, matching the license of the original R package.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

renew_glm-0.1.1.tar.gz (1.1 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

renew_glm-0.1.1-py3-none-any.whl (15.3 kB view details)

Uploaded Python 3

File details

Details for the file renew_glm-0.1.1.tar.gz.

File metadata

  • Download URL: renew_glm-0.1.1.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for renew_glm-0.1.1.tar.gz
Algorithm Hash digest
SHA256 3cb70c4c83c3f68a5592148461387af775a9820cc9a3d4b2bbdb582c669d7a36
MD5 98f3361b4c254473faf6addacd04cc76
BLAKE2b-256 49a7db3c7d65f1488e150dc98dbafb530382ae48c766ae18686fb5bbe08acd4f

See more details on using hashes here.

File details

Details for the file renew_glm-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: renew_glm-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 15.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for renew_glm-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d30c2502b18f83b1e639edc00392af62c071cd2d5354718a36af85fd54bd66d9
MD5 53dba654ee0480e9d2a9e39ebef22a56
BLAKE2b-256 16eb138c98766ae04dcc2f7d944ba602b153971020114c92290319c7d84ab9df

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page