Streaming generalized linear models with bounded memory.
Project description
renew-glm
Streaming generalized linear models with bounded memory -- a Python port of the renewable-estimation algorithm of Luo & Song (2020).
from renew_glm import RenewGLM
# Streaming (recommended): one chunk at a time, never stores prior chunks.
# Peak RAM is O(p^2 + chunk_size * p), independent of n.
def chunk_fn():
for X_chunk, y_chunk in source_iter(): # your generator
yield X_chunk, y_chunk # X must include intercept column
model = RenewGLM(family="binomial").fit_streaming(chunk_fn)
print(model.coef_, model.n_iter_)
# Or chunk-buffered (matches the original R API):
model = RenewGLM(family="poisson")
for X_chunk, y_chunk in chunks:
model.partial_fit(X_chunk, y_chunk)
model.fit()
print(model.coef_, model.se_, model.pvalue_)
chunk_fn is a zero-argument callable returning an iterator of (X, y)
tuples. fit_streaming consumes the iterator once and discards each chunk
after use; partial_fit + fit buffers all chunks in memory so it can
compute se_ and pvalue_ like the R reference.
Supports gaussian, binomial, and poisson families. Coefficients converge to
the maximum-likelihood point of the full data; agreement with
statsmodels.GLM is verified to ~1e-3 on the test suite.
Why
Standard in-memory tools (statsmodels.GLM, R's glm()) load the
entire design matrix before fitting. At n = 16 M rows, statsmodels.GLM
allocates ~8 GB and OOMs above n ~ 20 M on a 16 GB laptop. This package
fits the same model in bounded memory -- one chunk at a time,
O(p^2) state regardless of n.
Other Python GLM options exist with different trade-offs:
glum(Quantco) -- fast in-memory; bit-identical to the closed-form MLE we get, but caps out where the full design matrix fits in RAM, same as statsmodels. Recommended if yournfits.dask-glm-- distributed; useful at multi-machine scale, has scheduler overhead at single-host scale.pyglmnet-- gradient-based, regularization-focused; supports several families but the unregularized path is slower than IRLS.scikit-learnSGD -- online SGD, approximate (not exact MLE).
This package targets the gap: exact-MLE streaming on a single host,
no full-design-matrix RAM. It carries the same Wald inference the
R reference (biglm, RenewGLM_pkg) exposes.
Install
pip install renew-glm
Requires only NumPy and SciPy. Pure Python -- no C extension, no compile step, no platform-specific wheels.
Correctness
Across seven independent implementations spanning Python (NumPy / Numba
/ JAX), an in-memory Python competitor (glum), a distributed Python
competitor (dask-glm), and a cross-language reference (R's biglm), the
coefficient estimates agree to floating-point machine epsilon
(~1e-15) on the largest workload we test. The heatmap inset in the
benchmark figure above shows pairwise log10(max|beta_i - beta_j|):
renew-glm hits FP epsilon (-15) against the streaming-Cholesky
cluster; in-memory in-memory references (glum, statsmodels,
glm (R)) sit one order out at -10/-11 (algorithm-decomposition
path, not convergence). The JAX backend uses jax_enable_x64=True
-- the silent-float32 default would land an order of magnitude
looser.
Scaling
The right-side line panels in the benchmark figure show wall time
and peak RAM as n grows from 1M to 128M rows (cold runs). The
in-memory baselines (statsmodels, glm (R)) terminate at the
OOM cliff (~9 GB); renew-glm stays flat at bounded RAM.
Algorithm
For each chunk:
- Compute Fisher information
H_b = X' diag(W_b) Xat the current coefficient. - Inner Newton-Raphson against
H_b + sum_of_prior_Fishers, with a penalty term that pulls toward the previous coefficient. - Update
sum_of_prior_Fishers += H_band move to the next chunk.
One outer pass over the chunks suffices because the penalty term lets each
chunk contribute meaningfully without revisiting prior data. The docstring
in _irls.py matches the paper notation.
Differences from the original R package
- No SE in the streaming path.
fit_streaming(chunk_fn)returns onlycoef_andn_iter_. The chunk-bufferedpartial_fit+fit()path computesse_andpvalue_like the R version. - Pure NumPy/SciPy. No C extension; portable and easy to install.
- Convergence tolerance uses
|g' d_beta| < tol(same as the R version'sdf_betacriterion).
Roadmap (post v0.1.0)
Deferred to keep the first release minimal; pull requests welcome.
- Formula API (
patsy/formulaic) -- afrom_formula("y ~ x1 + C(category)", chunk_source=...)constructor so users coming fromstatsmodels.GLM.from_formula(...)or R'sglm(y ~ ...)don't have to build the design matrix themselves. The streaming twist: category levels must be discovered before fitting, so the API will require either a first-passdiscover_levels(source)step or an explicitlevels={...}argument. First-chunk-lock-in (reject any chunk introducing a new level, with a clear error) is the most likely default. Workaround today: usepatsy.dmatrices(...)per-chunk and feed the resulting arrays topartial_fit/fit_streaming. - Gamma + inverse-Gaussian families -- the algorithm generalises
(any exponential-dispersion family with a known link works), but
the test suite only covers gaussian / binomial / poisson. Adding a
family is ~10 LOC of weight + mu functions in
_irls.pyplus a test case. - Optional Cholesky -> Givens-QR path -- the current code uses
scipy.linalg.cho_factoronX' W X + sum_prior_Fisher. For ill-conditioned designs a Givens-QR path on[W^{1/2} X ; previous-R]would be more numerically stable; the bench's R reference (biglm) already does this and our coefficients agree to ~1e-6, suggesting the Cholesky path is sufficient for typical inputs. Track this if a real user hits a conditioning problem.
Credit
-
Algorithm by Lan Luo and Peter X.-K. Song. Please cite the original paper when using this package:
Luo L. & Song R. (2020). "Renewable Estimation and Incremental Inference in Generalized Linear Models with Streaming Data Sets." Journal of the Royal Statistical Society: Series B, 82(1), 69-97. https://doi.org/10.1111/rssb.12352
-
Reference R implementation by Luo & Song: https://github.com/luolsph/RenewGLM_pkg
-
Python port by Tommy Carstensen. ORCID: https://orcid.org/0000-0002-3672-9931
License
GPL-2.0-or-later, matching the license of the original R package.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file renew_glm-0.1.1.tar.gz.
File metadata
- Download URL: renew_glm-0.1.1.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3cb70c4c83c3f68a5592148461387af775a9820cc9a3d4b2bbdb582c669d7a36
|
|
| MD5 |
98f3361b4c254473faf6addacd04cc76
|
|
| BLAKE2b-256 |
49a7db3c7d65f1488e150dc98dbafb530382ae48c766ae18686fb5bbe08acd4f
|
File details
Details for the file renew_glm-0.1.1-py3-none-any.whl.
File metadata
- Download URL: renew_glm-0.1.1-py3-none-any.whl
- Upload date:
- Size: 15.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d30c2502b18f83b1e639edc00392af62c071cd2d5354718a36af85fd54bd66d9
|
|
| MD5 |
53dba654ee0480e9d2a9e39ebef22a56
|
|
| BLAKE2b-256 |
16eb138c98766ae04dcc2f7d944ba602b153971020114c92290319c7d84ab9df
|