Skip to main content

An iterative batch OLS regression

Project description

Batch Ordinary Least Squares regression

An OLS regression that allows you to iterate over your training data in batches. Useful when a normal implementation of linear regression does not fit into memory as this library is considerably more memory efficient than the standard implementation. Expects a vector of your dependent variable y as well as a column-ordered design matrix with your independent variables X. X needs to have the same shape for each iteration/update. Does not calculate intercepts, i.e. data has to be already centered or you have to add a dummy column to your data. Naturally supports multi-processing as the heavy lifting is done with numpy. Inspired by the answer of Chris Taylor on Stackoverlfow.

Installation

The library can be installed straight from PyPI.

pip install bols

The only dependencies are numpy and scipy and the library should work with all Python versions >= 3.6.

Usage

First generate some data.

>>> import numpy as np

>>> data_y = np.random.random_sample((15000,))
>>> data_x0 = np.random.random_sample((15000,))
>>> data_x1 = np.random.random_sample((15000,))
>>> data = np.column_stack((data_y, data_x0, data_x1))
>>> y_a = data[0:5000, 0]
>>> y_b = data[5000:10000, 0]
>>> y_c = data[10000:15000, 0]
>>> data_a = data[0:5000, 1:]
>>> data_b = data[5000:10000, 1:]
>>> data_c = data[10000:15000, 1:]

Then you can just fit a model. You need to pass an iterable of both your dependent and independent variables or in other words an iterable over your batches. The only limitation is that batches need to be of the same size.

>>> from bols import BOLS
>>> model = BOLS()
>>> model.batch([y_a, y_b], [data_a, data_b]) 

We can then also use the fitted model to predict unseen data.

>>> model.predict(data_c)
array([0.27206   , 0.42766053, 0.63881539, ..., 0.39375078, 0.44824941,
       0.4866372 ])

Alternatively, we can also update our model with new batches in the future.

>>> model.batch([y_c], [data_c])

We can also get a bunch of useful statistics about the regression with model.get_statistics(verbose=True) where verbose determines whether the method just returns the statistics or prints them as well.

>>> model.get_statistics(verbose=True)
OLS Regression Results

F:    13564.635
P>|F|:      0.0

  Variable    Coef.    Standard Error       t    P>|t|
----------  -------  ----------------  ------  -------
         0    0.429             0.007  58.225    0.000
         1    0.433             0.007  58.926    0.000

result(names=[0, 1], F=13564.634855542236, F_p_value=0.0, R2=0.6439835739923647, RMSE=0.3463917044171118, beta=array([0.42915076, 0.43304183]), se_beta=array([0.00737053, 0.00734891]), beta_p_value=array([0., 0.]))

Even though our data is purely random the regression and the coefficients are both statistically significant. It is up to the user to make sure linear regression is an appropriate model for the data by for example examining the residuals (model.errors).

Tests

The package is tested against both the implementations of linear regressions by sklearn and statsmodels. Those two packages thus become additional dependencies for running the tests.

Development

Using nix-shell default.nix drops you in a development shell with all dependencies already installed.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bols-0.0.1.tar.gz (6.5 kB view details)

Uploaded Source

Built Distribution

bols-0.0.1-py3-none-any.whl (7.1 kB view details)

Uploaded Python 3

File details

Details for the file bols-0.0.1.tar.gz.

File metadata

  • Download URL: bols-0.0.1.tar.gz
  • Upload date:
  • Size: 6.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.9

File hashes

Hashes for bols-0.0.1.tar.gz
Algorithm Hash digest
SHA256 ab25ba811ba1100798d0862a3f98b4256655fa28279b4b786ac3f8c956e68b41
MD5 ea08135804badc963db19a038731b726
BLAKE2b-256 904ef3f9742bfcc8f2f1c94e030dad23ca17b3a2a28277ed23f64599ddc81690

See more details on using hashes here.

File details

Details for the file bols-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: bols-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 7.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.9

File hashes

Hashes for bols-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 327c8dad408e5d98b5508cdab5f5cb5e3270dc5f726414ec20346ea41fc6b7d7
MD5 37103b56c938aded9a83bf92e7da8545
BLAKE2b-256 81bfc6f4836611ed3b3632668245ed2f8f461e7ddca7e3ce0ee493626fe1405e

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page