Polars Least Squares Extension

Polars OLS

Least squares extension in Polars

Supports linear model estimation in Polars.

This package provides efficient Rust implementations of common linear regression variants (OLS, WLS, Ridge, Elastic Net, non-negative least squares, recursive least squares) and exposes them as simple Polars expressions that integrate easily into your workflow.

Why?

  1. High Performance: implementations are written in Rust and make use of optimized Rust linear-algebra crates & LAPACK routines. See the Benchmark section.
  2. Polars Integration: avoids unnecessary conversions from lazy to eager mode and to external libraries (e.g. numpy, sklearn) just to run simple linear regressions. Chain least squares formulae like any other expression in Polars.
  3. Efficient Implementations:
    • Numerically stable algorithms are chosen where appropriate (e.g. QR, Cholesky).
    • Flexible model specification allows arbitrary combination of sample weighting, L1/L2 regularization, & non-negativity constraints on parameters.
    • Efficient rank-1 update algorithms used for moving window regressions.
  4. Easy Parallelism: computing OLS predictions in parallel across groups could not be easier: call .over() or group_by just like with any other Polars expression and benefit from full Rust parallelism.
  5. Formula API: supports building models via patsy syntax: y ~ x1 + x2 + x3:x4 -1 (like statsmodels), which is automatically converted to equivalent Polars expressions.
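
To make the formula syntax in item 5 concrete, the sketch below expands the right-hand side of y ~ x1 + x2 + x3:x4 -1 by hand for a single observation. It is plain Python and only illustrates the usual patsy conventions (x3:x4 is an interaction, i.e. a product of the two columns, and -1 drops the intercept). The helper design_row and the field name const are hypothetical and not part of polars-ols.

```python
def design_row(row, add_intercept=False):
    """Expand one observation into the features implied by 'x1 + x2 + x3:x4'."""
    features = {
        "x1": row["x1"],
        "x2": row["x2"],
        "x3:x4": row["x3"] * row["x4"],  # ':' denotes an interaction (product) term
    }
    if add_intercept:  # the '-1' in the formula corresponds to add_intercept=False
        features["const"] = 1.0
    return features


row = {"x1": 0.5, "x2": -1.0, "x3": 2.0, "x4": 3.0}
print(design_row(row))                      # no intercept, as in '... -1'
print(design_row(row, add_intercept=True))  # same features plus a constant column
```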

Installation

First, you need to install Polars. Then run the below to install the polars-ols extension:

pip install polars-ols

API & Examples

Importing polars_ols registers the least_squares namespace provided by this package. You can build models either by specifying Polars expressions (e.g. pl.col(...)) for your targets and features, or by using the formula API (patsy syntax). All models support the following general (optional) arguments:

  • mode - a literal which determines the type of output produced by the model
  • null_policy - a literal which determines how to deal with missing data
  • add_intercept - a boolean specifying if an intercept feature should be added to the features
  • sample_weights - a column or expression providing non-negative weights applied to the samples

The remaining parameters are model specific, for example the alpha penalty parameter used by regularized least squares models.

See below for basic usage examples. Please refer to the tests or demo notebook for detailed examples.

import polars as pl
import polars_ols as pls  # registers 'least_squares' namespace

df = pl.DataFrame({"y": [1.16, -2.16, -1.57, 0.21, 0.22, 1.6, -2.11, -2.92, -0.86, 0.47],
                   "x1": [0.72, -2.43, -0.63, 0.05, -0.07, 0.65, -0.02, -1.64, -0.92, -0.27],
                   "x2": [0.24, 0.18, -0.95, 0.23, 0.44, 1.01, -2.08, -1.36, 0.01, 0.75],
                   "group": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   "weights": [0.34, 0.97, 0.39, 0.8, 0.57, 0.41, 0.19, 0.87, 0.06, 0.34],
                   })

lasso_expr = pl.col("y").least_squares.lasso("x1", "x2", alpha=0.0001, add_intercept=True).over("group")
wls_expr = pls.compute_least_squares_from_formula("y ~ x1 + x2 -1", sample_weights=pl.col("weights"))

predictions = df.with_columns(lasso_expr.round(2).alias("predictions_lasso"),
                              wls_expr.round(2).alias("predictions_wls"))

print(predictions.head(5))
shape: (5, 7)
┌───────┬───────┬───────┬───────┬─────────┬───────────────────┬─────────────────┐
│ y     ┆ x1    ┆ x2    ┆ group ┆ weights ┆ predictions_lasso ┆ predictions_wls │
│ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---     ┆ ---               ┆ ---             │
│ f64   ┆ f64   ┆ f64   ┆ i64   ┆ f64     ┆ f64               ┆ f64             │
╞═══════╪═══════╪═══════╪═══════╪═════════╪═══════════════════╪═════════════════╡
│ 1.16  ┆ 0.72  ┆ 0.24  ┆ 1     ┆ 0.34    ┆ 0.97              ┆ 0.93            │
│ -2.16 ┆ -2.43 ┆ 0.18  ┆ 1     ┆ 0.97    ┆ -2.23             ┆ -2.18           │
│ -1.57 ┆ -0.63 ┆ -0.95 ┆ 1     ┆ 0.39    ┆ -1.54             ┆ -1.54           │
│ 0.21  ┆ 0.05  ┆ 0.23  ┆ 1     ┆ 0.8     ┆ 0.29              ┆ 0.27            │
│ 0.22  ┆ -0.07 ┆ 0.44  ┆ 1     ┆ 0.57    ┆ 0.37              ┆ 0.36            │
└───────┴───────┴───────┴───────┴─────────┴───────────────────┴─────────────────┘

The mode parameter is used to set the type of the output returned by all methods ("predictions", "residuals", "coefficients", "statistics"). It defaults to returning predictions matching the input's length. Note that "statistics" is currently only supported for OLS/WLS/Ridge models.

When mode is set to "coefficients", the output is a Polars Struct with feature names as fields and coefficients as values. Its output shape "broadcasts" depending on context, see below:

coefficients = df.select(pl.col("y").least_squares.from_formula("x1 + x2", mode="coefficients")
                         .alias("coefficients"))

coefficients_group = df.select("group", pl.col("y").least_squares.from_formula("x1 + x2", mode="coefficients").over("group")
                        .alias("coefficients_group")).unique(maintain_order=True)

print(coefficients)
print(coefficients_group)
shape: (1, 1)
┌──────────────────────────────┐
│ coefficients                 │
│ ---                          │
│ struct[3]                    │
╞══════════════════════════════╡
│ {0.977375,0.987413,0.000757} │  # <--- coef for x1, x2, and intercept added by formula API
└──────────────────────────────┘
shape: (2, 2)
┌───────┬───────────────────────────────┐
│ group ┆ coefficients_group            │
│ ---   ┆ ---                           │
│ i64   ┆ struct[3]                     │
╞═══════╪═══════════════════════════════╡
│ 1     ┆ {0.995157,0.977495,0.014344}  │
│ 2     ┆ {0.939217,0.997441,-0.017599} │  # <--- (unique) coefficients per group
└───────┴───────────────────────────────┘

For dynamic models (like rolling_ols), or in a .over, .group_by, or .with_columns context, the coefficients take the shape of the data they are applied to. For example:

coefficients = df.with_columns(pl.col("y").least_squares.rls(pl.col("x1"), pl.col("x2"), mode="coefficients")
                         .over("group").alias("coefficients"))

print(coefficients.head())
shape: (5, 6)
┌───────┬───────┬───────┬───────┬─────────┬─────────────────────┐
│ y     ┆ x1    ┆ x2    ┆ group ┆ weights ┆ coefficients        │
│ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---     ┆ ---                 │
│ f64   ┆ f64   ┆ f64   ┆ i64   ┆ f64     ┆ struct[2]           │
╞═══════╪═══════╪═══════╪═══════╪═════════╪═════════════════════╡
│ 1.16  ┆ 0.72  ┆ 0.24  ┆ 1     ┆ 0.34    ┆ {1.235503,0.411834} │
│ -2.16 ┆ -2.43 ┆ 0.18  ┆ 1     ┆ 0.97    ┆ {0.963515,0.760769} │
│ -1.57 ┆ -0.63 ┆ -0.95 ┆ 1     ┆ 0.39    ┆ {0.975484,0.966029} │
│ 0.21  ┆ 0.05  ┆ 0.23  ┆ 1     ┆ 0.8     ┆ {0.975657,0.953735} │
│ 0.22  ┆ -0.07 ┆ 0.44  ┆ 1     ┆ 0.57    ┆ {0.97898,0.909793}  │
└───────┴───────┴───────┴───────┴─────────┴─────────────────────┘
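
To show why the coefficients change row by row, here is a minimal pure-Python sketch of textbook recursive least squares. It is not the package's Rust implementation. It assumes zero initial coefficients, an initial state covariance of 10 times the identity, and no forgetting; under these assumptions it agrees with the first rows of the table above to about three decimal places. The function name and its parameters are illustrative only.

```python
def recursive_least_squares(xs, ys, initial_covariance=10.0, forgetting=1.0):
    """Return the coefficient vector after each observation is absorbed."""
    k = len(xs[0])
    beta = [0.0] * k
    # State covariance P starts as a scaled identity matrix.
    P = [[initial_covariance if i == j else 0.0 for j in range(k)] for i in range(k)]
    history = []
    for x, y in zip(xs, ys):
        Px = [sum(P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = forgetting + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        error = y - sum(x[i] * beta[i] for i in range(k))  # one-step-ahead residual
        beta = [beta[i] + gain[i] * error for i in range(k)]
        # Rank-1 update of P (P is symmetric), so no matrix inversion is needed.
        P = [[(P[i][j] - gain[i] * Px[j]) / forgetting for j in range(k)]
             for i in range(k)]
        history.append(list(beta))
    return history


# First three rows of group 1 from the example data frame.
xs = [(0.72, 0.24), (-2.43, 0.18), (-0.63, -0.95)]
ys = [1.16, -2.16, -1.57]
rls_history = recursive_least_squares(xs, ys)
for coefficients in rls_history:
    print([round(c, 4) for c in coefficients])
```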

For plain OLS/WLS and Ridge models, support was recently added for producing a simple statistical significance report. It can be used as follows:

statistics = (df.select(
   pl.col("y").least_squares.ols(pl.col("x1", "x2"), mode="statistics", add_intercept=True)
)
.unnest("statistics")  # results stored in a nested series by default
.explode(["feature_names", "coefficients", "standard_errors", "t_values", "p_values"])
)

print(statistics)
shape: (3, 8)
┌─────────┬──────────┬─────────┬──────────────┬──────────────┬─────────────┬───────────┬───────────┐
│ r2      ┆ mae      ┆ mse     ┆ feature_name ┆ coefficients ┆ standard_er ┆ t_values  ┆ p_values  │
│ ---     ┆ ---      ┆ ---     ┆ s            ┆ ---          ┆ rors        ┆ ---       ┆ ---       │
│ f64     ┆ f64      ┆ f64     ┆ ---          ┆ f64          ┆ ---         ┆ f64       ┆ f64       │
│         ┆          ┆         ┆ str          ┆              ┆ f64         ┆           ┆           │
╞═════════╪══════════╪═════════╪══════════════╪══════════════╪═════════════╪═══════════╪═══════════╡
│ 0.99631 ┆ 0.061732 ┆ 0.00794 ┆ x1           ┆ 0.977375     ┆ 0.037286    ┆ 26.212765 ┆ 3.0095e-8 │
│ 0.99631 ┆ 0.061732 ┆ 0.00794 ┆ x2           ┆ 0.987413     ┆ 0.037321    ┆ 26.457169 ┆ 2.8218e-8 │
│ 0.99631 ┆ 0.061732 ┆ 0.00794 ┆ const        ┆ 0.000757     ┆ 0.037474    ┆ 0.02021   ┆ 0.98444   │
└─────────┴──────────┴─────────┴──────────────┴──────────────┴─────────────┴───────────┴───────────┘
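
The columns of this report are related in the usual way: each t-value is the coefficient divided by its standard error. The short check below recomputes the t-values from the numbers printed above, in plain Python. It matches the table only up to the rounding of the displayed values.

```python
coefficients = {"x1": 0.977375, "x2": 0.987413, "const": 0.000757}
standard_errors = {"x1": 0.037286, "x2": 0.037321, "const": 0.037474}

# t-value = estimated coefficient divided by its standard error
t_values = {name: coefficients[name] / standard_errors[name] for name in coefficients}
for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
```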

Finally, for convenience, you can compute out-of-sample predictions with least_squares.{predict, predict_from_formula}. This saves you the effort of un-nesting the coefficients and doing the dot product in Python; it is done in Rust instead, as an expression. Usage is as follows:

df_test.select(pl.col("coefficients_train").least_squares.predict(pl.col("x1"), pl.col("x2")).alias("predictions_test"))
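
For reference, the dot product that predict saves you from writing looks roughly like the plain-Python sketch below. The coefficient values are copied from the earlier OLS output, the two test rows are made up for illustration, and the helper predict_row and the field name const are hypothetical.

```python
coefficients_train = {"x1": 0.977375, "x2": 0.987413, "const": 0.000757}
test_rows = [{"x1": 0.72, "x2": 0.24}, {"x1": -2.43, "x2": 0.18}]


def predict_row(row, coefficients):
    """Dot product of one row's features (plus a constant of 1) with the coefficients."""
    features = dict(row, const=1.0)
    return sum(coefficients[name] * features[name] for name in coefficients)


predictions_test = [predict_row(row, coefficients_train) for row in test_rows]
print([round(p, 4) for p in predictions_test])
```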

Supported Models

Currently, this extension package supports the following variants:

  • Ordinary Least Squares: least_squares.ols
  • Weighted Least Squares: least_squares.wls
  • Regularized Least Squares (Lasso / Ridge / Elastic Net): least_squares.{lasso, ridge, elastic_net}
  • Non-negative Least Squares: least_squares.nnls
  • Multi-target Least Squares: least_squares.multi_target_ols

As well as efficient implementations of moving window models:

  • Recursive Least Squares: least_squares.rls
  • Rolling / Expanding Window OLS: least_squares.{rolling_ols, expanding_ols}

An arbitrary combination of sample_weights, L1/L2 penalties, and non-negativity constraints can be specified with the least_squares.from_formula and least_squares.least_squares entry-points.

Solve Methods

polars-ols offers a choice of several numerical approaches per model (via the solve_method flag), which trade off performance against numerical accuracy. These choices are exposed to the user for full control; if left unspecified, the package chooses a reasonable default depending on context.

For example, if you know you are dealing with highly collinear data and an unregularized OLS model, you may want to explicitly set solve_method="svd" so that the minimum norm solution is obtained.
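
The idea of a minimum norm solution can be illustrated outside the package with NumPy. In the sketch below two features are perfectly collinear, so infinitely many coefficient vectors fit the data exactly; the SVD-based pseudo-inverse picks the one with the smallest Euclidean norm. The variable names are illustrative, and NumPy stands in for the package's own SVD solver here.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, x1])  # two perfectly collinear features
y = 2.0 * x1                   # any (b1, b2) with b1 + b2 == 2 fits exactly

# SVD-based solve: the pseudo-inverse selects the minimum norm solution.
beta_min_norm = np.linalg.pinv(X) @ y

# Another exact solution that puts all the weight on the first feature.
beta_other = np.array([2.0, 0.0])

print(beta_min_norm, np.linalg.norm(beta_min_norm))
print(beta_other, np.linalg.norm(beta_other))
```

Both vectors reproduce y exactly, but the first has the smaller norm because it spreads the weight evenly across the duplicated feature.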

Benchmark

The usual caveats of benchmarks apply here, but the below should still be indicative of the type of performance improvements to expect when using this package.

This benchmark was run on randomly generated data with pyperf on my Apple M2 Max MacBook (32GB RAM, macOS Sonoma 14.2.1). See benchmark.py for the implementation.

n_samples=2_000, n_features=5

Model                   | polars_ols        | Python Benchmark  | Benchmark Type     | Speed-up vs Python Benchmark
------------------------|-------------------|-------------------|--------------------|-----------------------------
Least Squares (QR)      | 195 µs ± 6 µs     | 466 µs ± 104 µs   | Numpy (QR)         | 2.4x
Least Squares (SVD)     | 247 µs ± 5 µs     | 395 µs ± 69 µs    | Numpy (SVD)        | 1.6x
Ridge (Cholesky)        | 171 µs ± 8 µs     | 1.02 ms ± 0.29 ms | Sklearn (Cholesky) | 5.9x
Ridge (SVD)             | 238 µs ± 7 µs     | 1.12 ms ± 0.41 ms | Sklearn (SVD)      | 4.7x
Weighted Least Squares  | 334 µs ± 13 µs    | 2.04 ms ± 0.22 ms | Statsmodels        | 6.1x
Elastic Net (CD)        | 227 µs ± 7 µs     | 1.18 ms ± 0.19 ms | Sklearn            | 5.2x
Recursive Least Squares | 1.12 ms ± 0.23 ms | 18.2 ms ± 1.6 ms  | Statsmodels        | 16.2x
Rolling Least Squares   | 1.99 ms ± 0.03 ms | 22.1 ms ± 0.2 ms  | Statsmodels        | 11.1x

n_samples=10_000, n_features=100

Model                   | polars_ols        | Python Benchmark    | Benchmark Type     | Speed-up vs Python Benchmark
------------------------|-------------------|---------------------|--------------------|-----------------------------
Least Squares (QR)      | 17.6 ms ± 0.3 ms  | 44.4 ms ± 9.3 ms    | Numpy (QR)         | 2.5x
Least Squares (SVD)     | 23.8 ms ± 0.2 ms  | 26.6 ms ± 5.5 ms    | Numpy (SVD)        | 1.1x
Ridge (Cholesky)        | 5.36 ms ± 0.16 ms | 475 ms ± 71 ms      | Sklearn (Cholesky) | 88.7x
Ridge (SVD)             | 30.2 ms ± 0.4 ms  | 400 ms ± 48 ms      | Sklearn (SVD)      | 13.2x
Weighted Least Squares  | 18.8 ms ± 0.3 ms  | 80.4 ms ± 12.4 ms   | Statsmodels        | 4.3x
Elastic Net (CD)        | 22.7 ms ± 0.2 ms  | 138 ms ± 27 ms      | Sklearn            | 6.1x
Recursive Least Squares | 270 ms ± 53 ms    | 57.8 sec ± 43.7 sec | Statsmodels        | 1017.0x
Rolling Least Squares   | 371 ms ± 13 ms    | 4.41 sec ± 0.17 sec | Statsmodels        | 11.9x

  • Numpy's lstsq (which uses divide-and-conquer SVD) is already a highly optimized call into LAPACK, so the scope for speed-up is relatively limited. The same applies to simple approaches such as directly solving the normal equations with Cholesky.
  • Even so, for matching numerical algorithms the polars-ols Rust implementations tend to be roughly 2-3x faster on such problems.
  • More substantial speed-ups are achieved for the more complex models by working entirely in Rust and avoiding the overhead of going back and forth into Python.
  • Expect a further, order-of-magnitude relative speed-up if your workflow involves repeated re-estimation of models in (Python) loops.

Credits & Related Projects

  • Rust linear algebra libraries faer and ndarray support the implementations provided by this extension package
  • This package was templated around the very helpful: polars-plugin-tutorial
  • The python package patsy is used for (optionally) building models from formulae
  • Please check out the extension package polars-ds for general data-science functionality in polars

Future Work / TODOs

  • Support generic types in the Rust implementations so that both f32 and f64 types are recognized. Right now data is cast to f64 prior to estimation.
  • Add docs explaining supported models, signatures, and API

