Skip to main content

No project description provided

Project description

Polars for Data Science

Discord | Documentation | User Guide | Want to Contribute?
pip install polars-ds

The Project

PDS is a modern take on data science and traditional tabular machine learning. It is dataframe-centric in design, and provides parallelism for free via Polars. It offers Polars syntax that works both in normal and aggregation contexts, and provides these conveniences to the end user without any additional dependency. It includes the most common functions from NumPy, SciPy, edit distances, KNN-related queries, EDA tools, feature engineering queries, etc. Yes, it only depends on Polars (unless you want to use the plotting functionalities and want to interop with NumPy). Most of the code is rewritten in Rust and is on par or even faster than existing functions in SciPy and Scikit-learn. The following are some examples:

Parallel evaluations of classification metrics on segments

import polars as pl
import polars_ds as pds

df.lazy().group_by("segments").agg( 
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()

shape: (2, 3)
┌──────────┬──────────┬──────────┐
 segments  roc_auc   log_loss 
 ---       ---       ---      
 str       f64       f64      
╞══════════╪══════════╪══════════╡
 a         0.497745  1.006438 
 b         0.498801  0.997226 
└──────────┴──────────┴──────────┘

Get all neighbors within radius r, call them best friends, and count the number

df.select(
    pl.col("id"),
    pds.query_radius_ptwise(
        pl.col("var1"), pl.col("var2"), pl.col("var3"), # Columns used as the coordinates in 3d space
        index = pl.col("id"),
        r = 0.1, 
        dist = "sql2", # squared l2
        parallel = True
    ).alias("best friends"),
).with_columns( # -1 to remove the point itself
    (pl.col("best friends").list.len() - 1).alias("best friends count")
).head()

shape: (5, 3)
┌─────┬───────────────────┬────────────────────┐
 id   best friends       best friends count 
 ---  ---                ---                
 u32  list[u32]          u32                
╞═════╪═══════════════════╪════════════════════╡
 0    [0, 811,  1435]   152                
 1    [1, 953,  1723]   159                
 2    [2, 355,  835]    243                
 3    [3, 102,  1129]   110                
 4    [4, 1280,  1543]  226                
└─────┴───────────────────┴────────────────────┘

Ridge Regression on Categories

df = pds.random_data(size=5_000, n_cols=0).select(
    pds.random(0.0, 1.0).alias("x1"),
    pds.random(0.0, 1.0).alias("x2"),
    pds.random(0.0, 1.0).alias("x3"),
    pds.random_int(0, 3).alias("categories")
).with_columns(
    y = pl.col("x1") * 0.5 + pl.col("x2") * 0.25 - pl.col("x3") * 0.15 + pds.random() * 0.0001
)

df.group_by("categories").agg(
    pds.query_lstsq(
        "x1", "x2", "x3", 
        target = "y",
        method = "l2",
        l2_reg = 0.05,
        add_bias = False
    ).alias("coeffs")
) 

shape: (3, 2)
┌────────────┬─────────────────────────────────┐
 categories  coeffs                          
 ---         ---                             
 i32         list[f64]                       
╞════════════╪═════════════════════════════════╡
 0           [0.499912, 0.250005, -0.149846 
 1           [0.499922, 0.250004, -0.149856 
 2           [0.499923, 0.250004, -0.149855 
└────────────┴─────────────────────────────────┘

Various String Edit distances

df.select( # Column "word", compared to string in pl.lit(). It also supports column vs column comparison
    pds.str_leven("word", pl.lit("asasasa"), return_sim=True).alias("Levenshtein"),
    pds.str_osa("word", pl.lit("apples"), return_sim=True).alias("Optimal String Alignment"),
    pds.str_jw("word", pl.lit("apples")).alias("Jaro-Winkler"),
)

In-dataframe statistical tests

df.group_by("market_id").agg(
    pds.query_ttest_ind("var1", "var2", equal_var=False).alias("t-test"),
    pds.query_chi2("category_1", "category_2").alias("chi2-test"),
    pds.query_f_test("var1", group = "category_1").alias("f-test")
)

shape: (3, 4)
┌───────────┬──────────────────────┬──────────────────────┬─────────────────────┐
 market_id  t-test                chi2-test             f-test              
 ---        ---                   ---                   ---                 
 i64        struct[2]             struct[2]             struct[2]           
╞═══════════╪══════════════════════╪══════════════════════╪═════════════════════╡
 0          {2.072749,0.038272}   {33.487634,0.588673}  {0.312367,0.869842} 
 1          {0.469946,0.638424}   {42.672477,0.206119}  {2.148937,0.072536} 
 2          {-1.175325,0.239949}  {28.55723,0.806758}   {0.506678,0.730849} 
└───────────┴──────────────────────┴──────────────────────┴─────────────────────┘

Multiple Convolutions at once!

# Multiple Convolutions at once
# Modes: `same`, `left` (left-aligned same), `right` (right-aligned same), `valid` or `full`
# Method: `fft`, `direct`
# Currently slower than SciPy but provides parallelism because of Polars
df.select(
    pds.convolve("f", [-1, 0, 0, 0, 1], mode = "full", method = "fft"), # column f with the kernel given here
    pds.convolve("a", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
    pds.convolve("b", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
).head()

Tabular Machine Learning Data Transformation Pipeline

import polars as pl
import polars.selectors as cs
from polars_ds.pipeline import Pipeline, Blueprint

bp = (
    # If we specify a target, then target will be excluded from any transformations.
    Blueprint(df, name = "example", target = "approved") 
    .lowercase() # lowercase all columns
    .select(cs.numeric() | cs.by_name(["gender", "employer_category1", "city_category"]))
    # Impute loan_period by running a simple linear regression. 
    # Explicitly put target, since this is not the target for prediction. 
    .linear_impute(features = ["var1", "existing_emi"], target = "loan_period") 
    .impute(["existing_emi"], method = "median")
    .append_expr( # generate some features
        pl.col("existing_emi").log1p().alias("existing_emi_log1p"),
        pl.col("loan_amount").log1p().alias("loan_amount_log1p"),
        pl.col("loan_amount").sqrt().alias("loan_amount_sqrt"),
        pl.col("loan_amount").shift(-1).alias("loan_amount_lag_1") # any kind of lag transform
    )
    .scale( # target is numerical, but will be excluded automatically because bp is initialzied with a target
        cs.numeric().exclude(["var1", "existing_emi_log1p"]), method = "standard"
    ) # Scale the columns up to this point. The columns below won't be scaled
    .append_expr(
        # Add missing flags
        pl.col("employer_category1").is_null().cast(pl.UInt8).alias("employer_category1_is_missing")
    )
    .one_hot_encode("gender", drop_first=True)
    .woe_encode("city_category") # No need to specify target because we initialized bp with a target
    .target_encode("employer_category1", min_samples_leaf = 20, smoothing = 10.0) # same as above
)

pipe:Pipeline = bp.materialize()
# Check out the result in our example notebooks!
df_transformed = pipe.transform(df)
df_transformed.head()

And more!

Getting Started

import polars_ds as pds

To make full use of the Diagnosis module, do

pip install "polars_ds[plot]"

More Examples

See this for Polars Extensions: notebook

See this for Native Polars DataFrame Explorative tools: notebook

HELP WANTED!

  1. Documentation writing, Doc Review, and Benchmark preparation

Road Map

  1. Standalone KNN and linear regression module.
  2. K-means, K-medoids clustering as expressions and also standalone modules.
  3. Other.

Disclaimer

Currently in Beta. Feel free to submit feature requests in the issues section of the repo. This library will only depend on python Polars and will try to be as stable as possible for polars>=0.20.6. Exceptions will be made when Polars's update forces changes in the plugins.

This package is not tested with Polars streaming mode and is not designed to work with data so big that has to be streamed.

Credits

  1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See here
  2. Some statistics functions are taken from Statrs (MIT) and internalized. See here
  3. Graph functionalities are powered by the petgragh crate. See here
  4. Linear algebra routines are powered partly by faer

Other related Projects

  1. Take a look at our friendly neighbor functime
  2. String similarity metrics is soooo fast and easy to use because of RapidFuzz

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

polars_ds-0.5.3.tar.gz (2.1 MB view details)

Uploaded Source

Built Distributions

polars_ds-0.5.3-cp38-abi3-win_amd64.whl (13.9 MB view details)

Uploaded CPython 3.8+ Windows x86-64

polars_ds-0.5.3-cp38-abi3-manylinux_2_24_aarch64.whl (11.3 MB view details)

Uploaded CPython 3.8+ manylinux: glibc 2.24+ ARM64

polars_ds-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.9 MB view details)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ x86-64

polars_ds-0.5.3-cp38-abi3-macosx_11_0_arm64.whl (10.9 MB view details)

Uploaded CPython 3.8+ macOS 11.0+ ARM64

polars_ds-0.5.3-cp38-abi3-macosx_10_12_x86_64.whl (12.3 MB view details)

Uploaded CPython 3.8+ macOS 10.12+ x86-64

File details

Details for the file polars_ds-0.5.3.tar.gz.

File metadata

  • Download URL: polars_ds-0.5.3.tar.gz
  • Upload date:
  • Size: 2.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.7.1

File hashes

Hashes for polars_ds-0.5.3.tar.gz
Algorithm Hash digest
SHA256 b6902f66ef3eca500ca4c9e03476c070173af1c5de38896c0c7ab306cb1b3c58
MD5 1a65503b3a14219be5c851663e4d5757
BLAKE2b-256 bd7f577e08e2c16c6f6ef1daa8ee56d5b5be7818a3f3bf7c4ea6aafcd4acf145

See more details on using hashes here.

File details

Details for the file polars_ds-0.5.3-cp38-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for polars_ds-0.5.3-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 3febeaf63174f5d3bec7737b8cb496feb7d14f8eb5f90edc4d0d2f7d31e3365b
MD5 a14014776e9f766fec3193e159419bdd
BLAKE2b-256 e3a54d4d2ab746d6f2764b7e59ef2f38af706afd87975f89004a03a4d71d415b

See more details on using hashes here.

File details

Details for the file polars_ds-0.5.3-cp38-abi3-manylinux_2_24_aarch64.whl.

File metadata

File hashes

Hashes for polars_ds-0.5.3-cp38-abi3-manylinux_2_24_aarch64.whl
Algorithm Hash digest
SHA256 f8871ea7d85d7e3476c96cc9e656bfb0ab0782d33f35cde79a9e9b5a1fedbe19
MD5 b56533d4ffe662509a02a4603f5e1d62
BLAKE2b-256 6d6cd9497d2a4f311a8cd86a4ef29f57caf6b06c4fa499a02b373373f62c28ff

See more details on using hashes here.

File details

Details for the file polars_ds-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for polars_ds-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e0645437e10a84759d9e85b7eb6a5f7c6909c848d50ae9aec10600378415910e
MD5 9d676df05bb49f50d424c71017387265
BLAKE2b-256 8dd5b2cdfd0e70810dbfeae6bb781a475e3ef903e6674665c4fd2717aa3021f0

See more details on using hashes here.

File details

Details for the file polars_ds-0.5.3-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_ds-0.5.3-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 93aab6bd6f22e03b4f0b2c1606d5dd02711c71a7d34c7e74fd44cb577ae88485
MD5 39d54799e609673100e51327a9b01896
BLAKE2b-256 b322e32d2aa92ba04806a35d208abb5df9f2645e4e90ce49ef19d8038efd1743

See more details on using hashes here.

File details

Details for the file polars_ds-0.5.3-cp38-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for polars_ds-0.5.3-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 443c4ce05a12070d9b5218def57da83bcca82138a041042e5db472f34c8dbfda
MD5 d7f094f0c779079751f729aafa70d8ff
BLAKE2b-256 e2f26026884d7da567f2cda11451ba4909d7e96bf86e44928be2471719acb195

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page