Polars for Data Science

Discord | Documentation | User Guide | Want to Contribute?
pip install polars-ds

The Project

PDS is a modern take on data science and traditional tabular machine learning. It is dataframe-centric in design and gets parallelism for free via Polars. Its expressions use Polars syntax and work in both normal and aggregation contexts, with no additional dependency for the end user. It covers the most common functions from NumPy and SciPy, plus edit distances, KNN-related queries, EDA tools, feature engineering queries, and more. Yes, it only depends on Polars (unless you want to use the plotting functionality or interop with NumPy). Most of the code is rewritten in Rust and is on par with, or even faster than, the existing functions in SciPy and Scikit-learn. The following are some examples:

Parallel evaluations of classification metrics on segments

import polars as pl
import polars_ds as pds

df.lazy().group_by("segments").agg( 
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()

shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ segments ┆ roc_auc  ┆ log_loss │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ a        ┆ 0.497745 ┆ 1.006438 │
│ b        ┆ 0.498801 ┆ 0.997226 │
└──────────┴──────────┴──────────┘
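To make concrete what the two per-segment numbers mean, here is a minimal pure-Python sketch of ROC AUC (via the rank-sum identity) and binary log loss, grouped by segment. It is not the PDS implementation, which runs in Rust inside Polars; the function names `roc_auc`, `log_loss`, and `metrics_by_segment` and the toy rows are assumptions made for illustration.

```python
import math
from collections import defaultdict

def roc_auc(actual, predicted):
    """ROC AUC from the rank-sum (Mann-Whitney U) identity; ties share an average rank."""
    pairs = sorted(zip(predicted, actual))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, a in pairs if a == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined with a single class
    rank_sum_pos = sum(r for r, (_, a) in zip(ranks, pairs) if a == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def log_loss(actual, predicted, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for a, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)
        total -= a * math.log(p) + (1 - a) * math.log(1 - p)
    return total / len(actual)

def metrics_by_segment(rows):
    """rows: iterable of (segment, actual, predicted) tuples."""
    groups = defaultdict(lambda: ([], []))
    for segment, a, p in rows:
        groups[segment][0].append(a)
        groups[segment][1].append(p)
    return {
        seg: {"roc_auc": roc_auc(a, p), "log_loss": log_loss(a, p)}
        for seg, (a, p) in groups.items()
    }

# Hypothetical toy data: (segment, actual label, predicted probability)
rows = [
    ("a", 0, 0.1), ("a", 1, 0.9), ("a", 0, 0.4), ("a", 1, 0.35),
    ("b", 1, 0.8), ("b", 0, 0.2),
]
result = metrics_by_segment(rows)
```

In the PDS query above, the same grouping is expressed with `group_by("segments").agg(...)`, so Polars can evaluate the segments in parallel.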

Get all neighbors within radius r, call them best friends, and count them

df.select(
    pl.col("id"),
    pds.query_radius_ptwise(
        pl.col("var1"), pl.col("var2"), pl.col("var3"), # Columns used as the coordinates in 3d space
        index = pl.col("id"),
        r = 0.1, 
        dist = "sql2", # squared l2
        parallel = True
    ).alias("best friends"),
).with_columns( # -1 to remove the point itself
    (pl.col("best friends").list.len() - 1).alias("best friends count")
).head()

shape: (5, 3)
┌─────┬───────────────────┬────────────────────┐
│ id  ┆ best friends      ┆ best friends count │
│ --- ┆ ---               ┆ ---                │
│ u32 ┆ list[u32]         ┆ u32                │
╞═════╪═══════════════════╪════════════════════╡
│ 0   ┆ [0, 811, … 1435]  ┆ 152                │
│ 1   ┆ [1, 953, … 1723]  ┆ 159                │
│ 2   ┆ [2, 355, … 835]   ┆ 243                │
│ 3   ┆ [3, 102, … 1129]  ┆ 110                │
│ 4   ┆ [4, 1280, … 1543] ┆ 226                │
└─────┴───────────────────┴────────────────────┘
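As a rough mental model of the radius query, the brute-force sketch below returns, for every point, the ids of all points within the radius, and then subtracts one to drop the point itself. It assumes the threshold `r` is compared directly against the squared L2 distance and returns neighbors in id order; both are assumptions of this sketch, not statements about how PDS does it internally. The names `squared_l2` and `radius_neighbors` are hypothetical.

```python
def squared_l2(p, q):
    """Squared Euclidean distance between two equal-length coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def radius_neighbors(points, ids, r):
    """For each point, list the ids of all points with squared L2 distance <= r.

    Brute force, O(n^2); every point is always its own neighbor.
    """
    return [
        [i for i, q in zip(ids, points) if squared_l2(p, q) <= r]
        for p in points
    ]

# Hypothetical 3d coordinates standing in for (var1, var2, var3)
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 1.0, 1.0)]
ids = [0, 1, 2]

best_friends = radius_neighbors(points, ids, r=0.1)
# -1 removes the point itself, mirroring the query above
best_friends_count = [len(friends) - 1 for friends in best_friends]
```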

Ridge Regression on Categories

df = pds.random_data(size=5_000, n_cols=0).select(
    pds.random(0.0, 1.0).alias("x1"),
    pds.random(0.0, 1.0).alias("x2"),
    pds.random(0.0, 1.0).alias("x3"),
    pds.random_int(0, 3).alias("categories")
).with_columns(
    y = pl.col("x1") * 0.5 + pl.col("x2") * 0.25 - pl.col("x3") * 0.15 + pds.random() * 0.0001
)

df.group_by("categories").agg(
    pds.query_lstsq(
        "x1", "x2", "x3", 
        target = "y",
        method = "l2",
        l2_reg = 0.05,
        add_bias = False
    ).alias("coeffs")
) 

shape: (3, 2)
┌────────────┬─────────────────────────────────┐
│ categories ┆ coeffs                          │
│ ---        ┆ ---                             │
│ i32        ┆ list[f64]                       │
╞════════════╪═════════════════════════════════╡
│ 0          ┆ [0.499912, 0.250005, -0.149846… │
│ 1          ┆ [0.499922, 0.250004, -0.149856… │
│ 2          ┆ [0.499923, 0.250004, -0.149855… │
└────────────┴─────────────────────────────────┘
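For readers who want to see what a per-category ridge fit without a bias term computes, here is a small pure-Python sketch that solves the normal equations (XᵀX + λI)β = Xᵀy for each category. It is only an illustration: the helper `ridge_fit`, the synthetic data, and the way `l2_reg` enters the equations are assumptions, and PDS may scale the penalty or solve the system differently.

```python
import random

def ridge_fit(X, y, l2_reg=0.0):
    """Solve (X^T X + l2_reg * I) beta = X^T y by Gaussian elimination (no bias term)."""
    k = len(X[0])
    # Augmented matrix [X^T X + l2_reg * I | X^T y]
    M = [
        [sum(row[i] * row[j] for row in X) + (l2_reg if i == j else 0.0) for j in range(k)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - tail) / M[i][i]
    return beta

# Synthetic, noise-free data with the same true coefficients as the example above
rng = random.Random(0)
by_category = {}
for _ in range(300):
    x = [rng.random(), rng.random(), rng.random()]
    target = 0.5 * x[0] + 0.25 * x[1] - 0.15 * x[2]
    X, Y = by_category.setdefault(rng.randrange(3), ([], []))
    X.append(x)
    Y.append(target)

# One coefficient vector per category, like the "coeffs" list column
coeffs = {cat: ridge_fit(X, Y, l2_reg=0.05) for cat, (X, Y) in by_category.items()}
unpenalized = {cat: ridge_fit(X, Y) for cat, (X, Y) in by_category.items()}
```

With a small penalty the estimates sit slightly below the true values in magnitude, which is consistent with coefficients such as 0.4999 rather than exactly 0.5 in the output above.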

Various String Edit distances

df.select( # Column "word", compared to string in pl.lit(). It also supports column vs column comparison
    pds.str_leven("word", pl.lit("asasasa"), return_sim=True).alias("Levenshtein"),
    pds.str_osa("word", pl.lit("apples"), return_sim=True).alias("Optimal String Alignment"),
    pds.str_jw("word", pl.lit("apples")).alias("Jaro-Winkler"),
)
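As a reference for what these distances measure, the sketch below implements plain Levenshtein distance with the standard two-row dynamic program, plus a normalized similarity. The normalization (one minus distance divided by the longer length) is an assumption of this sketch; how `return_sim=True` normalizes in PDS is not stated here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Compare a column of words against one literal, as in the query above
words = ["apples", "apple", "maple"]
similarities = [levenshtein_similarity(w, "apples") for w in words]
```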

In-dataframe statistical tests

df.group_by("market_id").agg(
    pds.query_ttest_ind("var1", "var2", equal_var=False).alias("t-test"),
    pds.query_chi2("category_1", "category_2").alias("chi2-test"),
    pds.query_f_test("var1", group = "category_1").alias("f-test")
)

shape: (3, 4)
┌───────────┬──────────────────────┬──────────────────────┬─────────────────────┐
│ market_id ┆ t-test               ┆ chi2-test            ┆ f-test              │
│ ---       ┆ ---                  ┆ ---                  ┆ ---                 │
│ i64       ┆ struct[2]            ┆ struct[2]            ┆ struct[2]           │
╞═══════════╪══════════════════════╪══════════════════════╪═════════════════════╡
│ 0         ┆ {2.072749,0.038272}  ┆ {33.487634,0.588673} ┆ {0.312367,0.869842} │
│ 1         ┆ {0.469946,0.638424}  ┆ {42.672477,0.206119} ┆ {2.148937,0.072536} │
│ 2         ┆ {-1.175325,0.239949} ┆ {28.55723,0.806758}  ┆ {0.506678,0.730849} │
└───────────┴──────────────────────┴──────────────────────┴─────────────────────┘
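Each struct above appears to pair a test statistic with a p-value. As a sketch of the first of those numbers for the unequal-variance case (`equal_var=False`), the code below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The p-value is left out because it needs the Student t CDF, which the standard library does not provide. The function name `welch_t` and the sample data are assumptions for illustration.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    # Squared standard errors of each sample mean (sample variance / n)
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

# Hypothetical samples standing in for var1 and var2 within one market
var1 = [1.0, 2.0, 3.0, 4.0, 5.0]
var2 = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(var1, var2)
```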

Multiple Convolutions at once!

# Multiple Convolutions at once
# Modes: `same`, `left` (left-aligned same), `right` (right-aligned same), `valid` or `full`
# Method: `fft`, `direct`
# Currently slower than SciPy but provides parallelism because of Polars
df.select(
    pds.convolve("f", [-1, 0, 0, 0, 1], mode = "full", method = "fft"), # column f with the kernel given here
    pds.convolve("a", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
    pds.convolve("b", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
).head()
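To show what the `direct` method and the `full`, `same`, and `valid` modes compute, here is a minimal pure-Python convolution. It assumes the signal is at least as long as the kernel and follows the common NumPy/SciPy convention for centering `same`; the `left` and `right` modes are not covered. The function name `convolve_direct` is hypothetical, and this is not the PDS implementation.

```python
def convolve_direct(signal, kernel, mode="full"):
    """Direct (O(n*m)) discrete convolution; assumes len(signal) >= len(kernel)."""
    n, m = len(signal), len(kernel)
    full = [0.0] * (n + m - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            full[i + j] += s * k
    if mode == "full":
        return full                  # every overlap, length n + m - 1
    if mode == "same":
        start = (m - 1) // 2
        return full[start:start + n]  # centered, length n
    if mode == "valid":
        return full[m - 1:n]         # complete overlaps only, length n - m + 1
    raise ValueError(f"unsupported mode: {mode!r}")

# With the kernel from the example above, output[t] = signal[t - 4] - signal[t]
# wherever both terms exist, i.e. a negated 4-step difference.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [-1, 0, 0, 0, 1]
full = convolve_direct(signal, kernel, mode="full")
valid = convolve_direct(signal, kernel, mode="valid")
```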

Tabular Machine Learning Data Transformation Pipeline

import polars as pl
import polars.selectors as cs
from polars_ds.pipeline import Pipeline, Blueprint

bp = (
    # If we specify a target, then target will be excluded from any transformations.
    Blueprint(df, name = "example", target = "approved") 
    .lowercase() # lowercase all columns
    .select(cs.numeric() | cs.by_name(["gender", "employer_category1", "city_category"]))
    # Impute loan_period by running a simple linear regression. 
    # Explicitly put target, since this is not the target for prediction. 
    .linear_impute(features = ["var1", "existing_emi"], target = "loan_period") 
    .impute(["existing_emi"], method = "median")
    .append_expr( # generate some features
        pl.col("existing_emi").log1p().alias("existing_emi_log1p"),
        pl.col("loan_amount").log1p().alias("loan_amount_log1p"),
        pl.col("loan_amount").sqrt().alias("loan_amount_sqrt"),
        pl.col("loan_amount").shift(-1).alias("loan_amount_lag_1") # any kind of lag transform
    )
    .scale( # target is numerical, but will be excluded automatically because bp is initialized with a target
        cs.numeric().exclude(["var1", "existing_emi_log1p"]), method = "standard"
    ) # Scale the columns up to this point. The columns below won't be scaled
    .append_expr(
        # Add missing flags
        pl.col("employer_category1").is_null().cast(pl.UInt8).alias("employer_category1_is_missing")
    )
    .one_hot_encode("gender", drop_first=True)
    .woe_encode("city_category") # No need to specify target because we initialized bp with a target
    .target_encode("employer_category1", min_samples_leaf = 20, smoothing = 10.0) # same as above
)

pipe: Pipeline = bp.materialize()
# Check out the result in our example notebooks!
df_transformed = pipe.transform(df)
df_transformed.head()
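The blueprint/pipeline split follows the usual fit-then-transform idea: parameters such as medians, means, and standard deviations are learned once from the training data, then frozen and reapplied to any new data. The sketch below illustrates that idea with plain Python dictionaries. It is a conceptual illustration only; the function `materialize` here is a hypothetical stand-in and does not reflect how `Blueprint.materialize()` is implemented.

```python
import statistics

def materialize(train_rows, impute_cols, scale_cols):
    """Learn imputation and scaling parameters from train_rows; return a transform function."""
    # Step 1: medians for imputation, learned from non-missing training values
    medians = {
        c: statistics.median([r[c] for r in train_rows if r[c] is not None])
        for c in impute_cols
    }

    def impute(row):
        # Returns a new dict; missing values in imputed columns get the stored median
        return {k: (medians[k] if k in medians and v is None else v) for k, v in row.items()}

    # Step 2: scaling parameters, learned after imputation has been applied
    imputed = [impute(r) for r in train_rows]
    scalers = {}
    for c in scale_cols:
        values = [r[c] for r in imputed]
        scalers[c] = (statistics.mean(values), statistics.stdev(values))

    def transform(rows):
        out = []
        for r in rows:
            r = impute(r)
            for c, (mean, std) in scalers.items():
                r[c] = (r[c] - mean) / std
            out.append(r)
        return out

    return transform

# Hypothetical training rows; None marks a missing value
train = [
    {"x": 1.0, "existing_emi": None},
    {"x": 2.0, "existing_emi": 10.0},
    {"x": 3.0, "existing_emi": 30.0},
]
transform = materialize(train, impute_cols=["existing_emi"], scale_cols=["x"])
new_rows = transform([{"x": 4.0, "existing_emi": None}])
```

Because the learned parameters live inside the returned function, new rows are transformed with the training statistics rather than their own, which is the property that makes a materialized pipeline safe to reuse at prediction time.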

And more!

Getting Started

import polars_ds as pds

To make full use of the Diagnosis module, do

pip install "polars_ds[plot]"

More Examples

See this for Polars Extensions: notebook

See this for Native Polars DataFrame Exploratory tools: notebook

HELP WANTED!

  1. Documentation writing, Doc Review, and Benchmark preparation

Road Map

  1. Standalone KNN and linear regression module.
  2. K-means, K-medoids clustering as expressions and also standalone modules.
  3. Other.

Disclaimer

Currently in Beta. Feel free to submit feature requests in the issues section of the repo. This library will only depend on Python Polars (for most of its core) and will try to be as stable as possible for polars>=1 (it currently supports polars>=0.20.16, but that will be dropped soon). Exceptions will be made when a Polars update forces changes in the plugins.

This package is not tested with Polars streaming mode and is not designed to work with data so big that it has to be streamed.

Credits

  1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See here
  2. Some statistics functions are taken from Statrs (MIT) and internalized. See here
  3. Linear algebra routines are powered partly by faer

Other related Projects

  1. Take a look at our friendly neighbor functime
  2. String similarity metrics are soooo fast and easy to use because of RapidFuzz

Download files

Source Distribution

polars_ds-0.6.0.tar.gz (2.1 MB), Source

Built Distributions

polars_ds-0.6.0-cp39-abi3-win_amd64.whl (14.0 MB), CPython 3.9+, Windows x86-64
polars_ds-0.6.0-cp39-abi3-manylinux_2_24_aarch64.whl (11.4 MB), CPython 3.9+, manylinux: glibc 2.24+ ARM64
polars_ds-0.6.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.9 MB), CPython 3.9+, manylinux: glibc 2.17+ x86-64
polars_ds-0.6.0-cp39-abi3-macosx_11_0_arm64.whl (11.0 MB), CPython 3.9+, macOS 11.0+ ARM64
polars_ds-0.6.0-cp39-abi3-macosx_10_12_x86_64.whl (12.4 MB), CPython 3.9+, macOS 10.12+ x86-64
