Polars for Data Science

Discord | Documentation | User Guide | Want to Contribute?
pip install polars-ds

The Project

PDS is a modern take on data science and traditional tabular machine learning. It is dataframe-centric in design and gets parallelism for free via Polars. It offers Polars syntax that works in both normal and aggregation contexts, and it provides these conveniences without any additional dependency: it depends only on Polars, unless you want the plotting functionality or interop with NumPy. It includes the most common functions from NumPy and SciPy, edit distances, KNN-related queries, EDA tools, feature engineering queries, and more. Most of the code is rewritten in Rust and is on par with, or even faster than, existing functions in SciPy and scikit-learn. The following are some examples:

Parallel evaluations of classification metrics on segments

import polars as pl
import polars_ds as pds

df.lazy().group_by("segments").agg( 
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()

shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ segments ┆ roc_auc  ┆ log_loss │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ a        ┆ 0.497745 ┆ 1.006438 │
│ b        ┆ 0.498801 ┆ 0.997226 │
└──────────┴──────────┴──────────┘
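
For readers who want to see what these two metrics compute per segment, here is a minimal pure-Python sketch. It is not the PDS implementation: the helper names (`roc_auc`, `log_loss`, `metrics_by_segment`) and the toy rows are made up for illustration, and the AUC uses the simple pairwise definition rather than anything optimized.

```python
import math
from collections import defaultdict


def roc_auc(actual, predicted):
    """Chance that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [p for a, p in zip(actual, predicted) if a == 1]
    neg = [p for a, p in zip(actual, predicted) if a == 0]
    if not pos or not neg:
        return float("nan")  # undefined unless both classes are present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_loss(actual, predicted, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for a, p in zip(actual, predicted):
        p = min(max(p, eps), 1.0 - eps)
        total -= a * math.log(p) + (1 - a) * math.log(1.0 - p)
    return total / len(actual)


def metrics_by_segment(rows):
    """rows: iterable of (segment, actual, predicted) tuples (hypothetical layout)."""
    groups = defaultdict(lambda: ([], []))
    for seg, a, p in rows:
        groups[seg][0].append(a)
        groups[seg][1].append(p)
    return {
        seg: {"roc_auc": roc_auc(a, p), "log_loss": log_loss(a, p)}
        for seg, (a, p) in groups.items()
    }


rows = [
    ("a", 1, 0.9), ("a", 0, 0.2), ("a", 1, 0.6), ("a", 0, 0.7),
    ("b", 1, 0.5), ("b", 0, 0.5),
]
result = metrics_by_segment(rows)
```

The expression version above does the same grouping inside Polars, so each segment can be evaluated in parallel instead of in a Python loop.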

Get all neighbors within radius r, call them best friends, and count them

df.select(
    pl.col("id"),
    pds.query_radius_ptwise(
        pl.col("var1"), pl.col("var2"), pl.col("var3"), # Columns used as the coordinates in 3d space
        index = pl.col("id"),
        r = 0.1, 
        dist = "sql2", # squared l2
        parallel = True
    ).alias("best friends"),
).with_columns( # -1 to remove the point itself
    (pl.col("best friends").list.len() - 1).alias("best friends count")
).head()

shape: (5, 3)
┌─────┬───────────────────┬────────────────────┐
│ id  ┆ best friends      ┆ best friends count │
│ --- ┆ ---               ┆ ---                │
│ u32 ┆ list[u32]         ┆ u32                │
╞═════╪═══════════════════╪════════════════════╡
│ 0   ┆ [0, 811, … 1435]  ┆ 152                │
│ 1   ┆ [1, 953, … 1723]  ┆ 159                │
│ 2   ┆ [2, 355, … 835]   ┆ 243                │
│ 3   ┆ [3, 102, … 1129]  ┆ 110                │
│ 4   ┆ [4, 1280, … 1543] ┆ 226                │
└─────┴───────────────────┴────────────────────┘
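
As a reference for what a radius query returns, the following brute-force sketch lists, for each point, the ids of all points within `r`. It rests on two assumptions not stated above: that with squared L2 the radius is compared against the squared distance, and that the boundary is inclusive. The function name `radius_neighbors` and the sample points are hypothetical.

```python
def radius_neighbors(points, ids, r):
    """For each point, collect ids of all points whose squared L2 distance is <= r.

    Brute force, O(n^2). Every point is its own neighbor (distance 0).
    """
    out = {}
    for i, p in zip(ids, points):
        out[i] = [
            j
            for j, q in zip(ids, points)
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= r
        ]
    return out


points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.3, 0.0, 0.0), (1.0, 1.0, 1.0)]
ids = [0, 1, 2, 3]
neighbors = radius_neighbors(points, ids, r=0.05)

# Subtract 1 so that a point does not count itself, as in the query above.
counts = {i: len(found) - 1 for i, found in neighbors.items()}
```

A tree-based index avoids the quadratic scan, which is why a dedicated query is preferable to this loop on real data.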

Ridge Regression on Categories

df = pds.random_data(size=5_000, n_cols=0).select(
    pds.random(0.0, 1.0).alias("x1"),
    pds.random(0.0, 1.0).alias("x2"),
    pds.random(0.0, 1.0).alias("x3"),
    pds.random_int(0, 3).alias("categories")
).with_columns(
    y = pl.col("x1") * 0.5 + pl.col("x2") * 0.25 - pl.col("x3") * 0.15 + pds.random() * 0.0001
)

df.group_by("categories").agg(
    pds.query_lstsq(
        "x1", "x2", "x3", 
        target = "y",
        method = "l2",
        l2_reg = 0.05,
        add_bias = False
    ).alias("coeffs")
) 

shape: (3, 2)
┌────────────┬─────────────────────────────────┐
│ categories ┆ coeffs                          │
│ ---        ┆ ---                             │
│ i32        ┆ list[f64]                       │
╞════════════╪═════════════════════════════════╡
│ 0          ┆ [0.499912, 0.250005, -0.149846… │
│ 1          ┆ [0.499922, 0.250004, -0.149856… │
│ 2          ┆ [0.499923, 0.250004, -0.149855… │
└────────────┴─────────────────────────────────┘
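
The recovered coefficients sit close to the true 0.5, 0.25 and -0.15. To make the computation concrete, here is a small sketch of textbook ridge regression without a bias term, solving (XᵀX + λI)β = Xᵀy. Whether PDS scales or applies `l2_reg` in exactly this way is an assumption; the helpers `solve` and `ridge` are written for this example only.

```python
import random


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ridge(xs, ys, l2_reg):
    """Ridge coefficients with no bias: solve (X^T X + l2_reg * I) beta = X^T y."""
    k = len(xs[0])
    xtx = [
        [sum(row[i] * row[j] for row in xs) + (l2_reg if i == j else 0.0) for j in range(k)]
        for i in range(k)
    ]
    xty = [sum(row[i] * y for row, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)


rng = random.Random(0)
xs = [[rng.random() for _ in range(3)] for _ in range(2000)]
ys = [0.5 * a + 0.25 * b - 0.15 * c for a, b, c in xs]  # noiseless target
coeffs = ridge(xs, ys, l2_reg=0.05)
```

With a small penalty and a few thousand rows, the shrinkage is tiny, which matches the near-exact coefficients in the table above.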

Various String Edit distances

df.select( # Column "word", compared to string in pl.lit(). It also supports column vs column comparison
    pds.str_leven("word", pl.lit("asasasa"), return_sim=True).alias("Levenshtein"),
    pds.str_osa("word", pl.lit("apples"), return_sim=True).alias("Optimal String Alignment"),
    pds.str_jw("word", pl.lit("apples")).alias("Jaro-Winkler"),
)
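
For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions. The sketch below is a plain dynamic-programming version, not the Rust code PDS runs. The similarity normalization (1 minus distance over the longer length) is an assumption about what `return_sim=True` means, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance using insertions, deletions and substitutions, each costing 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, free on a match
                )
            )
        prev = curr
    return prev[-1]


def levenshtein_sim(a, b):
    """Similarity in [0, 1]; assumed normalization: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")
similarity = levenshtein_sim("apple", "apples")
```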

In-dataframe statistical tests

df.group_by("market_id").agg(
    pds.query_ttest_ind("var1", "var2", equal_var=False).alias("t-test"),
    pds.query_chi2("category_1", "category_2").alias("chi2-test"),
    pds.query_f_test("var1", group = "category_1").alias("f-test")
)

shape: (3, 4)
┌───────────┬──────────────────────┬──────────────────────┬─────────────────────┐
│ market_id ┆ t-test               ┆ chi2-test            ┆ f-test              │
│ ---       ┆ ---                  ┆ ---                  ┆ ---                 │
│ i64       ┆ struct[2]            ┆ struct[2]            ┆ struct[2]           │
╞═══════════╪══════════════════════╪══════════════════════╪═════════════════════╡
│ 0         ┆ {2.072749,0.038272}  ┆ {33.487634,0.588673} ┆ {0.312367,0.869842} │
│ 1         ┆ {0.469946,0.638424}  ┆ {42.672477,0.206119} ┆ {2.148937,0.072536} │
│ 2         ┆ {-1.175325,0.239949} ┆ {28.55723,0.806758}  ┆ {0.506678,0.730849} │
└───────────┴──────────────────────┴──────────────────────┴─────────────────────┘
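
Each struct presumably holds a test statistic followed by a p-value. As an illustration of the first field for the unequal-variance case (`equal_var=False`), here is a sketch of Welch's t statistic and its Welch-Satterthwaite degrees of freedom. It stops short of the p-value, which needs a Student-t CDF that the standard library does not provide. The function name `welch_t` and the sample data are made up.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return t, df


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(x, y)
```

Running such a function once per market in Python is what the `group_by(...).agg(...)` form replaces with a single parallel query.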

Multiple Convolutions at once!

# Multiple Convolutions at once
# Modes: `same`, `left` (left-aligned same), `right` (right-aligned same), `valid` or `full`
# Method: `fft`, `direct`
# Currently slower than SciPy but provides parallelism because of Polars
df.select(
    pds.convolve("f", [-1, 0, 0, 0, 1], mode = "full", method = "fft"), # column f with the kernel given here
    pds.convolve("a", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
    pds.convolve("b", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
).head()
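
To make the modes concrete, here is a direct (non-FFT) convolution sketch for `full` and `valid`. It follows the usual convention that `full` returns every overlap position and `valid` keeps only positions where the kernel lies entirely inside the signal; that PDS uses exactly these conventions is an assumption, and the `left`, `right` and `same` modes are not covered. The function names are hypothetical.

```python
def convolve_full(signal, kernel):
    """Direct convolution; output length is len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def convolve_valid(signal, kernel):
    """Keep only outputs where the kernel fully overlaps the signal.

    Assumes len(signal) >= len(kernel).
    """
    full = convolve_full(signal, kernel)
    return full[len(kernel) - 1 : len(signal)]


signal = [1, 2, 3, 4, 5, 6]
kernel = [-1, 0, 0, 0, 1]
full = convolve_full(signal, kernel)
valid = convolve_valid(signal, kernel)
```

With this kernel, each `valid` output equals `signal[m - 4] - signal[m]`, so a linear ramp yields a constant.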

Tabular Machine Learning Data Transformation Pipeline

import polars as pl
import polars.selectors as cs
from polars_ds.pipeline import Pipeline, Blueprint

bp = (
    # If we specify a target, then target will be excluded from any transformations.
    Blueprint(df, name = "example", target = "approved") 
    .lowercase() # lowercase all columns
    .select(cs.numeric() | cs.by_name(["gender", "employer_category1", "city_category"]))
    # Impute loan_period by running a simple linear regression. 
    # Explicitly put target, since this is not the target for prediction. 
    .linear_impute(features = ["var1", "existing_emi"], target = "loan_period") 
    .impute(["existing_emi"], method = "median")
    .append_expr( # generate some features
        pl.col("existing_emi").log1p().alias("existing_emi_log1p"),
        pl.col("loan_amount").log1p().alias("loan_amount_log1p"),
        pl.col("loan_amount").sqrt().alias("loan_amount_sqrt"),
        pl.col("loan_amount").shift(-1).alias("loan_amount_lag_1") # shift(-1) takes the next row's value; use shift(1) for the previous row
    )
    .scale( # target is numerical, but will be excluded automatically because bp is initialized with a target
        cs.numeric().exclude(["var1", "existing_emi_log1p"]), method = "standard"
    ) # Scale the columns up to this point. The columns below won't be scaled
    .append_expr(
        # Add missing flags
        pl.col("employer_category1").is_null().cast(pl.UInt8).alias("employer_category1_is_missing")
    )
    .one_hot_encode("gender", drop_first=True)
    .woe_encode("city_category") # No need to specify target because we initialized bp with a target
    .target_encode("employer_category1", min_samples_leaf = 20, smoothing = 10.0) # same as above
)

pipe: Pipeline = bp.materialize()
# Check out the result in our example notebooks!
df_transformed = pipe.transform(df)
df_transformed.head()
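
The blueprint/pipeline split follows the familiar fit-then-transform idea: statistics are learned once and then applied unchanged to any later data. The sketch below illustrates that idea for a single column with median imputation followed by standard scaling. It is not how `Blueprint` or `Pipeline` are implemented; `fit_pipeline` is a hypothetical helper, rows are plain dicts, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, median, stdev


def fit_pipeline(train, column):
    """Learn imputation and scaling parameters from `train`, return a transform function.

    Assumes the column is not constant, so the standard deviation is non-zero.
    """
    observed = [row[column] for row in train if row[column] is not None]
    fill = median(observed)  # learned imputation value
    filled = [fill if row[column] is None else row[column] for row in train]
    mu, sigma = mean(filled), stdev(filled)  # learned scaling parameters

    def transform(rows):
        # Apply the stored parameters; nothing is re-estimated from `rows`.
        out = []
        for row in rows:
            value = fill if row[column] is None else row[column]
            out.append({**row, column: (value - mu) / sigma})
        return out

    return transform


train = [
    {"existing_emi": 1.0},
    {"existing_emi": None},
    {"existing_emi": 3.0},
    {"existing_emi": 2.0},
]
transform = fit_pipeline(train, "existing_emi")
scaled = transform([{"existing_emi": None}, {"existing_emi": 4.0}])
```

Keeping the learned values inside the returned function is what prevents statistics from being re-estimated, and leaked, when new data is transformed.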

And more!

Getting Started

import polars_ds as pds

To make full use of the Diagnosis module, install the plotting extras:

pip install "polars_ds[plot]"

More Examples

See this for Polars Extensions: notebook

See this for native Polars DataFrame exploratory tools: notebook

HELP WANTED!

  1. Documentation writing, Doc Review, and Benchmark preparation

Road Map

  1. Standalone KNN and linear regression module.
  2. K-means, K-medoids clustering as expressions and also standalone modules.
  3. Other.

Disclaimer

Currently in beta. Feel free to submit feature requests in the issues section of the repo. This library will only depend on Python Polars and will try to be as stable as possible for polars>=0.20.6. Exceptions will be made when a Polars update forces changes in the plugins.

This package is not tested with Polars streaming mode and is not designed to work with data so large that it has to be streamed.

Credits

  1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See here
  2. Some statistics functions are taken from Statrs (MIT) and internalized. See here
  3. Graph functionalities are powered by the petgraph crate. See here
  4. Linear algebra routines are powered partly by faer

Other related Projects

  1. Take a look at our friendly neighbor functime
  2. String similarity metrics are so fast and easy to use because of RapidFuzz

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

  • polars_ds-0.5.1.tar.gz (2.1 MB): Source

Built Distributions

  • polars_ds-0.5.1-cp38-abi3-win_amd64.whl (13.8 MB): CPython 3.8+, Windows x86-64
  • polars_ds-0.5.1-cp38-abi3-manylinux_2_24_aarch64.whl (11.3 MB): CPython 3.8+, manylinux: glibc 2.24+ ARM64
  • polars_ds-0.5.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB): CPython 3.8+, manylinux: glibc 2.17+ x86-64
  • polars_ds-0.5.1-cp38-abi3-macosx_11_0_arm64.whl (10.8 MB): CPython 3.8+, macOS 11.0+ ARM64
  • polars_ds-0.5.1-cp38-abi3-macosx_10_12_x86_64.whl (12.3 MB): CPython 3.8+, macOS 10.12+ x86-64
