
Polars for Data Science

Discord | Documentation | User Guide | Want to Contribute?
pip install polars-ds

The Project

PDS is a modern take on data science and traditional tabular machine learning. It is dataframe-centric in design and provides parallelism for free via Polars. It offers Polars syntax that works in both normal and aggregation contexts, and it delivers these conveniences without any additional dependency. It includes the most common functions from NumPy and SciPy, string edit distances, KNN-related queries, EDA tools, feature engineering queries, etc. Yes, it only depends on Polars (unless you want to use the plotting functionality or interop with NumPy). Most of the code is rewritten in Rust and is on par with, or even faster than, the existing functions in SciPy and scikit-learn. Here are some examples:

Parallel evaluations of classification metrics on segments

import polars as pl
import polars_ds as pds

df.lazy().group_by("segments").agg( 
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()

shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ segments │ roc_auc  │ log_loss │
│ ---      │ ---      │ ---      │
│ str      │ f64      │ f64      │
╞══════════╪══════════╪══════════╡
│ a        │ 0.497745 │ 1.006438 │
│ b        │ 0.498801 │ 0.997226 │
└──────────┴──────────┴──────────┘
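Conceptually, ROC AUC is the Mann-Whitney U statistic rescaled to [0, 1]. The following is a minimal pure-Python sketch of that computation (not the polars-ds implementation, which runs in Rust per group, in parallel):

```python
# ROC AUC via average ranks: probability that a random positive
# is scored above a random negative (ties count half).

def roc_auc(actual, predicted):
    """actual: 0/1 labels; predicted: scores. Tied scores get average ranks."""
    pairs = sorted(zip(predicted, actual))
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    n_pos = sum(a for _, a in pairs)
    n_neg = len(pairs) - n_pos
    rank_sum = sum(r for r, (_, a) in zip(ranks, pairs) if a == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```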

Get all neighbors within radius r, call them best friends, and count their number

df.select(
    pl.col("id"),
    pds.query_radius_ptwise(
        pl.col("var1"), pl.col("var2"), pl.col("var3"), # Columns used as the coordinates in 3d space
        index = pl.col("id"),
        r = 0.1, 
        dist = "sql2", # squared l2
        parallel = True
    ).alias("best friends"),
).with_columns( # -1 to remove the point itself
    (pl.col("best friends").list.len() - 1).alias("best friends count")
).head()

shape: (5, 3)
┌─────┬───────────────────┬────────────────────┐
│ id  │ best friends      │ best friends count │
│ --- │ ---               │ ---                │
│ u32 │ list[u32]         │ u32                │
╞═════╪═══════════════════╪════════════════════╡
│ 0   │ [0, 811, … 1435]  │ 152                │
│ 1   │ [1, 953, … 1723]  │ 159                │
│ 2   │ [2, 355, … 835]   │ 243                │
│ 3   │ [3, 102, … 1129]  │ 110                │
│ 4   │ [4, 1280, … 1543] │ 226                │
└─────┴───────────────────┴────────────────────┘
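For intuition, here is what the radius query does, as a brute-force O(n²) pure-Python sketch (polars-ds uses a kd-tree in Rust instead): each point collects the indices of all points within squared-L2 distance `r`, including itself, which is why the snippet above subtracts 1.

```python
# Brute-force radius query with squared-L2 ("sql2") distance.

def radius_neighbors(points, r):
    """points: list of coordinate tuples; r: squared-L2 radius."""
    out = []
    for p in points:
        friends = [
            i for i, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= r
        ]
        out.append(friends)  # includes the point itself
    return out

pts = [(0.0, 0.0), (0.05, 0.0), (1.0, 1.0)]
print(radius_neighbors(pts, 0.1))  # [[0, 1], [0, 1], [2]]
```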

Ridge Regression on Categories

df = pds.random_data(size=5_000, n_cols=0).select(
    pds.random(0.0, 1.0).alias("x1"),
    pds.random(0.0, 1.0).alias("x2"),
    pds.random(0.0, 1.0).alias("x3"),
    pds.random_int(0, 3).alias("categories")
).with_columns(
    y = pl.col("x1") * 0.5 + pl.col("x2") * 0.25 - pl.col("x3") * 0.15 + pds.random() * 0.0001
)

df.group_by("categories").agg(
    pds.query_lstsq(
        "x1", "x2", "x3", 
        target = "y",
        method = "l2",
        l2_reg = 0.05,
        add_bias = False
    ).alias("coeffs")
) 

shape: (3, 2)
┌────────────┬─────────────────────────────────┐
│ categories │ coeffs                          │
│ ---        │ ---                             │
│ i32        │ list[f64]                       │
╞════════════╪═════════════════════════════════╡
│ 0          │ [0.499912, 0.250005, -0.149846] │
│ 1          │ [0.499922, 0.250004, -0.149856] │
│ 2          │ [0.499923, 0.250004, -0.149855] │
└────────────┴─────────────────────────────────┘
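As a reminder of what `method = "l2"` solves: ridge regression finds `w` from the regularized normal equations `(XᵀX + λI)w = Xᵀy`. In the single-feature, no-bias case this collapses to one line, sketched below in pure Python (the real query solves the multi-feature system in Rust, via faer):

```python
# One-feature ridge without bias: w = sum(x*y) / (sum(x^2) + lambda).

def ridge_1d(x, y, l2_reg):
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + l2_reg)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]          # exactly y = 2x
print(ridge_1d(x, y, 0.0))   # 2.0 with no regularization
print(ridge_1d(x, y, 0.05))  # slightly shrunk toward 0
```

The shrinkage toward 0 as `l2_reg` grows is the point of the penalty: it trades a little bias for stability on noisy or collinear features.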

Various String Edit distances

df.select( # compare column "word" to the literal string in pl.lit(); column vs column comparison is also supported
    pds.str_leven("word", pl.lit("asasasa"), return_sim=True).alias("Levenshtein"),
    pds.str_osa("word", pl.lit("apples"), return_sim=True).alias("Optimal String Alignment"),
    pds.str_jw("word", pl.lit("apples")).alias("Jaro-Winkler"),
)
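For reference, here is a plain-Python Levenshtein distance and one common way to turn it into a similarity, `1 - distance / max(len(a), len(b))` (a conceptual sketch of what `return_sim=True` gives you; polars-ds computes these in Rust, and its exact normalization may differ):

```python
# Dynamic-programming Levenshtein distance, one row at a time.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def leven_sim(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))          # 3
print(round(leven_sim("kitten", "sitting"), 4))  # 0.5714
```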

In-dataframe statistical tests

df.group_by("market_id").agg(
    pds.query_ttest_ind("var1", "var2", equal_var=False).alias("t-test"),
    pds.query_chi2("category_1", "category_2").alias("chi2-test"),
    pds.query_f_test("var1", group = "category_1").alias("f-test")
)

shape: (3, 4)
┌───────────┬──────────────────────┬──────────────────────┬─────────────────────┐
│ market_id │ t-test               │ chi2-test            │ f-test              │
│ ---       │ ---                  │ ---                  │ ---                 │
│ i64       │ struct[2]            │ struct[2]            │ struct[2]           │
╞═══════════╪══════════════════════╪══════════════════════╪═════════════════════╡
│ 0         │ {2.072749,0.038272}  │ {33.487634,0.588673} │ {0.312367,0.869842} │
│ 1         │ {0.469946,0.638424}  │ {42.672477,0.206119} │ {2.148937,0.072536} │
│ 2         │ {-1.175325,0.239949} │ {28.55723,0.806758}  │ {0.506678,0.730849} │
└───────────┴──────────────────────┴──────────────────────┴─────────────────────┘
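The `equal_var=False` variant is Welch's t-test. Its statistic is easy to sketch in pure Python (the struct above also carries the p-value, which needs the Student-t CDF and is omitted here; this is an illustration, not the polars-ds code path):

```python
# Welch's t statistic: difference of means over the combined
# standard error, using each sample's own variance.
import math

def welch_t(x, y):
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)  # sample variance of y
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

print(round(welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4))  # -1.5492
```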

Multiple Convolutions at once!

# Multiple Convolutions at once
# Modes: `same`, `left` (left-aligned same), `right` (right-aligned same), `valid` or `full`
# Method: `fft`, `direct`
# Currently slower than SciPy but provides parallelism because of Polars
df.select(
    pds.convolve("f", [-1, 0, 0, 0, 1], mode = "full", method = "fft"), # column f with the kernel given here
    pds.convolve("a", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
    pds.convolve("b", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
).head()
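What `mode = "full", method = "direct"` computes is the length `m + n - 1` discrete convolution; the `fft` method yields the same values via frequency-domain multiplication. A minimal pure-Python sketch (polars-ds does this in Rust):

```python
# Direct "full" convolution: every shift of the kernel over the signal.

def convolve_full(f, kernel):
    m, n = len(f), len(kernel)
    out = [0.0] * (m + n - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(kernel):
            out[i + j] += a * b
    return out

print(convolve_full([1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]))
# [-1.0, -2.0, -2.0, 2.0, 3.0]
```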

Tabular Machine Learning Data Transformation Pipeline

import polars as pl
import polars.selectors as cs
from polars_ds.pipeline import Pipeline, Blueprint

bp = (
    # If we specify a target, then target will be excluded from any transformations.
    Blueprint(df, name = "example", target = "approved") 
    .lowercase() # lowercase all columns
    .select(cs.numeric() | cs.by_name(["gender", "employer_category1", "city_category"]))
    # Impute loan_period by running a simple linear regression. 
    # Explicitly put target, since this is not the target for prediction. 
    .linear_impute(features = ["var1", "existing_emi"], target = "loan_period") 
    .impute(["existing_emi"], method = "median")
    .append_expr( # generate some features
        pl.col("existing_emi").log1p().alias("existing_emi_log1p"),
        pl.col("loan_amount").log1p().alias("loan_amount_log1p"),
        pl.col("loan_amount").sqrt().alias("loan_amount_sqrt"),
        pl.col("loan_amount").shift(-1).alias("loan_amount_lag_1") # any kind of lag transform
    )
    .scale( # target is numerical, but will be excluded automatically because bp is initialized with a target
        cs.numeric().exclude(["var1", "existing_emi_log1p"]), method = "standard"
    ) # Scale the columns up to this point. The columns below won't be scaled
    .append_expr(
        # Add missing flags
        pl.col("employer_category1").is_null().cast(pl.UInt8).alias("employer_category1_is_missing")
    )
    .one_hot_encode("gender", drop_first=True)
    .woe_encode("city_category") # No need to specify target because we initialized bp with a target
    .target_encode("employer_category1", min_samples_leaf = 20, smoothing = 10.0) # same as above
)

pipe: Pipeline = bp.materialize()
# Check out the result in our example notebooks!
df_transformed = pipe.transform(df)
df_transformed.head()
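The blueprint/pipeline split above follows a general pattern: a blueprint records steps lazily, `materialize()` fits whatever needs fitting (medians, scaling statistics, encodings) against the training data, and the resulting pipeline only applies the fitted steps. A toy illustration of that pattern in plain Python (the class and method names here are illustrative, not the polars-ds API):

```python
# Blueprint records (fit -> apply) steps; materialize() fits them once.

class ToyBlueprint:
    def __init__(self):
        self.steps = []  # list of fit functions, each returning an apply function

    def impute_median(self, col):
        def fit(rows):
            vals = sorted(r[col] for r in rows if r[col] is not None)
            med = vals[len(vals) // 2]  # upper median, for simplicity
            return lambda r: {**r, col: med if r[col] is None else r[col]}
        self.steps.append(fit)
        return self

    def materialize(self, rows):
        fitted = [fit(rows) for fit in self.steps]  # learn from training data
        def transform(new_rows):
            out = []
            for r in new_rows:
                for apply_step in fitted:
                    r = apply_step(r)
                out.append(r)
            return out
        return transform

rows = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
pipe = ToyBlueprint().impute_median("x").materialize(rows)
print(pipe([{"x": None}]))  # [{'x': 3.0}]
```

The payoff of the split is that fitting happens exactly once, and the materialized pipeline can then be applied to any number of new dataframes.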

And more!

Getting Started

import polars_ds as pds

To make full use of the Diagnosis module, do

pip install "polars_ds[plot]"

More Examples

See this for Polars Extensions: notebook

See this for Native Polars DataFrame Explorative tools: notebook

HELP WANTED!

  1. Documentation writing, Doc Review, and Benchmark preparation

Road Map

  1. Standalone KNN and linear regression module.
  2. K-means, K-medoids clustering as expressions and also standalone modules.
  3. Other.

Disclaimer

Currently in beta. Feel free to submit feature requests in the issues section of the repo. This library only depends on Python Polars and will try to stay as stable as possible for polars>=0.20.6. Exceptions will be made when a Polars update forces changes in the plugins.

This package is not tested with Polars streaming mode and is not designed to work with data so large that it has to be streamed.

Credits

  1. The Rust Snowball stemmer is taken from Tsoding's Seroost project (MIT). See here
  2. Some statistics functions are taken from Statrs (MIT) and internalized. See here
  3. Graph functionalities are powered by the petgraph crate. See here
  4. Linear algebra routines are powered in part by faer

Other related Projects

  1. Take a look at our friendly neighbor functime
  2. String similarity metrics are soooo fast and easy to use because of RapidFuzz
