
Polars for Data Science

Discord | Documentation | User Guide | Want to Contribute?
pip install polars-ds

The Project

The goal of the project is to reduce dependencies, improve code organization, simplify data pipelines, and overall facilitate the analysis of the various kinds of tabular data that a data scientist may encounter. It is a package built around your favorite Polars dataframe. Here are the main areas of data science covered by the package:

  1. Well-known numerical transforms/quantities, e.g. FFT, conditional entropy, singular values, basic linear-regression quantities, population stability index, weight of evidence, column-wise/row-wise Jaccard similarity, etc. (A short sketch of the expression style follows this list.)

  2. Statistics. Basic tests such as the t-test, F-test, and the Kolmogorov-Smirnov statistic; miscellaneous functions like weighted correlation and Xi-correlation; in-dataframe random column generation, etc.

  3. Metrics. ML metrics for common model performance reporting, e.g. ROC AUC for binary/multiclass classification, log loss, R², MAPE, etc.

  4. KNN-related queries, e.g. filter to the k nearest neighbors of a point, find the indices of all neighbors within a certain distance, etc.

  5. String metrics such as Levenshtein distance, Damerau-Levenshtein distance, other string distances, Snowball stemming (English only), string Jaccard similarity, etc.

  6. Diagnosis. This module contains the DIA (Data Inspection Assistant) class, which can help you profile your data, visualize data in lower dimensions, detect functional dependencies, and detect other common data quality issues such as high null rates or high correlation. (Requires plotly, great_tables, and graphviz as optional dependencies.)

  7. Sample. Traditional dataset sampling (no time series sampling yet). This module provides functionality such as stratified downsampling, volume-neutral random sampling, etc.

  8. Polars Native ML Pipeline. See examples here. The goal is a Polars-native pipeline that can replace Scikit-learn's pipeline while providing all the benefits of Polars. All the basic transforms in Scikit-learn and categorical-encoders are planned. This can be super powerful together with Polars's expressions. (Basically, once you have expressions, you don't need to write custom transforms like col(A)/col(B), log transform, sqrt transform, linear/polynomial transforms, etc.) Polars's expressions also offer JSON serialization in recent versions, which makes this desirable for use in the cloud. (This part is under active development.)
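
For a taste of the expression style, here is a minimal sketch. pds.query_roc_auc and pds.query_log_loss appear in the worked example further below; Expr.meta.serialize is part of Polars itself (it returns a JSON string in the 0.20.x line this package targets).

import polars as pl
import polars_ds as pds

df = pl.DataFrame({
    "actual": [0, 1, 1, 0, 1],
    "predicted": [0.1, 0.8, 0.7, 0.4, 0.9],
})

# Metrics are ordinary Polars expressions, so they compose with select/group_by.
report = df.select(
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
)

# Expressions can be serialized to JSON (see item 8), which is useful in the cloud.
expr_json = (pl.col("predicted") - pl.col("predicted").mean()).meta.serialize()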

Some other areas currently exist but are de-prioritized:

  1. Complex-number-related queries. (Will be removed in v0.5.)

  2. Graph-related queries. (The various representations of "graphs" in tabular dataframes make it hard to handle such data consistently in the backend.)

But why? Why not use Sklearn? SciPy? NumPy?

The goal of the package is to facilitate data processing and analysis that go beyond standard SQL queries, and to reduce the number of dependencies in your project. It incorporates parts of SciPy, NumPy, Scikit-learn, and NLTK, and treats them as Polars queries so that they can be run in parallel and in group_by contexts, all for almost no extra engineering effort.

Let's see an example. Say we want to generate a model performance report. Our data has segments: we are interested not only in the model's ROC AUC on the entire dataset, but also in its performance on each segment.

import numpy as np
import polars as pl
import polars_ds as pds

size = 100_000
df = pl.DataFrame({
    "a": np.random.random(size=size),
    "b": np.random.random(size=size),
    "x1": range(size),
    "x2": range(size, size + size),
    "y": range(-size, 0),
    "actual": np.round(np.random.random(size=size)).astype(np.int32),
    "predicted": np.random.random(size=size),
    "segments": ["a"] * (size // 2 + 100) + ["b"] * (size // 2 - 100),
})
print(df.head())

shape: (5, 8)
┌──────────┬──────────┬─────┬────────┬─────────┬────────┬───────────┬──────────┐
│ a        ┆ b        ┆ x1  ┆ x2     ┆ y       ┆ actual ┆ predicted ┆ segments │
│ ---      ┆ ---      ┆ --- ┆ ---    ┆ ---     ┆ ---    ┆ ---       ┆ ---      │
│ f64      ┆ f64      ┆ i64 ┆ i64    ┆ i64     ┆ i32    ┆ f64       ┆ str      │
╞══════════╪══════════╪═════╪════════╪═════════╪════════╪═══════════╪══════════╡
│ 0.19483  ┆ 0.457516 ┆ 0   ┆ 100000 ┆ -100000 ┆ 0      ┆ 0.929007  ┆ a        │
│ 0.396265 ┆ 0.833535 ┆ 1   ┆ 100001 ┆ -99999  ┆ 1      ┆ 0.103915  ┆ a        │
│ 0.800558 ┆ 0.030437 ┆ 2   ┆ 100002 ┆ -99998  ┆ 1      ┆ 0.558918  ┆ a        │
│ 0.608023 ┆ 0.411389 ┆ 3   ┆ 100003 ┆ -99997  ┆ 1      ┆ 0.883684  ┆ a        │
│ 0.847527 ┆ 0.506504 ┆ 4   ┆ 100004 ┆ -99996  ┆ 1      ┆ 0.070269  ┆ a        │
└──────────┴──────────┴─────┴────────┴─────────┴────────┴───────────┴──────────┘

Traditionally, using the Pandas + Sklearn stack, we would do:

import pandas as pd
from sklearn.metrics import roc_auc_score

df_pd = df.to_pandas()

segments = []
rocaucs = []

for (segment, subdf) in df_pd.groupby("segments"):
    segments.append(segment)
    rocaucs.append(
        roc_auc_score(subdf["actual"], subdf["predicted"])
    )

report = pd.DataFrame({
    "segments": segments,
    "roc_auc": rocaucs
})
print(report)

  segments   roc_auc
0        a  0.497745
1        b  0.498801

This is OK, but not great, because (1) we are running for loops in Python, which tends to be slow; (2) we are writing more Python code, which leaves more room for errors in bigger projects; and (3) the code is not very intuitive for beginners. Using Polars + polars-ds, one can do the following:

df.lazy().group_by("segments").agg(
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()

shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ segments ┆ roc_auc  ┆ log_loss │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ a        ┆ 0.497745 ┆ 1.006438 │
│ b        ┆ 0.498801 ┆ 0.997226 │
└──────────┴──────────┴──────────┘

Notice a few things: (1) computing ROC AUC on different segments is just an aggregation over segments, a concept everyone who knows SQL (a.k.a. everybody who works with data) will be familiar with! (2) There is no custom Python code: the extension is written in pure Rust, and all the complexity is hidden from the end user. (3) Because Polars provides parallel execution for free, we can compute ROC AUC and log loss simultaneously on each segment!

The end result is simpler, more intuitive code that is easier to reason about, and faster execution. Because of Polars's extension (plugin) system, we are now blessed with both performance and elegance - something that is quite rare in the Python world.
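
Since these metrics are ordinary Polars expressions, the query also composes with everything else Polars offers. A minimal sketch, reusing df from above (pl.len is part of Polars itself in the versions this package supports):

report = (
    df.lazy()
    .filter(pl.col("x1") % 2 == 0)  # e.g. restrict to a sub-population first
    .group_by("segments")
    .agg(
        pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
        pds.query_log_loss("actual", "predicted").alias("log_loss"),
        pl.len().alias("n"),  # row count per segment
    )
    .collect()
)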

But Pandas can do it too...?

Experienced Pandas users will notice that one can do something similar in Pandas:

from sklearn.metrics import roc_auc_score, log_loss
import pandas as pd

(
    df_pd
    .groupby("segments")
    .apply(
        lambda df2: pd.Series({
            "roc_auc" : roc_auc_score(df2["actual"], df2["predicted"]),
            "logloss" : log_loss(df2["actual"], df2["predicted"])
        })
    )
)

I would argue that the code above has two problems: (1) it is aesthetically ugly and verbose, which leads to a higher chance of mistakes, and (2) it has terrible performance on large data.

What does apply mean here? What we want is an aggregation over a group, a very natural SQL concept. The extra lingo is not only hard to remember but also confusing: in addition to apply, Pandas provides agg, assign, and transform, which all behave differently and make the API harder for beginners for no good reason.

The use of a lambda also introduces additional symbols like : and braces, which often leads to errors such as unbalanced braces. Lastly, using a lambda or any custom Python function blocks parallel execution, because each thread needs to acquire the GIL before it can do the work.
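
If you want to see the difference on your own machine, here is a minimal timing sketch, reusing df, df_pd, and the imports from the examples above. The numbers depend entirely on your data and hardware; nothing here is a published benchmark.

import time

start = time.perf_counter()
df_pd.groupby("segments").apply(
    lambda d: pd.Series({
        "roc_auc": roc_auc_score(d["actual"], d["predicted"]),
        "logloss": log_loss(d["actual"], d["predicted"]),
    })
)
pandas_seconds = time.perf_counter() - start

start = time.perf_counter()
df.lazy().group_by("segments").agg(
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()
polars_seconds = time.perf_counter() - start

print(f"pandas + sklearn: {pandas_seconds:.3f}s | polars + polars-ds: {polars_seconds:.3f}s")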

Getting Started

import polars_ds as pds

To make full use of the Diagnosis module, do

pip install "polars_ds[plot]"
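
To get a feel for it, here is a minimal sketch of the Diagnosis module described earlier. The import path, constructor, and method name are assumptions based on this description, not confirmed API; check the documentation for your installed version.

import polars as pl
from polars_ds.diagnosis import DIA  # assumed import path for the Data Inspection Assistant

df = pl.DataFrame({
    "a": [1.0, 2.0, 3.0, None],
    "b": [2.0, 4.0, 6.0, 8.0],
})

dia = DIA(df)  # assumed: DIA wraps a dataframe
dia.profile()  # assumed: per-column profile (null rates, basic stats, ...)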

Examples

See this for Polars Extensions: notebook

See this for Native Polars DataFrame Explorative tools: notebook

Disclaimer

Currently in beta. Feel free to submit feature requests in the issues section of the repo. This library only depends on Python Polars and will try to be as stable as possible for polars>=0.20.6. Exceptions will be made when a Polars update forces changes in the plugins.

This package is not tested with Polars streaming mode and is not designed to work with data so big that it has to be streamed.

The recommended usage is for datasets of 1k to 2-3 million rows, though actual performance will vary with the dataset and hardware. Performance will only be a priority for datasets that fit in memory. It is a known fact that KNN performance suffers greatly with a large k. String KNN and graph queries are only suitable for smaller data, around 1-5k rows on common computers.

Credits

  1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See here
  2. Some statistics functions are taken from Statrs (MIT) and internalized. See here
  3. Graph functionalities are powered by the petgraph crate. See here
  4. Linear algebra routines are powered partly by faer

Other related Projects

  1. Take a look at our friendly neighbor functime
  2. String similarity metrics are soooo fast and easy to use because of RapidFuzz
