Polars Extension for General Data Science Use

A Polars plugin that aims to simplify common numerical and string data analysis procedures, so that the most basic data science, stats, and NLP tasks can be done natively inside a dataframe. Its goal is not to replace SciPy or NumPy, but to reduce dependencies for common workflows and simple analysis, and to cut down on Python-side code and UDFs.

Currently in Alpha. Feel free to submit feature requests in the issues section of the repo.

This package will also be a "lower level" backend for another package of mine called dsds. See here.

Performance is a focus, but sometimes it's impossible to beat NumPy/SciPy performance for a single operation on a single array. There can be many reasons: interop cost (sometimes copies are needed), null checks, lack of support for complex numbers (e.g. we have to make multiple copies in the FFT implementation), or we simply haven't found the most optimized way to write some algorithm.

However, there are greater benefits for staying in DataFrame land:

  1. It works with the Polars expression engine, so more expressions can be executed in parallel. E.g. running an FFT on one series may be slower than in NumPy, but if you are running an FFT together with other non-trivial operations, the story changes completely.
  2. It works in a group_by context. E.g. you can run multiple linear regressions in parallel in a group_by context.
  3. Staying in DataFrame land typically keeps code cleaner and less confusing.

Some examples:

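import polars as pl
import polars_ds  # assumption: importing the package registers the num_ext namespace

df.group_by("dummy").agg(
    pl.col("y").num_ext.lstsq(pl.col("a"), pl.col("b"), add_bias=False).alias("list_float")
)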

shape: (2, 2)
┌───────┬─────────────┐
│ dummy ┆ list_float  │
│ ---   ┆ ---         │
│ str   ┆ list[f64]   │
╞═══════╪═════════════╡
│ b     ┆ [2.0, -1.0] │
│ a     ┆ [2.0, -1.0] │
└───────┴─────────────┘
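For readers who want to sanity-check what a least-squares fit without a bias term returns, here is a minimal pure-Python sketch. It is not the package's implementation: the helper name lstsq_2d and the sample data are made up for illustration, and it assumes y was generated as 2*a - b, which would explain coefficients like [2.0, -1.0] in the output above.

```python
def lstsq_2d(a, b, y):
    """Fit y ~ c_a * a + c_b * b (no bias) by solving the 2x2 normal equations."""
    saa = sum(x * x for x in a)
    sbb = sum(z * z for z in b)
    sab = sum(x * z for x, z in zip(a, b))
    say = sum(x * t for x, t in zip(a, y))
    sby = sum(z * t for z, t in zip(b, y))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("a and b are collinear; the fit is not unique")
    # Cramer's rule on [[saa, sab], [sab, sbb]] @ [c_a, c_b] = [say, sby]
    return [(say * sbb - sab * sby) / det, (saa * sby - sab * say) / det]


# Hypothetical data for one group, built so that y = 2*a - b exactly.
a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, -1.0, 2.0, 0.0]
y = [2 * x - z for x, z in zip(a, b)]

coeffs = lstsq_2d(a, b, y)
print(coeffs)  # coefficients for a and b, in that order
```

In the group_by example, a fit of this kind is produced once per group, which is why each row of the result holds a list of coefficients.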

df.group_by("dummy_groups").agg(
    pl.col("actual").num_ext.l2_loss(pl.col("predicted")).alias("l2"),
    pl.col("actual").num_ext.bce(pl.col("predicted")).alias("log loss"),
    pl.col("actual").num_ext.roc_auc(pl.col("predicted")).alias("roc_auc")
)

shape: (2, 4)
┌──────────────┬──────────┬──────────┬──────────┐
│ dummy_groups ┆ l2       ┆ log loss ┆ roc_auc  │
│ ---          ┆ ---      ┆ ---      ┆ ---      │
│ str          ┆ f64      ┆ f64      ┆ f64      │
╞══════════════╪══════════╪══════════╪══════════╡
│ b            ┆ 0.333887 ┆ 0.999602 ┆ 0.498913 │
│ a            ┆ 0.332575 ┆ 0.997049 ┆ 0.501997 │
└──────────────┴──────────┴──────────┴──────────┘
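The three metrics can be cross-checked with a few lines of plain Python. The sketch below is only a reference under stated assumptions, not the package's code: it assumes l2_loss means mean squared error, bce means mean binary cross-entropy, and roc_auc is the usual rank-based (Mann-Whitney) AUC with no tied scores. With random 0/1 labels and uniform random predictions these come out near 1/3, 1 and 0.5, which appears consistent with the sample output above.

```python
import math
import random


def l2_loss(actual, pred):
    # Mean squared error.
    return sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)


def bce(actual, pred, eps=1e-12):
    # Mean binary cross-entropy; predictions are clipped to avoid log(0).
    total = 0.0
    for a, p in zip(actual, pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= a * math.log(p) + (1 - a) * math.log(1.0 - p)
    return total / len(actual)


def roc_auc(actual, pred):
    # Rank-based AUC: rank the scores, sum the ranks of the positives.
    order = sorted(range(len(pred)), key=pred.__getitem__)
    rank_sum = sum(r for r, i in enumerate(order, start=1) if actual[i] == 1)
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical data: random labels, uninformative uniform predictions.
rng = random.Random(0)
n = 50_000
actual = [rng.randint(0, 1) for _ in range(n)]
predicted = [rng.random() for _ in range(n)]

print(l2_loss(actual, predicted), bce(actual, predicted), roc_auc(actual, predicted))
```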

To avoid the "Chunked array is not contiguous" error, try rechunking your dataframe.

The package right now contains two extensions:

Numeric Extension

Existing Features

  1. GCD, LCM for integers.
  2. Harmonic mean, geometric mean, and other common, simple metrics used in industry.
  3. Common loss functions, e.g. L1, L2, L infinity, Huber loss, MAPE, SMAPE, wMAPE, etc.
  4. Common mini-models: lstsq, conditional entropy.
  5. Discrete Fourier Transform, returning the real and imaginary parts of the new series.
  6. ROC AUC, precision, recall, F, average precision, all as expressions.
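To make the "real and imaginary parts" wording in item 5 concrete, here is a naive O(n^2) Discrete Fourier Transform in plain Python. It only illustrates the definition; the function name dft is hypothetical, and the package itself presumably uses a proper FFT on the Rust side.

```python
import cmath


def dft(xs):
    """Naive DFT; returns (real parts, imaginary parts) as two lists."""
    n = len(xs)
    out = []
    for k in range(n):
        # X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/n)
        out.append(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                       for t, x in enumerate(xs)))
    return [z.real for z in out], [z.imag for z in out]


# A unit impulse has a flat spectrum; a constant signal has only a k=0 term.
impulse_real, impulse_imag = dft([1.0, 0.0, 0.0, 0.0])
const_real, const_imag = dft([1.0, 1.0, 1.0, 1.0])
print(impulse_real, impulse_imag)
print(const_real, const_imag)
```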

String Extension

Existing Features

  1. Levenshtein distance + similarity, Hamming distance, Jaro similarity, string Jaccard similarity, Sørensen-Dice similarity, overlap coefficient.
  2. Simple tokenization, Snowball stemming.
  3. Frequency-based merging, inference, and removal.
  4. Aho-Corasick matching, replacing multiple patterns.

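For reference, two of the distances in item 1 can be written in a few lines of plain Python. This is a sketch of the textbook definitions, not the package's code; in particular, the similarity normalization used here (1 - distance / length of the longer string) is one common convention and is an assumption about what "similarity" means.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(s, t):
    # Assumed normalization: 1 - distance / length of the longer string.
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


print(levenshtein("kitten", "sitting"), hamming("karolin", "kathrin"))
```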
Plans?

  1. Some more string similarity measures, like: https://www.postgresql.org/docs/9.1/pgtrgm.html

Other Extensions?

More stats, clustering, etc. It is simply a matter of willingness and market demand.

Future Plans

I am open to making this package a Python frontend for other machine learning processes/models with Rust packages at the backend. There are some very interesting packages to incorporate, such as k-medoids. But I do want to stick with Faer as the Rust linear algebra backend, and I do want to keep it simple for now.

Right now most string similarity/distance functions depend on the strsim crate, which is no longer maintained and has some very old code. The current plan is to keep it for now and maybe replace it with higher-performance code later (if there is a need to do so).

Credits

  1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See here
  2. Some statistics functions are taken from Statrs (MIT). See here
