Polars Extension for General Data Science Use

Currently in Alpha. Feel free to submit feature requests in the issues section of the repo.

The goal for this package is to provide data scientists/analysts/engineers/quants more tools to manipulate, transform, and make sense of data, without the need to leave DataFrame land (aka Wonderland).

This package will also serve as a lower-level backend for another package of mine called dsds. See here. It will change how many functions in dsds work.

Performance is a focus, but sometimes it is impossible to beat NumPy/SciPy performance for a single operation on a single array. There can be many reasons: interop cost (sometimes copies are needed), null checks, lack of support for complex numbers (e.g. the FFT implementation has to make multiple copies), or simply that we haven't yet found the most optimized way to write some algorithm.

However, there are greater benefits for staying in DataFrame land:

  1. It works with the Polars expression engine, so more expressions can be executed in parallel. E.g. running an FFT on a single series may be slower than NumPy, but if you run an FFT together with other non-trivial operations, the story changes completely.
  2. It works in a group_by context. E.g. you can run multiple linear regressions in parallel in a group_by context.
  3. Staying in DataFrame land typically keeps code cleaner and less confusing.

Some examples:

df.group_by("dummy").agg(
    pl.col("y").num_ext.lstsq(pl.col("a"), pl.col("b"), add_bias=True).alias("list_float")
)

shape: (2, 2)
┌───────┬─────────────┐
│ dummy ┆ list_float  │
│ ---   ┆ ---         │
│ str   ┆ list[f64]   │
╞═══════╪═════════════╡
│ b     ┆ [2.0, -1.0] │
│ a     ┆ [2.0, -1.0] │
└───────┴─────────────┘
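
For readers unfamiliar with the underlying computation, here is a minimal pure-Python sketch of ordinary least squares with an added bias (intercept) column, solved through the normal equations. It is not the package's implementation; the function name lstsq_with_bias, the sample data, and the choice to return the bias as the last coefficient are all assumptions made for illustration.

```python
def lstsq_with_bias(y, *features):
    """Least squares fit of y on the given feature columns plus a constant column.

    Returns one coefficient per feature, followed by the bias term.
    Illustrative only: solves the normal equations (X^T X) w = X^T y.
    """
    n = len(y)
    # Design matrix rows: one value per feature, then 1.0 for the bias.
    rows = [[f[i] for f in features] + [1.0] for i in range(n)]
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("feature columns are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical data where y = 2*a - b exactly, so the fit should recover
# coefficients close to 2 and -1 with a bias close to 0.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [2.0 * ai - bi for ai, bi in zip(a, b)]
coefficients = lstsq_with_bias(y, a, b)
```

In a group_by context the same fit would be computed once per group, which is what allows the regressions to run in parallel.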

df.group_by("dummy_groups").agg(
    pl.col("actual").num_ext.l2_loss(pl.col("predicted")).alias("l2"),
    pl.col("actual").num_ext.bce(pl.col("predicted")).alias("log loss"),
    pl.col("actual").num_ext.roc_auc(pl.col("predicted")).alias("roc_auc")
)

shape: (2, 4)
┌──────────────┬──────────┬──────────┬──────────┐
│ dummy_groups ┆ l2       ┆ log loss ┆ roc_auc  │
│ ---          ┆ ---      ┆ ---      ┆ ---      │
│ str          ┆ f64      ┆ f64      ┆ f64      │
╞══════════════╪══════════╪══════════╪══════════╡
│ b            ┆ 0.333887 ┆ 0.999602 ┆ 0.498913 │
│ a            ┆ 0.332575 ┆ 0.997049 ┆ 0.501997 │
└──────────────┴──────────┴──────────┴──────────┘
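
As a reference for what these three metrics usually mean, the sketch below gives conventional pure-Python definitions: mean squared error for L2 loss, mean binary cross-entropy for log loss, and the pairwise-ranking form of ROC AUC. These are textbook definitions and an assumption about the package's conventions (for example, whether the L2 loss is averaged), not its actual code.

```python
import math


def l2_loss(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def bce(actual, predicted, eps=1e-15):
    """Mean binary cross-entropy (log loss); predictions are clipped away from 0 and 1."""
    total = 0.0
    for a, p in zip(actual, predicted):
        p = min(max(p, eps), 1.0 - eps)
        total += a * math.log(p) + (1 - a) * math.log(1.0 - p)
    return -total / len(actual)


def roc_auc(actual, predicted):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [p for a, p in zip(actual, predicted) if a == 1]
    neg = [p for a, p in zip(actual, predicted) if a == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores.
actual = [1, 0, 1, 0]
predicted = [0.9, 0.1, 0.8, 0.3]
l2 = l2_loss(actual, predicted)
log_loss = bce(actual, predicted)
auc = roc_auc(actual, predicted)
```

Under these definitions, uniformly random predictions scored against random binary labels give values near 1/3, 1, and 0.5, which is in line with the output shown above, although that alone does not confirm the package's exact conventions.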

To avoid the "Chunked array is not contiguous" error, try rechunking your DataFrame.

The package currently contains two extensions:

Numeric Extension

Existing Features

  1. GCD, LCM for integers
  2. Harmonic mean, geometric mean, and other common, simple metrics used in industry.
  3. Common loss functions, e.g. L1, L2, L-infinity, Huber loss, MAPE, SMAPE, wMAPE, etc.
  4. Common mini-models, e.g. lstsq and conditional entropy.
  5. Discrete Fourier Transform, returning the real and imaginary parts of the new series.
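
To make the last item concrete, here is a naive O(n^2) discrete Fourier transform in pure Python that returns the real and imaginary parts separately. It is a sketch of the mathematical definition only; the function name dft and the sign and normalization convention (unnormalized forward transform with a negative exponent, as in NumPy's default) are assumptions, and the package itself presumably uses a fast algorithm.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform.

    Returns two lists: the real parts and the imaginary parts of the
    transformed series. Intended as a reference, not for performance.
    """
    n = len(values)
    real, imag = [], []
    for k in range(n):
        total = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(values)
        )
        real.append(total.real)
        imag.append(total.imag)
    return real, imag


# A constant signal concentrates all of its energy in the zero-frequency bin.
re_part, im_part = dft([1.0, 1.0, 1.0, 1.0])
```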

String Extension

Existing Features

  1. Levenshtein distance, Hamming distance, string Jaccard similarity
  2. Simple tokenization
  3. Stemming (currently only the Snowball stemmer for English)
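
For reference, the three string measures in the first item can be sketched in pure Python as below. These are the standard definitions; in particular, computing Jaccard similarity over sets of single characters is an assumption, and the package may define its sets differently (for example, over substrings).

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(cs != ct for cs, ct in zip(s, t))


def jaccard(s, t):
    """Size of the intersection over size of the union of the character sets."""
    a, b = set(s), set(t)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


lev = levenshtein("kitten", "sitting")
ham = hamming("karolin", "kathrin")
jac = jaccard("abc", "bcd")
```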

Todo list

  1. Longest common subsequence as string distance metric
  2. Vectorizers (Count + TFIDF)?
  3. Similarity version of the distances, and more variations and parameters.
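
As an illustration of the first todo item, the sketch below computes the length of the longest common subsequence with dynamic programming and derives one common distance from it. The function names and the particular distance (total length minus twice the LCS length) are assumptions, not a statement of what the package will implement.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t."""
    prev = [0] * (len(t) + 1)
    for cs in s:
        curr = [0]
        for j, ct in enumerate(t, start=1):
            if cs == ct:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(s, t):
    """Number of characters outside the LCS across both strings."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


common = lcs_length("AGGTAB", "GXTXAYB")
distance = lcs_distance("AGGTAB", "GXTXAYB")
```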

Other Extensions ?

E.g. stats_ext or dist_ext (L^p distance for vectors; the scalar version is already implemented).

Simple unsupervised clustering could also be added. It is simply a matter of willingness and market demand.

Disclaimer

Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See here

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

polars_ds-0.1.1.tar.gz (50.3 kB): Source

Built Distributions

polars_ds-0.1.1-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB): PyPy, manylinux (glibc 2.17+), x86-64
polars_ds-0.1.1-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): PyPy, manylinux (glibc 2.17+), ARM64
polars_ds-0.1.1-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB): PyPy, manylinux (glibc 2.17+), x86-64
polars_ds-0.1.1-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): PyPy, manylinux (glibc 2.17+), ARM64
polars_ds-0.1.1-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB): PyPy, manylinux (glibc 2.17+), x86-64
polars_ds-0.1.1-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): PyPy, manylinux (glibc 2.17+), ARM64
polars_ds-0.1.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): CPython 3.13, manylinux (glibc 2.17+), ARM64
polars_ds-0.1.1-cp312-none-win32.whl (8.2 MB): CPython 3.12, Windows x86
polars_ds-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB): CPython 3.12, manylinux (glibc 2.17+), x86-64
polars_ds-0.1.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): CPython 3.12, manylinux (glibc 2.17+), ARM64
polars_ds-0.1.1-cp312-cp312-macosx_10_12_x86_64.whl (9.4 MB): CPython 3.12, macOS 10.12+, x86-64
polars_ds-0.1.1-cp311-none-win32.whl (8.2 MB): CPython 3.11, Windows x86
polars_ds-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB): CPython 3.11, manylinux (glibc 2.17+), x86-64
polars_ds-0.1.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): CPython 3.11, manylinux (glibc 2.17+), ARM64
polars_ds-0.1.1-cp311-cp311-macosx_10_12_x86_64.whl (9.4 MB): CPython 3.11, macOS 10.12+, x86-64
polars_ds-0.1.1-cp310-none-win32.whl (8.2 MB): CPython 3.10, Windows x86
polars_ds-0.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB): CPython 3.10, manylinux (glibc 2.17+), x86-64
polars_ds-0.1.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): CPython 3.10, manylinux (glibc 2.17+), ARM64
polars_ds-0.1.1-cp310-cp310-macosx_10_12_x86_64.whl (9.4 MB): CPython 3.10, macOS 10.12+, x86-64
polars_ds-0.1.1-cp39-none-win32.whl (8.2 MB): CPython 3.9, Windows x86
polars_ds-0.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB): CPython 3.9, manylinux (glibc 2.17+), x86-64
polars_ds-0.1.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): CPython 3.9, manylinux (glibc 2.17+), ARM64
polars_ds-0.1.1-cp38-none-win32.whl (8.2 MB): CPython 3.8, Windows x86
polars_ds-0.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB): CPython 3.8, manylinux (glibc 2.17+), x86-64
polars_ds-0.1.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.2 MB): CPython 3.8, manylinux (glibc 2.17+), ARM64