Polars Extension for General Data Science Use

A Polars plugin that simplifies common numerical and string data analysis procedures. Basic data science, statistics, and NLP tasks can be done natively inside a dataframe, which minimizes the number of dependencies.

Its goal is not to replace SciPy or NumPy, but to improve runtime for common tasks and to reduce the amount of Python code and UDFs.

See examples here.

Read the docs here.

Currently in Beta. Feel free to submit feature requests in the issues section of the repo.

Disclaimer: this plugin is not tested with streaming mode.

Getting Started

pip install polars_ds

and

import polars as pl
import polars_ds as pld

when you want to use the namespaces provided by the package.

Examples

In-dataframe statistical testing

df.select(
    pl.col("group1").stats.ttest_ind(pl.col("group2"), equal_var = True).alias("t-test"),
    pl.col("category_1").stats.chi2(pl.col("category_2")).alias("chi2-test"),
    pl.col("category_1").stats.f_test(pl.col("group1")).alias("f-test")
)

shape: (1, 3)
┌───────────────────┬──────────────────────┬────────────────────┐
│ t-test            ┆ chi2-test            ┆ f-test             │
│ ---               ┆ ---                  ┆ ---                │
│ struct[2]         ┆ struct[2]            ┆ struct[2]          │
╞═══════════════════╪══════════════════════╪════════════════════╡
│ {-0.004,0.996809} ┆ {37.823816,0.386001} ┆ {1.354524,0.24719} │
└───────────────────┴──────────────────────┴────────────────────┘
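
For readers who want to see what the first struct field represents, here is a minimal pure-Python sketch of the two-sample t statistic with pooled variance, which is what `equal_var = True` conventionally means. This is not the plugin's implementation; the helper name `ttest_ind_equal_var` and the sample data are hypothetical, and the p-value (the second struct field) is left out because it needs a t-distribution CDF.

```python
import math
from statistics import mean, variance


def ttest_ind_equal_var(x, y):
    """Student's two-sample t statistic using a pooled variance estimate."""
    nx, ny = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances (ddof = 1).
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    # Standard error of the difference between the two means.
    std_err = math.sqrt(pooled * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / std_err


# Hypothetical data, not taken from the example dataframe above.
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat = ttest_ind_equal_var(group1, group2)
print(round(t_stat, 6))  # -1.0
```

A statistic close to zero, as in the output above, means the two group means are nearly identical relative to their spread.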

Generating random numbers according to a reference column

df.with_columns(
    # Sample from a normal distribution, using the mean and std of reference column "a"
    pl.col("a").stats.sample_normal().alias("test1"),
    # Sample from a uniform distribution with low = 0 and high = the max of "a", respecting the nulls in "a"
    pl.col("a").stats.sample_uniform(low = 0., high = None, respect_null=True).alias("test2"),
).head()

shape: (5, 3)
┌───────────┬───────────┬──────────┐
│ a         ┆ test1     ┆ test2    │
│ ---       ┆ ---       ┆ ---      │
│ f64       ┆ f64       ┆ f64      │
╞═══════════╪═══════════╪══════════╡
│ null      ┆ 0.459357  ┆ null     │
│ null      ┆ 0.038007  ┆ null     │
│ -0.826518 ┆ 0.241963  ┆ 0.968385 │
│ 0.737955  ┆ -0.819475 ┆ 2.429615 │
│ 1.10397   ┆ -0.684289 ┆ 2.483368 │
└───────────┴───────────┴──────────┘
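
The idea of sampling "according to a reference column" can be sketched in plain Python as follows. This is only an illustration of the behavior described in the comments above, not the plugin's code: the helper names, the `seed` parameter, and the use of the sample (rather than population) standard deviation are all assumptions.

```python
import random
from statistics import mean, stdev


def sample_normal_like(ref, seed=None):
    """Draw one normal sample per row, using the mean/std of the non-null values in ref."""
    rng = random.Random(seed)
    observed = [v for v in ref if v is not None]
    mu, sigma = mean(observed), stdev(observed)  # sample std is an assumption
    return [rng.gauss(mu, sigma) for _ in ref]


def sample_uniform_like(ref, low=0.0, high=None, respect_null=True, seed=None):
    """Draw one uniform sample per row; high=None means 'use the max of ref'."""
    rng = random.Random(seed)
    observed = [v for v in ref if v is not None]
    upper = max(observed) if high is None else high
    # When respect_null is set, rows that are null in ref stay null in the output.
    return [
        None if (respect_null and v is None) else rng.uniform(low, upper)
        for v in ref
    ]


a = [None, None, -0.826518, 0.737955, 1.10397]
test1 = sample_normal_like(a, seed=0)
test2 = sample_uniform_like(a, low=0.0, high=None, respect_null=True, seed=0)
print(test1)
print(test2)
```

As in the output above, the normal sample fills every row, while the uniform sample keeps nulls wherever the reference column is null.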

Blazingly fast string similarity comparisons. (Thanks to RapidFuzz)

df.select(
    pl.col("word").str2.levenshtein("asasasa", return_sim=True).alias("asasasa"),
    pl.col("word").str2.levenshtein("sasaaasss", return_sim=True).alias("sasaaasss"),
    pl.col("word").str2.levenshtein("asdasadadfa", return_sim=True).alias("asdasadadfa"),
    pl.col("word").str2.fuzz("apples").alias("LCS based Fuzz match - apples"),
    pl.col("word").str2.osa("apples", return_sim = True).alias("Optimal String Alignment - apples"),
    pl.col("word").str2.jw("apples").alias("Jaro-Winkler - apples"),
)
shape: (5, 6)
┌──────────┬───────────┬─────────────┬────────────────┬───────────────────────────┬────────────────┐
│ asasasa  ┆ sasaaasss ┆ asdasadadfa ┆ LCS based Fuzz ┆ Optimal String Alignment  ┆ Jaro-Winkler - │
│ ---      ┆ ---       ┆ ---         ┆ match - apples ┆ - apple…                  ┆ apples         │
│ f64      ┆ f64       ┆ f64         ┆ ---            ┆ ---                       ┆ ---            │
│          ┆           ┆             ┆ f64            ┆ f64                       ┆ f64            │
╞══════════╪═══════════╪═════════════╪════════════════╪═══════════════════════════╪════════════════╡
│ 0.142857 ┆ 0.111111  ┆ 0.090909    ┆ 0.833333       ┆ 0.833333                  ┆ 0.966667       │
│ 0.428571 ┆ 0.333333  ┆ 0.272727    ┆ 0.166667       ┆ 0.0                       ┆ 0.444444       │
│ 0.111111 ┆ 0.111111  ┆ 0.090909    ┆ 0.555556       ┆ 0.444444                  ┆ 0.5            │
│ 0.875    ┆ 0.666667  ┆ 0.545455    ┆ 0.25           ┆ 0.25                      ┆ 0.527778       │
│ 0.75     ┆ 0.777778  ┆ 0.454545    ┆ 0.25           ┆ 0.25                      ┆ 0.527778       │
└──────────┴───────────┴─────────────┴────────────────┴───────────────────────────┴────────────────┘
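
To make the similarity scores above easier to interpret, here is a small pure-Python sketch of Levenshtein distance and a normalized similarity. It assumes the similarity is `1 - distance / max(len(a), len(b))`, which matches the fractions in the first three columns (for example 0.875 = 7/8) but is not confirmed by this README. The plugin delegates to RapidFuzz in Rust, so this is an illustration only, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Similarity in [0, 1]; the normalization by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein_sim("apples", "apples"))  # 1.0
```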

Even in-dataframe nearest neighbors queries! 😲

df.with_columns(
    pl.col("id").num.knn_ptwise(
        pl.col("val1"), pl.col("val2"), 
        k = 3, dist = "haversine", parallel = True
    ).alias("nearest neighbor ids")
)

shape: (5, 6)
┌─────┬──────────┬──────────┬──────────┬──────────┬──────────────────────┐
│ id  ┆ val1     ┆ val2     ┆ val3     ┆ val4     ┆ nearest neighbor ids │
│ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---                  │
│ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64      ┆ list[u64]            │
╞═════╪══════════╪══════════╪══════════╪══════════╪══════════════════════╡
│ 0   ┆ 0.804226 ┆ 0.937055 ┆ 0.401005 ┆ 0.119566 ┆ [0, 3, … 0]          │
│ 1   ┆ 0.526691 ┆ 0.562369 ┆ 0.061444 ┆ 0.520291 ┆ [1, 4, … 4]          │
│ 2   ┆ 0.225055 ┆ 0.080344 ┆ 0.425962 ┆ 0.924262 ┆ [2, 1, … 1]          │
│ 3   ┆ 0.697264 ┆ 0.112253 ┆ 0.666238 ┆ 0.45823  ┆ [3, 1, … 0]          │
│ 4   ┆ 0.227807 ┆ 0.734995 ┆ 0.225657 ┆ 0.668077 ┆ [4, 4, … 0]          │
└─────┴──────────┴──────────┴──────────┴──────────┴──────────────────────┘
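
Conceptually, a point-wise nearest neighbors query with `dist = "haversine"` can be sketched as the brute-force search below. This is a rough illustration, not the plugin's algorithm. It assumes the two value columns are latitude and longitude in degrees, that distances are in kilometers, and that each point is returned as its own first neighbor (as the output above suggests); the function names are hypothetical.

```python
import math


def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, h)))


def knn_ptwise_sketch(ids, lats, lons, k):
    """For every point, return its own id followed by the ids of its k nearest neighbors."""
    n = len(ids)
    result = []
    for i in range(n):
        # Brute force: rank all points by distance to point i (O(n^2) overall).
        order = sorted(range(n), key=lambda j: haversine(lats[i], lons[i], lats[j], lons[j]))
        result.append([ids[j] for j in order[: k + 1]])
    return result


ids = [0, 1, 2, 3]
lats = [0.0, 0.0, 0.0, 10.0]
lons = [0.0, 1.0, 3.0, 0.0]
neighbors = knn_ptwise_sketch(ids, lats, lons, k=1)
print(neighbors)  # [[0, 1], [1, 0], [2, 1], [3, 0]]
```

A brute-force scan is fine for a handful of rows; for large dataframes a spatial index is the usual choice, which is one reason to prefer a compiled implementation.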

And a lot more!

Credits

  1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See here
  2. Some statistics functions are taken from Statrs (MIT). See here

Other related Projects

  1. Take a look at our friendly neighbor functime
  2. My other project dsds. This is currently paused because I am developing polars-ds, but some modules in DSDS, such as the diagnosis one, are quite stable.
  3. String similarity metrics are so fast and easy to use because of RapidFuzz
