
Polars for Data Science

Discord | Documentation | User Guide | Want to Contribute?
pip install polars-ds

PDS (polars_ds)

PDS is a modern data science package that

  1. is fast and furious
  2. is small and lean, with minimal dependencies
  3. has an intuitive and concise API (if you already know Polars)
  4. has a dataframe-friendly design
  5. covers a wide variety of data science topics, such as simple statistics, linear regression, string edit distances, tabular data transforms, feature extraction, traditional modelling pipelines, and model evaluation metrics

It stands on the shoulders of the great Polars dataframe. You can find examples in the repo. Here are some highlights!

Parallel ML Metrics Calculation

import polars as pl
import polars_ds as pds
# Parallel evaluation of multiple ML metrics on different segments of data
df.lazy().group_by("segments").agg( 
    # any other metrics you want in here
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()

shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ segments ┆ roc_auc  ┆ log_loss │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ a        ┆ 0.497745 ┆ 1.006438 │
│ b        ┆ 0.498801 ┆ 0.997226 │
└──────────┴──────────┴──────────┘

Quick and simple modeling on the fly (non-persistent)

E.g. running a quick linear regression and seeing the predictions and residuals:

df.select(pds.lin_reg(pl.col("x1"), pl.col("x2"), target=pl.col("y"), add_bias=False, return_pred=True))

shape: (10_000, 1)
┌───────────────────────┐
│ lr_pred               │
│ ---                   │
│ struct[2]             │
╞═══════════════════════╡
│ {-0.3121,0.392769}    │
│ {-0.459507,-0.048989} │
│ {-0.469473,-0.215709} │
│ {-0.243764,-0.707016} │
│ {-0.511278,-0.785299} │
│ …                     │
└───────────────────────┘

Generating polynomial features and displaying a statsmodels-like regression summary:

import polars_ds as pds
from polars_ds.pipeline.transforms import polynomial_features
# If you want the underlying computation to be done in f32, set pds.config.LIN_REG_EXPR_F64 = False
df.select(
    pds.lin_reg_report(
        *(
            ["x1", "x2", "x3"] +
            polynomial_features(["x1", "x2", "x3"], degree = 2, interaction_only=True)
        )
        , target = pl.col("target")
        , add_bias = False
    ).alias("result")
).unnest("result")

┌──────────┬───────────┬──────────┬─────────────┬───────┬───────────┬───────────┬──────────┬──────────┐
│ features ┆ beta      ┆ std_err  ┆ t           ┆ p>|t| ┆ 0.025     ┆ 0.975     ┆ r2       ┆ adj_r2   │
│ ---      ┆ ---       ┆ ---      ┆ ---         ┆ ---   ┆ ---       ┆ ---       ┆ ---      ┆ ---      │
│ str      ┆ f64       ┆ f64      ┆ f64         ┆ f64   ┆ f64       ┆ f64       ┆ f64      ┆ f64      │
╞══════════╪═══════════╪══════════╪═════════════╪═══════╪═══════════╪═══════════╪══════════╪══════════╡
│ x1       ┆ 0.26332   ┆ 0.000315 ┆ 835.686778  ┆ 0.0   ┆ 0.262703  ┆ 0.263938  ┆ 0.971087 ┆ 0.971085 │
│ x2       ┆ 0.413824  ┆ 0.000311 ┆ 1331.988332 ┆ 0.0   ┆ 0.413216  ┆ 0.414433  ┆ 0.971087 ┆ 0.971085 │
│ x3       ┆ 0.113688  ┆ 0.000315 ┆ 361.29924   ┆ 0.0   ┆ 0.113072  ┆ 0.114305  ┆ 0.971087 ┆ 0.971085 │
│ x1*x2    ┆ -0.097272 ┆ 0.000543 ┆ -179.037776 ┆ 0.0   ┆ -0.098337 ┆ -0.096207 ┆ 0.971087 ┆ 0.971085 │
│ x1*x3    ┆ -0.097266 ┆ 0.000542 ┆ -179.448632 ┆ 0.0   ┆ -0.098329 ┆ -0.096204 ┆ 0.971087 ┆ 0.971085 │
│ x2*x3    ┆ -0.097987 ┆ 0.000542 ┆ -180.75796  ┆ 0.0   ┆ -0.099049 ┆ -0.096924 ┆ 0.971087 ┆ 0.971085 │
└──────────┴───────────┴──────────┴─────────────┴───────┴───────────┴───────────┴──────────┴──────────┘

Other available simple models (non-persistent):

  • Normal linear regression (pds.lin_reg)
  • Lasso, Ridge, Elastic Net (pds.lin_reg, use the l1_reg, l2_reg arguments)
  • Rolling linear regression with skipping (pds.rolling_lin_reg)
  • Recursive linear regression (pds.recursive_lin_reg)
  • Non-negative linear regression (pds.lin_reg, set positive = True)
  • Statsmodels-like linear regression table (pds.lin_reg_report)
  • f32 support (pds.Config.LIN_REG_EXPR_F64 = False)
  • Binary logistic regression with L1, L2 parameters (pds.logistic_reg; doesn't work with f32 yet)

Distances

Various string distances:

df.select( # Column "word", compared to string in pl.lit(). It also supports column vs column comparison
    pds.str_leven("word", pl.lit("asasasa"), return_sim=True).alias("Levenshtein"),
    pds.str_osa("word", pl.lit("apples"), return_sim=True).alias("Optimal String Alignment"),
    pds.str_jw("word", pl.lit("apples")).alias("Jaro-Winkler"),
)
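As a reference for what the Levenshtein column computes, here is a plain-Python sketch of the edit distance, plus a similarity derived as 1 - distance / max length, which is one common normalization (PDS's exact return_sim normalization may differ):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic programming edit distance; prev holds the previous DP row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                  # deletion
                cur[j - 1] + 1,               # insertion
                prev[j - 1] + (ca != cb),     # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]

def leven_sim(a: str, b: str) -> float:
    # Similarity in [0, 1] derived from the distance
    m = max(len(a), len(b))
    return 1.0 if m == 0 else 1.0 - levenshtein(a, b) / m

print(levenshtein("kitten", "sitting"))  # 3
```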

Array, list distances:

df = pl.DataFrame({
    "x": [[1,2,3], [4,5,6]]
    , "y": [[0.5, 0.2, 0.3], [4.0, 5.0, 6.1]]
})

df.select(
    x = pl.col('x').cast(pl.Array(inner=pl.Float64, shape=3))
    , y = pl.col('y').cast(pl.Array(inner=pl.Float64, shape=3))
).lazy().select(
    pds.arr_sql2_dist('x', 'y')
).collect()

shape: (2, 1)
┌───────┐
│ x     │
│ ---   │
│ f64   │
╞═══════╡
│ 10.78 │
│ 0.01  │
└───────┘

For list columns, replace arr_sql2_dist with list_sql2_dist. Note: sql2 stands for squared l2 distance, which is the same as squared Euclidean distance.
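The sql2 values above are easy to check by hand: subtract element-wise, square, and sum. A plain-Python sketch:

```python
def sql2(a, b):
    # Squared l2 (squared Euclidean) distance between two equal-length vectors
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

print(round(sql2([1, 2, 3], [0.5, 0.2, 0.3]), 2))  # 10.78, matches the first row above
print(round(sql2([4, 5, 6], [4.0, 5.0, 6.1]), 2))  # 0.01, matches the second row
```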

In-dataframe statistical tests

df.group_by("market_id").agg(
    pds.ttest_ind("var1", "var2", equal_var=False).alias("t-test"),
    pds.chi2("category_1", "category_2").alias("chi2-test"),
    pds.f_test("var1", group = "category_1").alias("f-test")
)

shape: (3, 4)
┌───────────┬──────────────────────┬──────────────────────┬─────────────────────┐
│ market_id ┆ t-test               ┆ chi2-test            ┆ f-test              │
│ ---       ┆ ---                  ┆ ---                  ┆ ---                 │
│ i64       ┆ struct[2]            ┆ struct[2]            ┆ struct[2]           │
╞═══════════╪══════════════════════╪══════════════════════╪═════════════════════╡
│ 0         ┆ {2.072749,0.038272}  ┆ {33.487634,0.588673} ┆ {0.312367,0.869842} │
│ 1         ┆ {0.469946,0.638424}  ┆ {42.672477,0.206119} ┆ {2.148937,0.072536} │
│ 2         ┆ {-1.175325,0.239949} ┆ {28.55723,0.806758}  ┆ {0.506678,0.730849} │
└───────────┴──────────────────────┴──────────────────────┴─────────────────────┘
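Each struct holds the test statistic and its p-value. As a sanity reference for the equal_var=False case (Welch's t-test), the statistic itself is simple to compute by hand; a minimal plain-Python sketch (statistic only, no p-value):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    # Welch's t statistic: difference of means over the
    # unequal-variance (unpooled) standard error
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

print(round(welch_t([2.0, 4.0], [1.0, 3.0]), 4))  # 0.7071
```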

Making Polars More Convenient

import polars_ds as pds
df = pl.DataFrame({
    "group": ['A', 'A', 'B', 'B', 'A']
    , "a": [1, 2, 3, 4, 5]
    , "b": [4, 1, 99, 12, 33]
})
df.group_by("group").agg(
    *pds.E(['a', 'b'], ["min", "max", "n_unique", "len"])
)

shape: (2, 8)
┌───────┬───────┬───────┬───────┬───────┬────────────┬────────────┬─────────┐
│ group ┆ a_min ┆ b_min ┆ a_max ┆ b_max ┆ a_n_unique ┆ b_n_unique ┆ __len__ │
│ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---        ┆ ---        ┆ ---     │
│ str   ┆ i64   ┆ i64   ┆ i64   ┆ i64   ┆ u32        ┆ u32        ┆ u32     │
╞═══════╪═══════╪═══════╪═══════╪═══════╪════════════╪════════════╪═════════╡
│ A     ┆ 1     ┆ 1     ┆ 5     ┆ 33    ┆ 3          ┆ 3          ┆ 3       │
│ B     ┆ 3     ┆ 12    ┆ 4     ┆ 99    ┆ 2          ┆ 2          ┆ 2       │
└───────┴───────┴───────┴───────┴───────┴────────────┴────────────┴─────────┘

Streamable Tabular Machine Learning Data Transformation Pipeline

See SKLEARN_COMPATIBILITY for more details.

import polars as pl
import polars.selectors as cs
from polars_ds.pipeline import Pipeline, Blueprint

bp = (
    Blueprint(df, name = "example", target = "approved", lowercase=True) # lowercase=True optionally lowercases all column names
    .filter(pl.col("city_category").is_not_null())
    .linear_impute(features = ["var1", "existing_emi"], target = "loan_period") 
    .impute(["existing_emi"], method = "median")
    .append_expr( # generate some features
        pl.col("existing_emi").log1p().alias("existing_emi_log1p"),
        pl.col("loan_amount").log1p().alias("loan_amount_log1p"),
        pl.col("loan_amount").clip(lower_bound = 0, upper_bound = 1000).alias("loan_amount_clipped"),
        pl.col("loan_amount").sqrt().alias("loan_amount_sqrt"),
        pl.col("loan_amount").shift(-1).alias("loan_amount_lead_1") # shift(-1) is a lead transform
    )
    .scale( # target is numerical, but will be excluded automatically because bp is initialized with a target
        cs.numeric().exclude(["var1", "existing_emi_log1p"]), method = "standard"
    ) # Scale the columns up to this point. The columns below won't be scaled
    .append_expr(
        # Add missing flags
        pl.col("employer_category1").is_null().cast(pl.UInt8).alias("employer_category1_is_missing")
    )
    .one_hot_encode("gender", drop_first=True)
    .woe_encode("city_category") # No need to specify target because we initialized bp with a target
    .target_encode("employer_category1", min_samples_leaf = 20, smoothing = 10.0) # same as above
)

print(bp)

pipe:Pipeline = bp.materialize()
# Check out the result in our example notebooks! (examples/pipeline.ipynb)
df_transformed = pipe.transform(df)
df_transformed.head()

Since Polars >= 1.34 supports collect_batches(), you can also use it to perform batched machine learning:

for df_batch in pipe.transform(df, return_lazy=True).collect_batches():
    X_batch, y_batch = your_function_to_turn_df_batch_into_model_inputs(df_batch)
    ml_model.update(X_batch, y_batch)

See pipeline examples for more details and caveats.

Nearest Neighbors Related Queries

Get all neighbors within radius r, call them best friends, and count them. Due to implementation limitations, this currently doesn't preserve the index, and is not fast when k or the dimension of the data is large.

df.select(
    pl.col("id"),
    pds.query_radius_ptwise(
        pl.col("var1"), pl.col("var2"), pl.col("var3"), # Columns used as the coordinates in 3d space
        index = pl.col("id"),
        r = 0.1, 
        dist = "sql2", # squared l2
        parallel = True
    ).alias("best friends"),
).with_columns( # -1 to remove the point itself
    (pl.col("best friends").list.len() - 1).alias("best friends count")
).head()

shape: (5, 3)
┌─────┬───────────────────┬────────────────────┐
│ id  ┆ best friends      ┆ best friends count │
│ --- ┆ ---               ┆ ---                │
│ u32 ┆ list[u32]         ┆ u32                │
╞═════╪═══════════════════╪════════════════════╡
│ 0   ┆ [0, 811, … 1435]  ┆ 152                │
│ 1   ┆ [1, 953, … 1723]  ┆ 159                │
│ 2   ┆ [2, 355, … 835]   ┆ 243                │
│ 3   ┆ [3, 102, … 1129]  ┆ 110                │
│ 4   ┆ [4, 1280, … 1543] ┆ 226                │
└─────┴───────────────────┴────────────────────┘
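Conceptually, query_radius_ptwise does the following (PDS uses an efficient spatial index internally; this brute-force plain-Python sketch only pins down the semantics of r with dist = "sql2"):

```python
def radius_query(points, r):
    # For each point, collect indices of all points within squared-l2 distance r
    # (each point is its own neighbor, hence the -1 in the count above)
    return [
        [j for j, q in enumerate(points) if sum((a - b) ** 2 for a, b in zip(p, q)) <= r]
        for p in points
    ]

pts = [(0.0, 0.0, 0.0), (0.2, 0.1, 0.0), (1.0, 1.0, 1.0)]
print(radius_query(pts, r=0.1))  # [[0, 1], [0, 1], [2]]
```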

Compatibility

Under some mild assumptions (e.g. columns implement to_numpy()), PDS works with other eager dataframes. For example, with Pandas:

from polars_ds.compat import compat as pds2

df_pd["linear_regression_result"] = pds2.lin_reg(
    df_pd["x1"], df_pd["x2"], df_pd["x3"],
    target = df_pd["y"],
    return_pred = True
)
df_pd

The magic here is the compat module and the fact that most eager dataframes implement the array protocol.

Other

Other common numerical functions such as: pds.convolve, pds.query_r2, pds.principal_components, etc. See our docs for more information.

Getting Started

import polars_ds as pds

To make full use of the Diagnosis module, do

pip install "polars_ds[plot]"

How Fast is it?

Feel free to take a look at our benchmark notebook!

Generally speaking, the more expressions you evaluate simultaneously, the bigger the speed advantage of Polars + PDS over Pandas + (SciPy / scikit-learn / NumPy). The more CPU cores your machine has, the bigger the time difference in favor of Polars + PDS.

HELP WANTED!

  1. Documentation writing, testing, benchmarking, etc.

Road Map

  1. K-means, K-medoids clustering as expressions and also standalone modules.
  2. Other improvement items. See issues.

Minimum Polars Support + Streaming Compatibility

This library depends only on Python Polars (for most of its core) and will try to be as stable as possible for polars>=1.4.0. Exceptions will be made when a Polars update forces changes in the plugins. However, Polars updates quickly, and older versions may not be tested. Currently, it is actively tested against Polars>=1.33.

This package is also not tested with Polars streaming mode and is not designed to work with data so big that it has to be streamed. This concerns plugin expressions such as pds.lin_reg, which won't work with streaming. By the same token, the large-index version of Polars is not supported at this point, and third-party packaging is welcome. However, I will try to support some expressions on the streaming engine, as they may be important.

Build From Source

The guide here is not specific to LTS CPU and can be used generally.

If you target an LTS CPU, the best advice is to compile the package yourself. First clone the repo and make sure Rust is installed on the system. Create a Python virtual environment and install maturin in it. Next, set the RUSTFLAGS environment variable. The official polars-lts-cpu features are the following:

RUSTFLAGS=-C target-feature=+sse3,+ssse3,+sse4.1,+sse4.2,+popcnt,+cmpxchg16b

If you simply want to compile from source, you may set target-cpu to native, which autodetects CPU features.

RUSTFLAGS=-C target-cpu=native

If you are compiling for LTS CPU, then in pyproject.toml, update the polars dependency to polars-lts-cpu:

polars >= 1.4.0 # polars-lts-cpu >= 1.4.0

Lastly, run

maturin develop --release

If you want to test the build locally, you may run

# pip install -r requirements-test.txt
pytest tests/test_*

If you see the following error in pytest, it means setuptools is not installed, and you may ignore it. pkg_resources is a legacy module that ships with setuptools, not a Python builtin.

tests/test_many.py::test_xi_corr - ModuleNotFoundError: No module named 'pkg_resources'

You can then publish it to your private PyPI server, or just use it locally.

Credits

  1. Some statistics functions are taken from Statrs (MIT) and internalized. See here
  2. Linear algebra routines are powered mostly by faer
  3. String similarity metrics are soooo fast because of RapidFuzz

Other Projects

  1. Caching for Polars

AI Usage Disclosure

Since this project is mostly maintained by a single person, and a single person cannot be fluent with all topics in scientific programming, some of the code in this package is guided by AI and reviewed by me. I will not accept vibe-coded functions. For AI-generated PRs, see contributing.md.
