Polars for Data Science
Discord | Documentation | User Guide | Want to Contribute?
pip install polars-ds
PDS (polars_ds)
PDS is a modern data science package that
- is fast and furious
- is small and lean, with minimal dependencies
- has an intuitive and concise API (if you know Polars already)
- has a dataframe-friendly design
- and covers a wide variety of data science topics, such as simple statistics, linear regression, string edit distances, tabular data transforms, feature extraction, traditional modelling pipelines, model evaluation metrics, and more
It stands on the shoulders of the great Polars dataframe library. You can find more examples in the documentation. Here are some highlights!
Parallel ML Metrics Calculation
import polars as pl
import polars_ds as pds
# Parallel evaluation of multiple ML metrics on different segments of data
df.lazy().group_by("segments").agg(
# any other metrics you want in here
pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ segments ┆ roc_auc ┆ log_loss │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞══════════╪══════════╪══════════╡
│ a ┆ 0.497745 ┆ 1.006438 │
│ b ┆ 0.498801 ┆ 0.997226 │
└──────────┴──────────┴──────────┘
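For intuition, here is what the log loss metric reduces to, sketched in pure Python. This is an illustration of the standard binary log-loss formula, not the library's implementation, and the sample values are made up:

```python
import math

# Made-up labels and predicted probabilities for illustration
actual = [1, 0, 1, 1, 0]
predicted = [0.9, 0.2, 0.7, 0.6, 0.4]

# Binary log loss: -mean(y*ln(p) + (1-y)*ln(1-p))
log_loss = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(actual, predicted)
) / len(actual)
print(round(log_loss, 6))
```

The advantage of the PDS version is that the same computation runs in parallel across group_by segments inside the query engine.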
In-dataframe linear regression + feature transformations
import polars_ds as pds
from polars_ds.pipeline.transforms import polynomial_features
# If you want the underlying computation to be done in f32, set pds.config.LIN_REG_EXPR_F64 = False
df.select(
pds.lin_reg_report(
*(
["x1", "x2", "x3"] +
polynomial_features(["x1", "x2", "x3"], degree = 2, interaction_only=True)
)
, target = "target"
, add_bias = False
).alias("result")
).unnest("result")
┌──────────┬───────────┬──────────┬─────────────┬───────┬───────────┬───────────┬──────────┬──────────┐
│ features ┆ beta      ┆ std_err  ┆ t           ┆ p>|t| ┆ 0.025     ┆ 0.975     ┆ r2       ┆ adj_r2   │
│ ---      ┆ ---       ┆ ---      ┆ ---         ┆ ---   ┆ ---       ┆ ---       ┆ ---      ┆ ---      │
│ str      ┆ f64       ┆ f64      ┆ f64         ┆ f64   ┆ f64       ┆ f64       ┆ f64      ┆ f64      │
╞══════════╪═══════════╪══════════╪═════════════╪═══════╪═══════════╪═══════════╪══════════╪══════════╡
│ x1       ┆ 0.26332   ┆ 0.000315 ┆ 835.686778  ┆ 0.0   ┆ 0.262703  ┆ 0.263938  ┆ 0.971087 ┆ 0.971085 │
│ x2       ┆ 0.413824  ┆ 0.000311 ┆ 1331.988332 ┆ 0.0   ┆ 0.413216  ┆ 0.414433  ┆ 0.971087 ┆ 0.971085 │
│ x3       ┆ 0.113688  ┆ 0.000315 ┆ 361.29924   ┆ 0.0   ┆ 0.113072  ┆ 0.114305  ┆ 0.971087 ┆ 0.971085 │
│ x1*x2    ┆ -0.097272 ┆ 0.000543 ┆ -179.037776 ┆ 0.0   ┆ -0.098337 ┆ -0.096207 ┆ 0.971087 ┆ 0.971085 │
│ x1*x3    ┆ -0.097266 ┆ 0.000542 ┆ -179.448632 ┆ 0.0   ┆ -0.098329 ┆ -0.096204 ┆ 0.971087 ┆ 0.971085 │
│ x2*x3    ┆ -0.097987 ┆ 0.000542 ┆ -180.75796  ┆ 0.0   ┆ -0.099049 ┆ -0.096924 ┆ 0.971087 ┆ 0.971085 │
└──────────┴───────────┴──────────┴─────────────┴───────┴───────────┴───────────┴──────────┴──────────┘
- Normal Linear Regression (pds.lin_reg)
- Lasso, Ridge, Elastic Net (pds.lin_reg, use l1_reg, l2_reg arguments)
- Rolling linear regression with skipping (pds.rolling_lin_reg)
- Recursive linear regression (pds.recursive_lin_reg)
- Non-negative linear regression (pds.lin_reg, set positive = True)
- Statsmodel-like linear regression table (pds.lin_reg_report)
- f32 support (pds.Config.LIN_REG_EXPR_F64 = False)
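At their core, these all solve some variant of least squares. As a rough illustration of the underlying math (not the library's actual implementation, which runs on the Rust side via faer), ordinary least squares solves the normal equations (XᵀX)β = Xᵀy. A tiny hand-rolled two-feature example with made-up data:

```python
# Made-up data, exactly linear: y = 1*x1 + 2*x2, so OLS recovers beta = [1, 2]
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [5.0, 4.0, 11.0, 10.0]

# Build the normal equations: (X^T X) beta = X^T y
xtx = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]

# Solve the 2x2 system via the explicit inverse
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
beta = [
    (xtx[1][1] * xty[0] - xtx[0][1] * xty[1]) / det,
    (xtx[0][0] * xty[1] - xtx[1][0] * xty[0]) / det,
]
```

The regularized variants (Lasso, Ridge, Elastic Net) add penalty terms to this objective, and the report version additionally computes the standard errors and t-statistics shown in the table above.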
Making Polars More Convenient
import polars_ds as pds
df = pl.DataFrame({
"group": ['A', 'A', 'B', 'B', 'A']
, "a": [1, 2, 3, 4, 5]
, "b": [4, 1, 99, 12, 33]
})
df.group_by("group").agg(
*pds.E(['a', 'b'], ["min", "max", "n_unique", "len"])
)
shape: (2, 8)
┌───────┬───────┬───────┬───────┬───────┬────────────┬────────────┬─────────┐
│ group ┆ a_min ┆ b_min ┆ a_max ┆ b_max ┆ a_n_unique ┆ b_n_unique ┆ __len__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ u32 ┆ u32 ┆ u32 │
╞═══════╪═══════╪═══════╪═══════╪═══════╪════════════╪════════════╪═════════╡
│ A ┆ 1 ┆ 1 ┆ 5 ┆ 33 ┆ 3 ┆ 3 ┆ 3 │
│ B ┆ 3 ┆ 12 ┆ 4 ┆ 99 ┆ 2 ┆ 2 ┆ 2 │
└───────┴───────┴───────┴───────┴───────┴────────────┴────────────┴─────────┘
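Conceptually, pds.E expands into one aggregation expression per (column, function) pair. A pure-Python sanity check of the table above (illustration only, not the library's code):

```python
from collections import defaultdict

data = {
    "group": ["A", "A", "B", "B", "A"],
    "a": [1, 2, 3, 4, 5],
    "b": [4, 1, 99, 12, 33],
}

# Group the rows by the "group" column
groups = defaultdict(lambda: {"a": [], "b": []})
for g, a, b in zip(data["group"], data["a"], data["b"]):
    groups[g]["a"].append(a)
    groups[g]["b"].append(b)

# One statistic per (column, function) pair, as in the table above
stats = {
    g: {
        "a_min": min(v["a"]), "b_min": min(v["b"]),
        "a_max": max(v["a"]), "b_max": max(v["b"]),
        "a_n_unique": len(set(v["a"])), "b_n_unique": len(set(v["b"])),
        "len": len(v["a"]),
    }
    for g, v in groups.items()
}
```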
Tabular Machine Learning Data Transformation Pipeline
See SKLEARN_COMPATIBILITY for more details.
import polars as pl
import polars.selectors as cs
from polars_ds.pipeline import Pipeline, Blueprint
bp = (
Blueprint(df, name = "example", target = "approved", lowercase=True) # lowercase=True optionally lowercases the column names
.filter(
"city_category is not null" # or equivalently, you can do: pl.col("city_category").is_not_null()
)
.linear_impute(features = ["var1", "existing_emi"], target = "loan_period")
.impute(["existing_emi"], method = "median")
.append_expr( # generate some features
pl.col("existing_emi").log1p().alias("existing_emi_log1p"),
pl.col("loan_amount").log1p().alias("loan_amount_log1p"),
pl.col("loan_amount").clip(lower_bound = 0, upper_bound = 1000).alias("loan_amount_clipped"),
pl.col("loan_amount").sqrt().alias("loan_amount_sqrt"),
pl.col("loan_amount").shift(-1).alias("loan_amount_lag_1") # any kind of lag transform
)
.scale( # target is numerical, but will be excluded automatically because bp is initialized with a target
cs.numeric().exclude(["var1", "existing_emi_log1p"]), method = "standard"
) # Scale the columns up to this point. The columns below won't be scaled
.append_expr(
# Add missing flags
pl.col("employer_category1").is_null().cast(pl.UInt8).alias("employer_category1_is_missing")
)
.one_hot_encode("gender", drop_first=True)
.woe_encode("city_category") # No need to specify target because we initialized bp with a target
.target_encode("employer_category1", min_samples_leaf = 20, smoothing = 10.0) # same as above
)
print(bp)
pipe:Pipeline = bp.materialize()
# Check out the result in our example notebooks! (examples/pipeline.ipynb)
df_transformed = pipe.transform(df)
df_transformed.head()
Since Polars >= 1.34 supports collect_batches(), you can also use it to perform batched machine learning:
for df_batch in pipe.transform(df, return_lazy=True).collect_batches():
X_batch, y_batch = your_function_to_turn_df_batch_into_model_inputs(df_batch)
ml_model.update(X_batch, y_batch)
See pipeline examples for more details and caveats.
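To give a flavor of what one of these steps computes: weight-of-evidence encoding maps each category to ln(P(category | target=1) / P(category | target=0)). A pure-Python sketch with made-up data (illustration only; the library's woe_encode may differ in details such as smoothing of zero counts):

```python
import math
from collections import Counter

# Made-up categorical feature and binary target
city = ["a", "a", "b", "b", "b", "c"]
approved = [1, 0, 1, 1, 0, 0]

pos = Counter(c for c, t in zip(city, approved) if t == 1)
neg = Counter(c for c, t in zip(city, approved) if t == 0)
n_pos, n_neg = sum(pos.values()), sum(neg.values())

# WOE(c) = ln( (positives in c / all positives) / (negatives in c / all negatives) )
# Categories with zero counts on either side are skipped in this naive sketch
woe = {
    c: math.log((pos[c] / n_pos) / (neg[c] / n_neg))
    for c in set(city)
    if pos[c] and neg[c]
}
```

The Blueprint learns these mappings once at materialize() time, so transform() is a pure, repeatable function of the input dataframe.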
Nearest Neighbors Related Queries
Get all neighbors within radius r, call them best friends, and count them. Due to limitations, this currently doesn't preserve the index, and is not fast when k or the dimension of the data is large.
df.select(
pl.col("id"),
pds.query_radius_ptwise(
pl.col("var1"), pl.col("var2"), pl.col("var3"), # Columns used as the coordinates in 3d space
index = pl.col("id"),
r = 0.1,
dist = "sql2", # squared l2
parallel = True
).alias("best friends"),
).with_columns( # -1 to remove the point itself
(pl.col("best friends").list.len() - 1).alias("best friends count")
).head()
shape: (5, 3)
┌─────┬───────────────────┬────────────────────┐
│ id ┆ best friends ┆ best friends count │
│ --- ┆ --- ┆ --- │
│ u32 ┆ list[u32] ┆ u32 │
╞═════╪═══════════════════╪════════════════════╡
│ 0 ┆ [0, 811, … 1435] ┆ 152 │
│ 1 ┆ [1, 953, … 1723] ┆ 159 │
│ 2 ┆ [2, 355, … 835] ┆ 243 │
│ 3 ┆ [3, 102, … 1129] ┆ 110 │
│ 4 ┆ [4, 1280, … 1543] ┆ 226 │
└─────┴───────────────────┴────────────────────┘
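For intuition, a brute-force equivalent of the radius query in pure Python (the plugin uses a spatial index and parallelism instead; points and threshold below are made up):

```python
# Made-up points in 3d space; "id" is just the list index
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.2, 0.2), (1.0, 1.0, 1.0)]
r = 0.1  # threshold on the squared L2 distance, i.e. dist = "sql2"

def sql2(p, q):
    """Squared L2 (squared Euclidean) distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Each point's "best friends": indices of all points within radius r of it,
# including the point itself, mirroring the output above
best_friends = [
    [j for j, q in enumerate(points) if sql2(p, q) <= r]
    for p in points
]
counts = [len(bf) - 1 for bf in best_friends]  # -1 removes the point itself
```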
Distances
Various string distances:
df.select( # Column "word", compared to string in pl.lit(). It also supports column vs column comparison
pds.str_leven("word", pl.lit("asasasa"), return_sim=True).alias("Levenshtein"),
pds.str_osa("word", pl.lit("apples"), return_sim=True).alias("Optimal String Alignment"),
pds.str_jw("word", pl.lit("apples")).alias("Jaro-Winkler"),
)
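As a reference for what the Levenshtein distance measures, here is the classic dynamic-programming version in pure Python (the plugin is much faster, being built on RapidFuzz; the normalization shown is one common convention and may differ from the library's exact return_sim definition):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # delete from a
                cur[j - 1] + 1,           # insert into a
                prev[j - 1] + (ca != cb), # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

dist = levenshtein("kitten", "sitting")  # the classic example: 3 edits
# A common similarity normalization: 1 - dist / max(len(a), len(b))
sim = 1 - dist / max(len("kitten"), len("sitting"))
```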
Array, list distances:
df = pl.DataFrame({
"x": [[1,2,3], [4,5,6]]
, "y": [[0.5, 0.2, 0.3], [4.0, 5.0, 6.1]]
})
df.select(
x = pl.col('x').cast(pl.Array(inner=pl.Float64, shape=3))
, y = pl.col('y').cast(pl.Array(inner=pl.Float64, shape=3))
).select(
pds.arr_sql2_dist('x', 'y')
)
shape: (2, 1)
┌───────┐
│ x │
│ --- │
│ f64 │
╞═══════╡
│ 10.78 │
│ 0.01 │
└───────┘
For list columns, replace arr_sql2_dist with list_sql2_dist. Note: sql2 stands for squared l2 distance, which is the same as squared Euclidean distance.
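A quick pure-Python check that reproduces the two distances shown above:

```python
x = [[1, 2, 3], [4, 5, 6]]
y = [[0.5, 0.2, 0.3], [4.0, 5.0, 6.1]]

# Squared L2 distance per row: sum of squared elementwise differences
dists = [
    sum((a - b) ** 2 for a, b in zip(xi, yi))
    for xi, yi in zip(x, y)
]
print([round(d, 2) for d in dists])  # matches the table: [10.78, 0.01]
```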
In-dataframe statistical tests
df.group_by("market_id").agg(
pds.ttest_ind("var1", "var2", equal_var=False).alias("t-test"),
pds.chi2("category_1", "category_2").alias("chi2-test"),
pds.f_test("var1", group = "category_1").alias("f-test")
)
shape: (3, 4)
┌───────────┬──────────────────────┬──────────────────────┬─────────────────────┐
│ market_id ┆ t-test ┆ chi2-test ┆ f-test │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ struct[2] ┆ struct[2] ┆ struct[2] │
╞═══════════╪══════════════════════╪══════════════════════╪═════════════════════╡
│ 0 ┆ {2.072749,0.038272} ┆ {33.487634,0.588673} ┆ {0.312367,0.869842} │
│ 1 ┆ {0.469946,0.638424} ┆ {42.672477,0.206119} ┆ {2.148937,0.072536} │
│ 2 ┆ {-1.175325,0.239949} ┆ {28.55723,0.806758} ┆ {0.506678,0.730849} │
└───────────┴──────────────────────┴──────────────────────┴─────────────────────┘
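For reference, the statistic behind the unequal-variance (Welch's) t-test, sketched with the standard library on made-up samples. The library also returns the p-value, which requires the t-distribution CDF and is omitted here:

```python
from statistics import mean, variance

# Made-up samples for illustration
var1 = [2.1, 2.5, 2.3, 2.7, 2.4]
var2 = [1.9, 2.0, 2.2, 1.8, 2.1]

m1, m2 = mean(var1), mean(var2)
v1, v2 = variance(var1), variance(var2)  # sample variances (n-1 denominator)
n1, n2 = len(var1), len(var2)

# Welch's t statistic: no equal-variance assumption
t_stat = (m1 - m2) / ((v1 / n1 + v2 / n2) ** 0.5)
```

As with the metrics examples earlier, the point of the PDS version is that such tests run per group, in parallel, inside a single Polars query.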
Compatibility
Under some mild assumptions (e.g. columns implement to_numpy()), PDS works with other eager dataframes. For example, with Pandas:
from polars_ds.compat import compat as pds2
df_pd["linear_regression_result"] = pds2.lin_reg(
df_pd["x1"], df_pd["x2"], df_pd["x3"],
target = df_pd["y"],
return_pred = True
)
df_pd
The magic here is the compat module and the fact that most eager dataframes implement the array protocol.
Other
Other common numerical functions such as: pds.convolve, pds.query_r2, pds.principal_components, etc. See our docs for more information.
Getting Started
import polars_ds as pds
To make full use of the Diagnosis module, do
pip install "polars_ds[plot]"
How Fast is it?
Feel free to take a look at our benchmark notebook!
Generally speaking, the more expressions you want to evaluate simultaneously, the faster Polars + PDS will be than Pandas + (SciPy / Sklearn / NumPy). The more CPU cores you have on your machine, the bigger the time difference will be in favor of Polars + PDS.
HELP WANTED!
- Documentation writing, testing, benchmarking, etc.
Road Map
- K-means, K-medoids clustering as expressions and also standalone modules.
- Other improvement items. See issues.
Minimum Polars Support + Streaming Compatibility
This library will only depend on Python Polars (for most of its core) and will try to be as stable as possible for polars>=1.4.0. Exceptions will be made when Polars updates force changes in the plugins. However, Polars updates quickly and older versions may not be tested. Currently, it is actively tested against Polars>=1.33.
This package is also not tested with Polars streaming mode and is not designed to work with data so big that it has to be streamed. This concerns plugin expressions like pds.lin_reg, etc., which won't work with streaming. By the same token, the Polars large-index version is not supported at this point, and I welcome any 3rd party packaging for it. However, I will try to support some expressions with the streaming engine, as they may be important.
Build From Source
The guide here is not specific to LTS CPUs and can be used generally.
The best advice for LTS CPUs is to compile the package yourself. First, clone the repo and make sure Rust is installed on the system. Create a Python virtual environment and install maturin in it. Next, set the RUSTFLAGS environment variable. The official polars-lts-cpu features are the following:
RUSTFLAGS=-C target-feature=+sse3,+ssse3,+sse4.1,+sse4.2,+popcnt,+cmpxchg16b
If you simply want to compile from source, you may set the target CPU to native, which auto-detects CPU features.
RUSTFLAGS=-C target-cpu=native
If you are compiling for LTS CPU, then in pyproject.toml, update the polars dependency to polars-lts-cpu:
polars >= 1.4.0 # polars-lts-cpu >= 1.4.0
Lastly, run
maturin develop --release
If you want to test the build locally, you may run
# pip install -r requirements-test.txt
pytest tests/test_*
If you see this error in pytest, it means setuptools is not installed, and you may ignore it. pkg_resources is a legacy package that ships with setuptools.
tests/test_many.py::test_xi_corr - ModuleNotFoundError: No module named 'pkg_resources'
You can then publish it to your private PYPI server, or just use it locally.
Credits
- Some statistics functions are taken from Statrs (MIT) and internalized. See here
- Linear algebra routines are powered mostly by faer
- String similarity metrics are soooo fast because of RapidFuzz