Fast similarity join for polars DataFrames.

polars_sim

Description

Implements an approximate join of two polars dataframes based on string columns.

Right now, we use a fixed vectorization, which is applied to the strings on the fly. The resulting sparse matrices are multiplied, and a top-n selection is applied to the product. This produces the cosine similarities of the best-matching string pairs.
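To make the pipeline above concrete, here is a minimal pure-Python sketch of vectorization, sparse dot products, and top-n selection. It is not the library's implementation. The vectorization (lowercased character trigrams taken within whitespace-separated words, L2-normalised) is an assumption, and the names trigram_vector and top_n_cosine are hypothetical. An inverted index stands in for the sparse matrix.

```python
from __future__ import annotations

import math
from collections import Counter, defaultdict


def trigram_vector(text: str) -> dict[str, float]:
    """L2-normalised character-trigram counts, taken per word (assumed scheme)."""
    counts = Counter(
        word[i : i + 3]
        for word in text.lower().split()
        for i in range(len(word) - 2)
    )
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


def top_n_cosine(
    left: list[str], right: list[str], top_n: int
) -> list[tuple[str, str, float]]:
    """Return up to top_n (left, right, similarity) matches per left string."""
    # Inverted index: trigram -> [(right row, weight)].
    # This plays the role of the sparse right-hand matrix.
    index: dict[str, list[tuple[int, float]]] = defaultdict(list)
    for j, text in enumerate(right):
        for gram, weight in trigram_vector(text).items():
            index[gram].append((j, weight))

    matches = []
    for text in left:
        # Sparse row-times-matrix product: only shared trigrams contribute.
        scores: dict[int, float] = defaultdict(float)
        for gram, weight in trigram_vector(text).items():
            for j, weight_right in index.get(gram, ()):
                scores[j] += weight * weight_right
        # Top-n selection: keep the highest non-zero similarities.
        best = sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
        matches.extend((text, right[j], sim) for j, sim in best)
    return matches
```

Because both vectors are normalised to unit length, the dot product is the cosine similarity. Under the trigram assumption, calling top_n_cosine on the two name lists from the Usage section with top_n=4 gives the same three pairs and scores shown there (0.57735, 0.522233, 0.57735), but the source does not confirm that this is the vectorization actually used.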

The join_sim function is similar to join_asof but for strings instead of timestamps.

Installation

pip install polars_sim

Development

We use uv for Python package management. You also need Rust to be installed, see install rust. You don't need to activate an environment yourself at any point; uv handles this. To get started, run:

# create a virtual environment
uv venv --seed -p 3.11
# install dependencies
uv pip install -e .
# install dev dependencies
uv pip install -r requirements.txt
# compile rust code
make install
# run tests
make test

Usage

import polars as pl
import polars_sim as ps

df_left = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["Ali", "Alice in Wonderland", "Bobby", "Tom"],
    }
)

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    top_n=4,
)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f64      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ Alice ┆ 0.57735  ┆ Ali                 │
│ Alice ┆ 0.522233 ┆ Alice in Wonderland │
│ Bob   ┆ 0.57735  ┆ Bobby               │
└───────┴──────────┴─────────────────────┘

Performance

A benchmark can be executed with make run-bench. In general, the performance depends heavily on the number of rows in the dataframes. By default, the computation is parallelized over one of the two dataframes, depending on their sizes: if the left dataframe is comparatively small, the computation is parallelized over the right dataframe, and vice versa. This choice can be set explicitly with the threading_dimension parameter.
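The idea of choosing which side to parallelize over can be sketched as follows. This is only an illustration of the partitioning. The rule "split whichever side has more rows" is an assumed stand-in for the library's actual size heuristic, and choose_dimension, chunk, and score_all_pairs are hypothetical names. Python threads are used for brevity and will not speed up CPU-bound scoring the way native threads do.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def choose_dimension(n_left: int, n_right: int) -> str:
    """Assumed heuristic: split the side that has more rows."""
    return "right" if n_left < n_right else "left"


def chunk(rows: list[str], n_chunks: int) -> list[list[str]]:
    """Split rows into at most n_chunks contiguous parts."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i : i + size] for i in range(0, len(rows), size)]


def score_all_pairs(
    left: list[str],
    right: list[str],
    score: Callable[[str, str], float],
    n_threads: int = 4,
) -> tuple[str, list[tuple[str, str, float]]]:
    """Score every (left, right) pair, splitting one side across threads."""
    dim = choose_dimension(len(left), len(right))

    def work(part: list[str]) -> list[tuple[str, str, float]]:
        # Each worker sees one chunk of the split side and all of the other side.
        if dim == "left":
            return [(l, r, score(l, r)) for l in part for r in right]
        return [(l, r, score(l, r)) for l in left for r in part]

    parts = chunk(left if dim == "left" else right, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        rows = [row for block in pool.map(work, parts) for row in block]
    return dim, rows
```

This sketch scores all pairs, so the chunks can simply be concatenated. With a top-n selection, splitting the right side would additionally require merging the per-chunk candidates for each left row before keeping the best n.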

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.
