Fast similarity join for polars DataFrames. Fork by Edwardvaneechoud with fixes.

polars_simed

Description

Implements an approximate join of two polars dataframes based on string columns.

Right now, we use a fixed vectorization that is applied on the fly. The resulting vectors feed a sparse matrix multiplication combined with a top-n selection, which yields the cosine similarity of each matched string pair.
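
As a rough illustration of this pipeline, the following pure-Python sketch vectorizes strings, computes cosine similarities, and keeps the top n matches per left string. The vectorization used here (character trigrams counted within each whitespace-separated word) is an assumption that this README does not confirm, although it happens to reproduce the scores shown in the Usage section. Dropping pairs with zero similarity is likewise an assumption. The helper names trigram_counts, cosine, and top_n_matches are hypothetical, and the Counter objects only stand in for the sparse matrices of the real Rust implementation.

```python
import math
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count character trigrams inside each whitespace-separated word (assumed scheme)."""
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_matches(left, right, top_n):
    """Return (left, similarity, right) triples, at most top_n per left string."""
    right_vecs = [(r, trigram_counts(r)) for r in right]
    matches = []
    for l in left:
        lv = trigram_counts(l)
        scored = [(cosine(lv, rv), r) for r, rv in right_vecs]
        # Assumption: pairs without any shared trigram are dropped.
        scored = [(s, r) for s, r in scored if s > 0.0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        matches.extend((l, s, r) for s, r in scored[:top_n])
    return matches


matches = top_n_matches(
    ["alice", "bob", "charlie", "david"],
    ["ali", "alice in wonderland", "bobby", "tom"],
    top_n=4,
)
for name, sim, name_right in matches:
    print(f"{name:6} {sim:.5f}  {name_right}")
```

A dense loop over all pairs like this scales with the product of both input lengths; the sparse matrix multiplication with top-n selection avoids materializing every pair.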

The join_sim function is similar to a left join or join_asof but for strings instead of timestamps.

Installation

pip install polars_simed

Development

We use uv for Python package management. You also need Rust installed; see the Rust installation instructions. You never need to activate an environment yourself, because uv handles this. To get started, run:

# install python dependencies and compile the rust code
make install 
# run tests
make test

Usage

import polars as pl
import polars_simed as ps

df_left = pl.DataFrame(
    {
        "name": ["alice", "bob", "charlie", "david"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["ali", "alice in wonderland", "bobby", "tom"],
    }
)

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    top_n=4,
)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f32      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ alice ┆ 0.57735  ┆ ali                 │
│ alice ┆ 0.522233 ┆ alice in wonderland │
│ bob   ┆ 0.57735  ┆ bobby               │
└───────┴──────────┴─────────────────────┘

Performance

A benchmark can be executed with make run-bench. In general, performance depends heavily on the length of the dataframes. By default, the computation is parallelized over the left dataframe. However, several benchmarks showed that if the right dataframe is much larger than the left dataframe and no normalization is applied, it is faster to parallelize over the right dataframe.
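
To make the two parallelization strategies concrete, here is a minimal Python sketch. It is not the library's Rust implementation: the score function is a placeholder similarity, all function names are hypothetical, left strings are assumed to be unique, and Python threads are used only to show the structure of the work split, not to obtain a real speedup. The point is that splitting over the left side lets each worker produce final results for its rows, while splitting over the right side requires merging the per-chunk candidates and selecting the top n again.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def score(a: str, b: str) -> float:
    """Placeholder similarity (Jaccard over characters), for illustration only."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def top_n_for(left_rows, right_rows, top_n):
    """Return {left: [(score, right), ...]} with at most top_n entries per left row."""
    return {
        l: heapq.nlargest(top_n, ((score(l, r), r) for r in right_rows))
        for l in left_rows
    }


def chunks(rows, n):
    """Split rows into at most n contiguous chunks of roughly equal size."""
    size = max(1, -(-len(rows) // n))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_over_left(left, right, top_n, workers=2):
    """Each worker handles a slice of left rows against the full right side."""
    with ThreadPoolExecutor(workers) as pool:
        parts = list(pool.map(lambda c: top_n_for(c, right, top_n), chunks(left, workers)))
    merged = {}
    for part in parts:
        merged.update(part)  # results per left row are already final
    return merged


def parallel_over_right(left, right, top_n, workers=2):
    """Each worker handles all left rows against a slice of the right side."""
    with ThreadPoolExecutor(workers) as pool:
        parts = list(pool.map(lambda c: top_n_for(left, c, top_n), chunks(right, workers)))
    candidates = {l: [] for l in left}
    for part in parts:
        for l, cands in part.items():
            candidates[l].extend(cands)
    # Candidates from different right chunks must be merged and cut to top_n again.
    return {l: heapq.nlargest(top_n, cands) for l, cands in candidates.items()}
```

Both functions return the same matches; they differ only in how the work is divided and in whether a final merge step is needed.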

If no normalization is applied, performance is usually better because a small unsigned integer type, e.g. u16, is used for the sparse matrix multiplication. Otherwise, all types are 32 bits wide.
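
The effect of the element type can be illustrated with Python's standard array module. This is only a sketch of the idea, with made-up count vectors: raw counts fit into a 16-bit unsigned integer, whereas normalized entries are fractions and need a floating-point type, which takes 32 bits per element here.

```python
import math
from array import array

# Hypothetical raw feature counts for two strings (dense for readability).
counts_left = [1, 1, 1, 0, 0]
counts_right = [1, 0, 0, 2, 1]

# Without normalization, small unsigned integers ("H" = unsigned 16-bit) suffice.
left_u16 = array("H", counts_left)
right_u16 = array("H", counts_right)
raw_dot = sum(a * b for a, b in zip(left_u16, right_u16))


def l2_normalize(values):
    """Scale a vector to unit length; the result needs a float type ("f" = 32-bit float)."""
    norm = math.sqrt(sum(v * v for v in values))
    return array("f", (v / norm for v in values))


# With normalization, the dot product of unit vectors is the cosine similarity.
left_f32 = l2_normalize(counts_left)
right_f32 = l2_normalize(counts_right)
cos_sim = sum(a * b for a, b in zip(left_f32, right_f32))

bytes_u16 = left_u16.itemsize * len(left_u16)
bytes_f32 = left_f32.itemsize * len(left_f32)
print(raw_dot, round(cos_sim, 4), bytes_u16, bytes_f32)
```

Smaller elements mean less memory traffic during the multiplication, which is one plausible reason for the speed difference described above.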

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.
