Fast similarity join for polars DataFrames. Fork by Edwardvaneechoud with fixes.

polars_simed

Description

Implements an approximate join of two polars dataframes based on string columns.

Currently, the strings are converted to vectors with a fixed vectorization scheme that is applied on the fly. The vectors are then fed into a sparse matrix multiplication combined with a top-n selection, which produces the cosine similarities of the individual string pairs.
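The pipeline described above can be sketched in pure Python. This is only an illustration, not the library's Rust implementation: the README does not say which vectorization is used, so the sketch assumes character trigrams counted inside each whitespace-separated word, and the helper names (`trigram_vector`, `cosine`, `top_n_matches`) are hypothetical. Dictionaries stand in for sparse rows.

```python
import math
from collections import Counter


def trigram_vector(text: str) -> Counter:
    """Sparse vector of character trigram counts per word (assumed scheme)."""
    counts = Counter()
    for word in text.split():
        # Words shorter than three characters contribute no trigrams.
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b[gram] for gram, weight in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_matches(left, right, top_n):
    """For each left string, keep the top_n right strings with non-zero similarity."""
    right_vectors = [(r, trigram_vector(r)) for r in right]
    rows = []
    for l in left:
        left_vector = trigram_vector(l)
        scored = [(cosine(left_vector, rv), r) for r, rv in right_vectors]
        scored = [(s, r) for s, r in scored if s > 0.0]
        scored.sort(key=lambda pair: -pair[0])  # stable: ties keep right-side order
        rows.extend((l, s, r) for s, r in scored[:top_n])
    return rows


rows = top_n_matches(
    ["alice", "bob", "charlie", "david"],
    ["ali", "alice in wonderland", "bobby", "tom"],
    top_n=4,
)
for left_name, sim, right_name in rows:
    print(left_name, round(sim, 6), right_name)
```

Under the trigram assumption, "alice" shares one of its three trigrams with "ali", so the similarity is 1/√3 ≈ 0.57735, which is also the value shown for that pair in the Usage section.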

The join_sim function is similar to a left join or join_asof but for strings instead of timestamps.

Installation

pip install polars_simed

Development

We use uv for Python package management. You also need Rust to be installed, see install rust. You never need to activate an environment yourself; uv handles this. To get started, run

# install python dependencies and compile the rust code
make install 
# run tests
make test

Usage

import polars as pl
import polars_simed as ps

df_left = pl.DataFrame(
    {
        "name": ["alice", "bob", "charlie", "david"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["ali", "alice in wonderland", "bobby", "tom"],
    }
)

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    top_n=4,
)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f32      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ alice ┆ 0.57735  ┆ ali                 │
│ alice ┆ 0.522233 ┆ alice in wonderland │
│ bob   ┆ 0.57735  ┆ bobby               │
└───────┴──────────┴─────────────────────┘

Performance

A benchmark can be executed with make run-bench. In general, the performance depends heavily on the length of the dataframes. By default, the computation is parallelized over the left dataframe. However, several benchmarks showed that if the right dataframe is much bigger than the left dataframe and no normalization is applied, it is faster to parallelize over the right dataframe.
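To illustrate what parallelizing over the right dataframe involves, here is a rough sketch in pure Python. It is not the library's code: the chunking strategy, the unnormalized trigram-overlap score, and all function names are assumptions made for illustration. The point is that each worker only sees a slice of the right side, so the per-chunk top-n candidates have to be merged afterwards.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def trigrams(s: str) -> set:
    """Set of character trigrams of a string (illustrative vectorization)."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def best_in_chunk(left, right_chunk, top_n):
    """Top-n (score, right) candidates per left string within one right-side chunk."""
    result = {}
    for l in left:
        scored = ((len(trigrams(l) & trigrams(r)), r) for r in right_chunk)
        result[l] = heapq.nlargest(top_n, (pair for pair in scored if pair[0] > 0))
    return result


def top_n_over_right(left, right, top_n, n_chunks=2):
    """Split the right side into chunks, score each chunk in a worker, then merge."""
    size = max(1, -(-len(right) // n_chunks))  # ceiling division
    chunks = [right[i:i + size] for i in range(0, len(right), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda c: best_in_chunk(left, c, top_n), chunks))
    # A global top-n entry is always in the top-n of its own chunk, so merging
    # the per-chunk candidates and re-selecting gives the same result.
    return {
        l: heapq.nlargest(top_n, (cand for p in partials for cand in p[l]))
        for l in left
    }


matches = top_n_over_right(
    ["alice", "bob"],
    ["ali", "alice in wonderland", "bobby", "tom"],
    top_n=1,
)
print(matches)
```

Python threads are used here only to show the partitioning; because of the interpreter lock they would not speed up this pure-Python scoring. Parallelizing over the left side needs no merge step, since each left row's top-n is computed entirely within one worker.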

If no normalization is applied, the performance is usually better, since a small unsigned integer type, e.g. u16, is used for the sparse matrix multiplication. Otherwise, all types are 32 bits wide.
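The reason normalization forces a wider type can be shown with a small sketch using Python's standard `array` module. The vectors below are made-up counts, not output from the library, and the type codes are only stand-ins for the Rust types mentioned above.

```python
from array import array

# Raw feature counts are small non-negative integers, so an unsigned
# 16-bit type is enough ("H" is an unsigned short, typically 2 bytes).
counts = array("H", [1, 0, 2, 1])
other = array("H", [1, 1, 0, 3])
raw_dot = sum(a * b for a, b in zip(counts, other))  # exact integer result

# Scaling a vector to unit length produces fractions, which no longer
# fit an integer type and need a float ("f" is a 4-byte float).
norm = sum(c * c for c in counts) ** 0.5
normalized = array("f", [c / norm for c in counts])

print(raw_dot, counts.itemsize, normalized.itemsize)
```

With half the bytes per stored value, more of the sparse matrix fits into cache, which is one plausible reason for the speedup described above.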

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.
