Fast similarity join for polars DataFrames. Fork by Edwardvaneechoud with fixes.

Project description

polars_simed

Description

Implements an approximate join of two polars dataframes based on string columns.

Currently, the strings are vectorized on the fly with a fixed vectorization scheme. The resulting vectors are then used in a sparse matrix multiplication combined with a top-n selection, which yields the cosine similarities of the individual string pairs.
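
To make this pipeline concrete, here is a minimal pure-Python sketch of the same idea. It assumes character trigrams as the vectorization, which the description above does not confirm, and it uses plain dictionaries instead of a real sparse matrix product. The helper names trigram_counts, cosine and top_n_matches are hypothetical and not part of polars_simed.

```python
import math
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count overlapping character trigrams (an assumed vectorization)."""
    return Counter(text[i : i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_matches(left, right, top_n):
    """For each left string, keep up to top_n right strings with non-zero similarity."""
    right_vecs = [(r, trigram_counts(r)) for r in right]
    rows = []
    for l in left:
        l_vec = trigram_counts(l)
        scored = [(l, cosine(l_vec, r_vec), r) for r, r_vec in right_vecs]
        scored = [row for row in scored if row[1] > 0.0]
        scored.sort(key=lambda row: row[1], reverse=True)
        rows.extend(scored[:top_n])
    return rows


rows = top_n_matches(
    ["alice", "bob", "charlie", "david"],
    ["ali", "alice in wonderland", "bobby", "tom"],
    top_n=4,
)
for row in rows:
    print(row)
```

Pairs that share no trigram get a similarity of zero and are dropped in this sketch. The exact scores depend on the assumed vectorization and may differ from what the library returns.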

The join_sim function is similar to a left join or join_asof but for strings instead of timestamps.

Installation

pip install polars_simed

Development

We use uv for Python package management. You also need Rust installed (see the Rust installation instructions). You never need to activate an environment yourself; uv handles this. To get started, run:

# install python dependencies and compile the rust code
make install 
# run tests
make test

Usage

import polars as pl
import polars_simed as ps

df_left = pl.DataFrame(
    {
        "name": ["alice", "bob", "charlie", "david"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["ali", "alice in wonderland", "bobby", "tom"],
    }
)

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    top_n=4,
)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f32      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ alice ┆ 0.57735  ┆ ali                 │
│ alice ┆ 0.522233 ┆ alice in wonderland │
│ bob   ┆ 0.57735  ┆ bobby               │
└───────┴──────────┴─────────────────────┘

Performance

A benchmark can be executed with make run-bench. In general, performance depends heavily on the number of rows in the two dataframes. By default, the computation is parallelized over the left dataframe. However, several benchmarks showed that if the right dataframe is much bigger than the left dataframe and no normalization is applied, it is faster to parallelize over the right dataframe.
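
The difference between the two strategies can be sketched in pure Python. This is only an illustration of how the work is split, not the library's Rust implementation: score is a toy stand-in for the real similarity, best_over_left and best_over_right are hypothetical names, and Python threads are used just to show the partitioning rather than to gain real speed.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def score(left_item: str, right_item: str) -> int:
    """Toy similarity: number of shared characters (a stand-in metric)."""
    return len(set(left_item) & set(right_item))


def best_over_left(left, right, top_n, workers=2):
    """Split the work by left row: each task yields a final top-n list."""
    def task(l):
        # Ties are broken by the string itself so results are deterministic.
        return l, heapq.nlargest(top_n, right, key=lambda r: (score(l, r), r))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, left))


def best_over_right(left, right, top_n, workers=2):
    """Split the right side into chunks, then merge partial top-n lists."""
    chunks = [right[i::workers] for i in range(workers)]

    def task(chunk):
        return {
            l: heapq.nlargest(top_n, chunk, key=lambda r: (score(l, r), r))
            for l in left
        }

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(task, chunks))

    # Extra merge step: combine the per-chunk candidates for every left row.
    return {
        l: heapq.nlargest(
            top_n,
            (r for partial in partials for r in partial[l]),
            key=lambda r: (score(l, r), r),
        )
        for l in left
    }


left = ["alice", "bob"]
right = ["ali", "alice in wonderland", "bobby", "tom"]
print(best_over_left(left, right, top_n=2))
print(best_over_right(left, right, top_n=2))
```

Both functions return the same matches. Splitting over the right side needs an additional merge step, which only pays off when the right side is large enough to dominate the cost.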

If no normalization is applied, performance is usually better because a small unsigned integer type, e.g. u16, is used for the sparse matrix multiplication. Otherwise, all types are 32 bits wide.
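
The following sketch illustrates why skipping normalization allows a narrower element type. It uses Python's array module with hypothetical count vectors over a made-up four-term vocabulary; the values and the helper l2_normalize are assumptions for illustration and say nothing about how polars_simed stores its data internally.

```python
import math
from array import array

# Hypothetical term counts for two strings over a shared 4-term vocabulary.
# Type code "H" is an unsigned short, typically 16 bits wide (like u16).
left_counts = array("H", [1, 1, 1, 0])
right_counts = array("H", [1, 0, 0, 2])

# Without normalization the dot product is plain integer arithmetic.
raw_dot = sum(a * b for a, b in zip(left_counts, right_counts))


def l2_normalize(counts):
    """Scale a count vector to unit length; the result needs a float type."""
    norm = math.sqrt(sum(c * c for c in counts))
    return array("f", [c / norm for c in counts])  # "f" is a 32-bit float


left_unit = l2_normalize(left_counts)
right_unit = l2_normalize(right_counts)

# With normalization the dot product becomes the cosine similarity.
cosine = sum(a * b for a, b in zip(left_unit, right_unit))

print(left_counts.itemsize, left_unit.itemsize)  # bytes per element
print(raw_dot, cosine)
```

Integer counts fit into two bytes per element, while the normalized vectors need four, so the unnormalized product moves half as much data per stored value.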

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.
