
Fast similarity join for polars DataFrames. Fork by Edwardvaneechoud with fixes.

Project description

polars_simed

Description

Implements an approximate join of two polars dataframes based on string columns.

Currently, the strings are vectorized on the fly with a fixed scheme. The resulting vectors are fed into a sparse matrix multiplication combined with a top-n selection, which yields the cosine similarity of each candidate string pair.
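The pipeline described above can be sketched in plain Python. This is an illustration only, not the library's Rust implementation: the vectorization used here (raw counts of character trigrams inside each whitespace-separated token) is an assumption, and the helper names trigram_counts, cosine and top_n_matches are hypothetical. Under that assumption, the sketch reproduces the similarity values shown in the Usage section.

```python
import math
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count character trigrams inside each whitespace-separated token (assumed scheme)."""
    counts = Counter()
    for token in text.split():
        for i in range(len(token) - 2):
            counts[token[i:i + 3]] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_n_matches(left, right, top_n):
    """For each left string, keep the top_n right strings with non-zero similarity."""
    right_vecs = [(r, trigram_counts(r)) for r in right]
    rows = []
    for name in left:
        left_vec = trigram_counts(name)
        scored = [(name, cosine(left_vec, vec), r) for r, vec in right_vecs]
        scored = [row for row in scored if row[1] > 0.0]
        scored.sort(key=lambda row: row[1], reverse=True)
        rows.extend(scored[:top_n])
    return rows


rows = top_n_matches(
    ["alice", "bob", "charlie", "david"],
    ["ali", "alice in wonderland", "bobby", "tom"],
    top_n=4,
)
for name, sim, name_right in rows:
    print(name, round(sim, 6), name_right)
```

A real implementation would replace the nested loops with a single sparse matrix product, which is where the speed-up comes from; the loops here only make the arithmetic visible.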

The join_sim function is similar to a left join or join_asof but for strings instead of timestamps.

Installation

pip install polars_simed

Development

We use uv for Python package management. You also need Rust installed; see the Rust installation instructions. You never need to activate an environment yourself, because uv handles this. To get started, run:

# install python dependencies and compile the rust code
make install 
# run tests
make test

Usage

import polars as pl
import polars_simed as ps

df_left = pl.DataFrame(
    {
        "name": ["alice", "bob", "charlie", "david"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["ali", "alice in wonderland", "bobby", "tom"],
    }
)

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    top_n=4,
)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f32      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ alice ┆ 0.57735  ┆ ali                 │
│ alice ┆ 0.522233 ┆ alice in wonderland │
│ bob   ┆ 0.57735  ┆ bobby               │
└───────┴──────────┴─────────────────────┘

Performance

A benchmark can be executed with make run-bench. In general, performance depends heavily on the length (row count) of the dataframes. By default, the computation is parallelized over the left dataframe. However, several benchmarks showed that if the right dataframe is much larger than the left one and no normalization is applied, it is faster to parallelize over the right dataframe.

If no normalization is applied, performance is usually better because a small unsigned integer type, e.g. u16, is used for the sparse matrix multiplication. Otherwise, all types are 32 bits wide.
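The difference in element width can be illustrated with Python's standard array module. This is only a sketch of the storage argument, not the library's internals: the count values are made up, and the use of L2 normalization is an assumption.

```python
import math
from array import array

# Hypothetical trigram counts for one string (raw counts, no normalization).
raw_counts = [1, 0, 2, 1]

# Unnormalized: small non-negative integers fit in an unsigned 16-bit array.
as_u16 = array("H", raw_counts)

# Normalized: dividing by the L2 norm yields fractions, so a float type is needed.
norm = math.sqrt(sum(c * c for c in raw_counts))
as_f32 = array("f", [c / norm for c in raw_counts])

# Bytes per element for each representation.
print(as_u16.itemsize, as_f32.itemsize)
```

Narrower elements mean less memory traffic per multiplication, which is consistent with the benchmark observation above.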

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.


