Fast similarity join for polars DataFrames.

polars_sim

Description

Implements an approximate join of two polars dataframes based on string columns.

Currently, the strings are vectorized on the fly with a fixed vectorization scheme. The resulting vectors are then used in a sparse matrix multiplication combined with a top-n selection, which produces the cosine similarities of the individual string pairs.
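
The pipeline described above (vectorize, multiply sparse vectors, keep the top n) can be sketched in plain Python. This is only an illustration, not the package's Rust implementation: the helper names trigram_counts, normalize and join_sim_sketch are hypothetical, and the vectorization (character 3-grams counted inside each whitespace-separated token) is an assumption, chosen because it is consistent with the similarity values shown in the Usage section.

```python
import heapq
import math
from collections import Counter, defaultdict


def trigram_counts(text, n=3):
    """Count character n-grams inside each whitespace-separated token (assumed scheme)."""
    counts = Counter()
    for token in text.split():
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


def normalize(vec):
    """Scale a sparse vector to unit L2 length; empty vectors stay empty."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def join_sim_sketch(left, right, top_n=1):
    """Return (left, similarity, right) triples for the top_n best matches per left string."""
    right_vecs = [normalize(trigram_counts(s)) for s in right]
    # Inverted index: n-gram -> [(right row, weight)], i.e. the transposed sparse matrix.
    index = defaultdict(list)
    for j, vec in enumerate(right_vecs):
        for gram, weight in vec.items():
            index[gram].append((j, weight))

    rows = []
    for s in left:
        scores = defaultdict(float)
        # One row of the sparse product: accumulate only where n-grams overlap.
        for gram, weight in normalize(trigram_counts(s)).items():
            for j, weight_right in index.get(gram, ()):
                scores[j] += weight * weight_right
        # Top-n selection over the non-zero similarities of this row.
        best = heapq.nlargest(top_n, scores.items(), key=lambda kv: kv[1])
        rows.extend((s, sim, right[j]) for j, sim in best)
    return rows


left = ["alice", "bob", "charlie", "david"]
right = ["ali", "alice in wonderland", "bobby", "tom"]
for row in join_sim_sketch(left, right, top_n=4):
    print(row)
```

Because both vectors are scaled to unit length before the multiplication, each accumulated dot product is a cosine similarity. Left strings that share no n-gram with any right string produce no rows in this sketch.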

The join_sim function is similar to a left join or join_asof but for strings instead of timestamps.

Installation

pip install polars_sim

Development

We use uv for Python package management. You also need Rust installed, see install rust. You never have to activate an environment yourself, since uv handles this. To get started, run

# install python dependencies and compile the rust code
make install 
# run tests
make test

Usage

import polars as pl
import polars_sim as ps

df_left = pl.DataFrame(
    {
        "name": ["alice", "bob", "charlie", "david"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["ali", "alice in wonderland", "bobby", "tom"],
    }
)

# for each row of df_left, keep up to top_n most similar rows of df_right
df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    top_n=4,
)

print(df)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f32      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ alice ┆ 0.57735  ┆ ali                 │
│ alice ┆ 0.522233 ┆ alice in wonderland │
│ bob   ┆ 0.57735  ┆ bobby               │
└───────┴──────────┴─────────────────────┘

Performance

A benchmark can be executed with make run-bench. In general, the performance depends heavily on the length of the dataframes. By default, the computation is parallelized over the left dataframe. However, several benchmarks showed that if the right dataframe is much bigger than the left dataframe and no normalization is applied, it is faster to parallelize over the right dataframe.

If no normalization is applied, the performance is usually better, since a small unsigned integer type, e.g. u16, is used for the sparse matrix multiplication. Otherwise, all types are 32 bits wide.
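
To illustrate why skipping normalization allows a narrower type, here is a rough sketch in plain Python. It rests on assumptions and does not show the package's internals: the helpers ngram_set, raw_score and cosine_score are hypothetical, and character 3-grams are assumed as the features. Without normalization a score is just a count of shared n-grams, a small non-negative integer, whereas normalized scores are fractions and need a floating-point type.

```python
import math
from array import array


def ngram_set(text, n=3):
    """Distinct character n-grams of a string (assumed feature scheme)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def raw_score(a, b):
    """Unnormalized score: number of shared n-grams, always a whole number."""
    return len(ngram_set(a) & ngram_set(b))


def cosine_score(a, b):
    """Normalized score: cosine similarity of the binary n-gram vectors."""
    grams_a, grams_b = ngram_set(a), ngram_set(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / math.sqrt(len(grams_a) * len(grams_b))


pairs = [("alice", "ali"), ("bob", "bobby"), ("alice", "tom")]

# "H" is an unsigned short (typically 2 bytes), comparable to u16.
raw = array("H", (raw_score(a, b) for a, b in pairs))
# "f" is a single-precision float (typically 4 bytes), comparable to f32.
normalized = array("f", (cosine_score(a, b) for a, b in pairs))

print(list(raw), raw.itemsize)
print(list(normalized), normalized.itemsize)
```

Halving the element size means less memory traffic per multiplication, which is one plausible reason for the speed difference described above.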

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.
