
Fast similarity join for polars DataFrames.


polars_sim

Description

Implements an approximate join of two polars dataframes based on string columns.

Currently, the strings are vectorized on the fly with a fixed scheme. The resulting vectors are then multiplied as sparse matrices, combined with a top-n selection. This produces the cosine similarities of the individual string pairs.
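To make the idea concrete, here is a minimal pure-Python sketch of vectorize, multiply, and select top-n. It is not the library's implementation: the vectorization (character 3-grams counted within whitespace-separated words) is an assumption, dictionaries stand in for the sparse matrices, and the function names `trigram_counts`, `cosine` and `top_n_matches` are hypothetical. This assumption happens to give the same scores as the example output in the Usage section, but that does not confirm it is what polars_sim does internally.

```python
import math
from collections import Counter


def trigram_counts(text, n=3):
    """Count character n-grams inside each whitespace-separated word.

    Assumed vectorization for illustration only; words shorter than n
    contribute nothing.
    """
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_n_matches(left, right, top_n):
    """For each left string, keep the top_n right strings with similarity > 0."""
    right_vecs = [trigram_counts(r) for r in right]
    rows = []
    for l in left:
        lv = trigram_counts(l)
        scored = [(cosine(lv, rv), r) for rv, r in zip(right_vecs, right)]
        scored = [(s, r) for s, r in scored if s > 0]
        # Stable sort: ties keep the order of the right-hand input.
        scored.sort(key=lambda pair: -pair[0])
        rows.extend((l, s, r) for s, r in scored[:top_n])
    return rows
```

Calling `top_n_matches(["alice", "bob", "charlie", "david"], ["ali", "alice in wonderland", "bobby", "tom"], 4)` returns three rows, because pairs with zero similarity are dropped before the top-n cut.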

The join_sim function is similar to a left join or join_asof but for strings instead of timestamps.

Installation

pip install polars_sim

Development

We use uv for Python package management. You also need Rust to be installed, see install rust. You never need to activate an environment yourself; uv handles this. To get started, run:

# install python dependencies and compile the rust code
make install 
# run tests
make test

Usage

import polars as pl
import polars_sim as ps

df_left = pl.DataFrame(
    {
        "name": ["alice", "bob", "charlie", "david"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["ali", "alice in wonderland", "bobby", "tom"],
    }
)

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    top_n=4,
)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f32      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ alice ┆ 0.57735  ┆ ali                 │
│ alice ┆ 0.522233 ┆ alice in wonderland │
│ bob   ┆ 0.57735  ┆ bobby               │
└───────┴──────────┴─────────────────────┘

Performance

A benchmark can be run with make run-bench. In general, performance depends heavily on the number of rows in the dataframes. By default, the computation is parallelized over the left dataframe. However, several benchmarks showed that if the right dataframe is much larger than the left dataframe and no normalization is applied, it is faster to parallelize over the right dataframe.
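The two parallelization strategies can be sketched as follows. This is only an illustration of the structure, not the library's Rust code: `score` is a stand-in similarity, the function names and the `chunks` parameter are hypothetical, and Python threads do not give real CPU parallelism here. The point is that splitting the work over the right side produces partial top-n candidates per left row, which (in this sketch) need an extra merge step.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def score(a, b):
    # Stand-in similarity (number of shared characters), NOT the library's metric.
    return len(set(a) & set(b))


def top_n_over_left(left, right, top_n):
    """One task per left row; each task scans the whole right side."""
    def work(l):
        return heapq.nlargest(top_n, ((score(l, r), r) for r in right))

    with ThreadPoolExecutor() as pool:
        return dict(zip(left, pool.map(work, left)))


def top_n_over_right(left, right, top_n, chunks=2):
    """One task per chunk of the right side, followed by a merge per left row."""
    parts = [right[i::chunks] for i in range(chunks)]

    def work(part):
        return {
            l: heapq.nlargest(top_n, ((score(l, r), r) for r in part))
            for l in left
        }

    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(work, parts))

    # The global top-n is contained in the union of the per-chunk top-n lists.
    return {
        l: heapq.nlargest(top_n, (c for p in partials for c in p[l]))
        for l in left
    }
```

Both functions return the same matches; which one is faster in practice depends on the relative sizes of the two sides, as described above.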

If no normalization is applied, performance is usually better because a small unsigned integer type, e.g. u16, is used for the sparse matrix multiplication. Otherwise, all types are 32 bits wide.
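A small sketch of why skipping normalization allows integer types: raw count vectors and their dot products are whole numbers, while normalized vectors need floating point. The count values below are made up for illustration, and Python's `array` module is used only to show element sizes; it says nothing about how polars_sim stores its matrices.

```python
import math
from array import array

# Hypothetical n-gram count vectors over a shared vocabulary of four n-grams.
left_counts = array("H", [1, 1, 1, 0])   # "H" = unsigned 16-bit integers
right_counts = array("H", [1, 0, 0, 2])

# Without normalization the dot product stays an integer.
raw = sum(a * b for a, b in zip(left_counts, right_counts))

# With normalization each vector is scaled to unit length, so the entries
# become fractions and are stored here as 32-bit floats.
left_norm = math.sqrt(sum(c * c for c in left_counts))
right_norm = math.sqrt(sum(c * c for c in right_counts))
left_unit = array("f", [c / left_norm for c in left_counts])
right_unit = array("f", [c / right_norm for c in right_counts])
normalized = sum(a * b for a, b in zip(left_unit, right_unit))

# Bytes per element: 2 for the u16 counts, 4 for the f32 unit vectors.
sizes = (left_counts.itemsize, left_unit.itemsize)
```

Here `raw` is an integer overlap count, while `normalized` is the cosine similarity of the same two vectors.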

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.
