Fast similarity join for polars DataFrames.

polars_sim

Description

Implements an approximate join of two polars dataframes based on string columns.

Currently, the strings are vectorized on the fly with a fixed vectorization scheme. The resulting sparse vectors are then multiplied in a sparse matrix multiplication combined with a top-n selection, which yields the cosine similarities of the individual string pairs.
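To make the pipeline above more concrete, here is a minimal pure-Python sketch of the idea: vectorize each string, compute cosine similarities, and keep the top n matches per left string. It is not the library's implementation. The vectorization used here (lowercased character trigrams within whitespace-separated words) is an assumption, and the names ngram_vector, cosine and join_sim_sketch are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=3):
    """Sparse, L2-normalised character n-gram count vector (assumed scheme)."""
    counts = Counter(
        word[i:i + n]
        for word in text.lower().split()
        for i in range(len(word) - n + 1)  # empty for words shorter than n
    )
    norm = sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()} if norm else {}


def cosine(u, v):
    """Dot product of two normalised sparse vectors, i.e. their cosine."""
    return sum(w * v[gram] for gram, w in u.items() if gram in v)


def join_sim_sketch(left, right, ntop):
    """Return (left, sim, right) rows for the ntop best non-zero matches."""
    right_vecs = [ngram_vector(s) for s in right]
    rows = []
    for s in left:
        u = ngram_vector(s)
        scored = [(cosine(u, v), r) for v, r in zip(right_vecs, right)]
        scored = [pair for pair in scored if pair[0] > 0]  # drop non-matches
        scored.sort(key=lambda pair: -pair[0])  # best match first
        rows.extend((s, sim, r) for sim, r in scored[:ntop])
    return rows


rows = join_sim_sketch(
    ["Alice", "Bob", "Charlie", "David"],
    ["Ali", "Alice in Wonderland", "Bobby", "Tom"],
    ntop=4,
)
for row in rows:
    print(row)
```

This sketch compares every left string with every right string, which is fine for illustration but scales poorly. The sparse matrix multiplication with top-n selection avoids materializing all pairs.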

The join_sim function is similar to join_asof but for strings instead of timestamps.

Installation

pip install polars_sim

Development

We use uv for Python package management. You also need Rust installed; see the Rust installation instructions. You never need to activate an environment yourself, since uv handles this. To get started, run:

# create a virtual environment
uv venv --seed -p 3.11
# install dependencies
uv pip install -e .
# install dev dependencies
uv pip install -r requirements.txt
# compile rust code
make install
# run tests
make test

Usage

import polars as pl
import polars_sim as ps

df_left = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["Ali", "Alice in Wonderland", "Bobby", "Tom"],
    }
)

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    ntop=4,
)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f64      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ Alice ┆ 0.57735  ┆ Ali                 │
│ Alice ┆ 0.522233 ┆ Alice in Wonderland │
│ Bob   ┆ 0.57735  ┆ Bobby               │
└───────┴──────────┴─────────────────────┘

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.
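As a rough illustration of the general idea behind such an algorithm, the sketch below multiplies two sparse matrices row by row and keeps only the n largest entries of each result row, so the full product is never stored. It is a simplified pure-Python sketch, not the code of sparse_dot_topn or of this package. The function name topn_sparse_product and the dict-based matrix layout are assumptions made for the example.

```python
import heapq


def topn_sparse_product(a_rows, b_rows, ntop):
    """Row-wise sparse product C = A @ B, keeping the ntop largest entries per row.

    a_rows: list of {column: value} dicts, one per row of A.
    b_rows: {row: {column: value}} dict holding the non-empty rows of B.
    Returns one list of (column, value) pairs per row of A, largest first.
    """
    result = []
    for a_row in a_rows:
        acc = {}  # accumulates the non-zero entries of the current row of C
        for k, a_val in a_row.items():
            for j, b_val in b_rows.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        # Keep only the ntop best entries before moving on to the next row.
        result.append(heapq.nlargest(ntop, acc.items(), key=lambda kv: kv[1]))
    return result


a = [{0: 1.0, 1: 2.0}, {2: 3.0}]
b = {0: {0: 1.0, 1: 4.0}, 1: {0: 0.5, 2: 1.5}, 2: {1: 2.0}}
top = topn_sparse_product(a, b, ntop=2)
print(top)
```

When the rows of A and the columns of B are L2-normalised string vectors, the entries of C are cosine similarities, and the per-row top-n selection yields the best matches for each left string.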
