
Fast similarity join for polars DataFrames.


polars_sim

Description

Implements an approximate join of two polars dataframes based on string columns.

Currently, the strings are vectorized on the fly using a fixed scheme. The resulting sparse matrices are then multiplied, and only the top-n entries per row are kept. This yields the cosine similarities of the best-matching string pairs.
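To make the pipeline concrete, here is a minimal pure-Python sketch of the three steps: vectorize, take sparse dot products, keep the top n. The vectorization used here (lowercased character 3-grams taken within each whitespace-separated word, L2-normalized) is an assumption chosen for illustration. The names `trigram_vector` and `top_n_cosine` are hypothetical and not part of polars_sim, and the library's actual Rust implementation may differ.

```python
import math
from collections import Counter


def trigram_vector(text, n=3):
    """Map a string to a sparse, L2-normalized vector of character n-grams.

    Assumption (not confirmed by the source): lowercase the text, split it on
    whitespace, and take n-grams inside each word.
    """
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # strings shorter than n characters have no n-grams
    return {gram: c / norm for gram, c in counts.items()}


def top_n_cosine(left, right, ntop):
    """Return (left, similarity, right) triples, at most ntop per left string."""
    right_vecs = [trigram_vector(r) for r in right]
    result = []
    for l in left:
        lv = trigram_vector(l)
        scored = []
        for r, rv in zip(right, right_vecs):
            # sparse dot product of two unit vectors = cosine similarity
            sim = sum(w * rv.get(gram, 0.0) for gram, w in lv.items())
            if sim > 0:
                scored.append((l, sim, r))
        scored.sort(key=lambda t: t[1], reverse=True)
        result.extend(scored[:ntop])
    return result


rows = top_n_cosine(
    ["Alice", "Bob", "Charlie", "David"],
    ["Ali", "Alice in Wonderland", "Bobby", "Tom"],
    ntop=4,
)
for left, sim, right in rows:
    print(left, round(sim, 6), right)
```

Under this assumed vectorization, "Alice" and "Ali" share one of three 3-grams, giving 1/sqrt(3) ≈ 0.57735, which agrees with the score shown in the Usage output. A real implementation would replace the nested Python loops with a single sparse matrix product.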

The join_sim function is similar to join_asof, but it matches on similar strings instead of nearby timestamps.

Installation

pip install polars_sim

Development

We use uv for Python package management. You also need Rust installed; see install rust. You never need to activate an environment yourself, because uv handles this. To get started, run:

# create a virtual environment
uv venv --seed -p 3.11
# install dependencies
uv pip install -e .
# install dev dependencies
uv pip install -r requirements.txt
# compile rust code
make install
# run tests
make test

Usage

import polars as pl
import polars_sim as ps

df_left = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["Ali", "Alice in Wonderland", "Bobby", "Tom"],
    }
)

# for each left row, keep up to ntop most similar right rows
df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    ntop=4,
)
print(df)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f64      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ Alice ┆ 0.57735  ┆ Ali                 │
│ Alice ┆ 0.522233 ┆ Alice in Wonderland │
│ Bob   ┆ 0.57735  ┆ Bobby               │
└───────┴──────────┴─────────────────────┘

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.
