Fast similarity join for polars DataFrames.

polars_sim

Description

Implements an approximate join of two polars dataframes based on string columns.

Currently, the strings are vectorized on the fly with a fixed scheme. The resulting sparse matrices are multiplied, and a top-n selection keeps only the best matches for each row. The scores produced this way are the cosine similarities of the retained string pairs.
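
To make the cosine similarity of a string pair concrete, here is a minimal sketch in plain Python. The vectorization is an assumption made for illustration (counts of character 3-grams taken inside each whitespace-separated word); it is not a statement about what polars_sim does internally, and the helper names trigram_counts and cosine_similarity are hypothetical.

```python
import math
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count character 3-grams inside each whitespace-separated word.

    Assumed vectorization, chosen only for illustration.
    """
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - 2):
            counts[word[i : i + 3]] += 1
    return counts


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of the two 3-gram count vectors (0.0 if either is empty)."""
    va, vb = trigram_counts(a), trigram_counts(b)
    # Dot product over the 3-grams of `a`; missing 3-grams count as 0 in `vb`.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


print(round(cosine_similarity("Alice", "Ali"), 5))  # 0.57735
```

"Alice" has three 3-grams and "Ali" has one, which they share, so the similarity is 1 / sqrt(3). Strings shorter than three characters produce an empty vector under this assumed scheme and therefore score 0.0 against everything.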

The join_sim function is similar to join_asof but for strings instead of timestamps.

Installation

pip install polars_sim

Development

We use uv for Python package management. You also need Rust installed; see the Rust installation instructions. You never need to activate an environment yourself, because uv handles this. To get started, run:

# create a virtual environment
uv venv --seed -p 3.11
# install dependencies
uv pip install -e .
# install dev dependencies
uv pip install -r requirements.txt
# compile rust code
make install
# run tests
make test

Usage

import polars as pl
import polars_sim as ps

df_left = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)

df_right = pl.DataFrame(
    {
        "name": ["Ali", "Alice in Wonderland", "Bobby", "Tom"],
    }
)

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    ntop=4,
)

shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f64      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ Alice ┆ 0.57735  ┆ Ali                 │
│ Alice ┆ 0.522233 ┆ Alice in Wonderland │
│ Bob   ┆ 0.57735  ┆ Bobby               │
└───────┴──────────┴─────────────────────┘

References

The implementation is based on an algorithm used in sparse_dot_topn, which itself is an improvement of the scipy sparse matrix multiplication.
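
The general idea of combining a sparse product with a per-row top-n selection can be sketched in plain Python. This is an illustrative sketch only, not the code of sparse_dot_topn or polars_sim: sparse rows are represented as dicts mapping a feature to its weight, and the function name top_n_sparse_product is hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_sparse_product(left_rows, right_rows, ntop):
    """Return (left_index, right_index, score) for the ntop best scores per left row.

    Each row is a dict {feature: weight}; the score is the dot product of two rows.
    """
    # Inverted index: feature -> list of (right_row_index, weight).
    index = defaultdict(list)
    for j, row in enumerate(right_rows):
        for feature, weight in row.items():
            index[feature].append((j, weight))

    results = []
    for i, row in enumerate(left_rows):
        # Accumulate dot products only for right rows sharing a feature.
        scores = defaultdict(float)
        for feature, weight in row.items():
            for j, other_weight in index.get(feature, ()):
                scores[j] += weight * other_weight
        # Keep only the ntop largest scores of this row before moving on.
        best = heapq.nlargest(ntop, scores.items(), key=lambda item: item[1])
        results.extend((i, j, score) for j, score in best)
    return results


left = [{"a": 1.0, "b": 1.0}, {"c": 1.0}]
right = [{"a": 2.0}, {"a": 0.5, "b": 0.5}, {"d": 1.0}]
print(top_n_sparse_product(left, right, ntop=1))  # [(0, 0, 2.0)]
```

Because the selection happens row by row, only the scores of one left row are held in memory at a time, and pairs without any shared feature never appear in the result. If the rows are normalized to unit length beforehand, the dot products are cosine similarities.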
