Fast similarity join for polars DataFrames.

polars_sim

Description

Implements an approximate join of two polars dataframes based on string columns.

Currently, the strings are vectorized on the fly with a fixed vectorization scheme. The resulting vectors are fed into a sparse matrix multiplication combined with a top-n selection, which yields the cosine similarities of the retained string pairs.
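To make the idea of "vectorize, then take cosine similarities" concrete, here is a minimal pure-Python sketch. It assumes character trigrams as the vectorization, which is only an illustrative choice: the description above does not say which fixed vectorization polars_sim actually uses. The helper names `ngram_vector` and `cosine` are hypothetical and not part of the package.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Return an L2-normalised sparse vector of character n-gram counts.

    Assumption: lower-cased character n-grams with one space of padding on
    each side. This is an illustrative scheme, not necessarily the one
    polars_sim applies.
    """
    padded = f" {text.lower()} "
    grams = Counter(padded[i:i + n] for i in range(max(len(padded) - n + 1, 1)))
    norm = math.sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()}


def cosine(u, v):
    """Dot product of two L2-normalised sparse vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(g, 0.0) for g, w in u.items())


bob = ngram_vector("Bob")
bobby = ngram_vector("Bobby")
tom = ngram_vector("Tom")

print(cosine(bob, bobby))  # two shared trigrams out of 3 and 5
print(cosine(bob, tom))    # no shared trigrams
```

Because each vector is normalised up front, the similarity of any pair reduces to a sparse dot product, which is what allows the whole join to be expressed as one sparse matrix multiplication.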

The join_sim function is similar to join_asof but for strings instead of timestamps.

Installation

pip install polars_sim

Usage

import polars as pl
import polars_sim as ps

df_left = pl.DataFrame({
    "name" : ["Alice", "Bob", "Charlie", "David"],
})

df_right = pl.DataFrame({
    "name" : ["Ali", "Bobby", "Tom"],
})

df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    ntop=2,
)

shape: (2, 5)
┌─────┬───────┬─────┬──────┬────────────┐
│ row ┆ name  ┆ col ┆ data ┆ name_right │
│ --- ┆ ---   ┆ --- ┆ ---  ┆ ---        │
│ u32 ┆ str   ┆ u32 ┆ i64  ┆ str        │
╞═════╪═══════╪═════╪══════╪════════════╡
│ 0   ┆ Alice ┆ 1   ┆ 2    ┆ Alicia     │
│ 1   ┆ Bob   ┆ 2   ┆ 1    ┆ Bobby      │
└─────┴───────┴─────┴──────┴────────────┘

Notes

The implementation is based on an algorithm used in sparse_dot_topn, which is itself an improvement on SciPy's sparse matrix multiplication.
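The benefit of fusing the top-n selection into the multiplication is that the full similarity matrix never has to be materialised: each output row is computed, reduced to its best n entries, and then discarded. The following is a rough pure-Python sketch of that idea, not the code of sparse_dot_topn or polars_sim. The function name `sparse_matmul_topn` and the dict-based row representation are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def sparse_matmul_topn(left_rows, right_rows, ntop):
    """For each left row, return the ntop right rows with the largest dot product.

    Rows are sparse vectors stored as {feature: weight} dicts (hypothetical
    representation; real implementations typically use CSR arrays).
    """
    # Inverted index over the right side: feature -> [(right row index, weight)].
    index = defaultdict(list)
    for j, row in enumerate(right_rows):
        for feature, weight in row.items():
            index[feature].append((j, weight))

    result = []
    for row in left_rows:
        # Accumulate one row of the product; only non-zero entries are stored.
        scores = defaultdict(float)
        for feature, weight in row.items():
            for j, other in index.get(feature, ()):
                scores[j] += weight * other
        # Keep only the ntop largest entries of this row before moving on.
        result.append(heapq.nlargest(ntop, scores.items(), key=lambda kv: kv[1]))
    return result


left = [{"a": 1.0, "b": 1.0}, {"c": 1.0}]
right = [{"a": 0.5}, {"a": 1.0, "b": 1.0}, {"d": 1.0}]
print(sparse_matmul_topn(left, right, ntop=2))
```

Right rows that share no feature with a left row never enter `scores`, so a left row without any overlap simply yields an empty list.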

Download files

Source Distribution

polars_sim-0.1.0.tar.gz (30.8 kB view details)

Uploaded Source

Built Distributions

polars_sim-0.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.9 MB view details)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ x86-64

polars_sim-0.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.8 MB view details)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ ARM64

polars_sim-0.1.0-cp38-abi3-macosx_11_0_arm64.whl (3.3 MB view details)

Uploaded CPython 3.8+ macOS 11.0+ ARM64

polars_sim-0.1.0-cp38-abi3-macosx_10_12_x86_64.whl (3.6 MB view details)

Uploaded CPython 3.8+ macOS 10.12+ x86-64

File details

Details for the file polars_sim-0.1.0.tar.gz.

File metadata

  • Download URL: polars_sim-0.1.0.tar.gz
  • Upload date:
  • Size: 30.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.7.4

File hashes

Hashes for polars_sim-0.1.0.tar.gz
Algorithm Hash digest
SHA256 6149687f772f498aebbd0b51264d20eaed9bc7fadacf4b90d70f3b2a98bffbd7
MD5 df05b6cba9d78b4f0a000b49f50e2943
BLAKE2b-256 d5e70d60501a924b2b2165df349eaf2c7856efa9002318eeab23af7970545bd5

See more details on using hashes here.

File details

Details for the file polars_sim-0.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for polars_sim-0.1.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 aaf66c82503c46dfcf3e4f949a83208d2200917ca135d625b5949cf14db25196
MD5 7d2676ead48407b2ce639533b3b20ea5
BLAKE2b-256 5aebee31c5bf0dfd99ddc6f944184ce1175ce159495596b84e5395b809e670a5

File details

Details for the file polars_sim-0.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for polars_sim-0.1.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 9debcdc7c8ea8a80a918dc217da0801032bd9340ba81d40c3dd7917807f279d3
MD5 8946b93f4dd1d04e15768084557e0ab5
BLAKE2b-256 5a097b1edf6622d753d829430a666587552adb98215f74db383110553acc127b

File details

Details for the file polars_sim-0.1.0-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for polars_sim-0.1.0-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 c8495bc600b45c10ce75790563af5ef7725b8a108adb2a851d2ba0cc876fb168
MD5 0f99a1a904104779aae1e8376886963c
BLAKE2b-256 7dbca15c2443c6ea3ce3b4b6fd01adb1cd4aec45f834d096154d0d50ce2c76d0

File details

Details for the file polars_sim-0.1.0-cp38-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for polars_sim-0.1.0-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 e4b03aab24216b40f5b8bdb7d7ae4ed3a362611bab13ac6d969205004e3e9fe4
MD5 6b216edbba7cb02cfb39e2214e5140c2
BLAKE2b-256 2da6cce27312863a74fb110a5e71b5faff0c5c5c6c70977208d96deefa31ce75
