Fast similarity join for polars DataFrames.
polars_sim
Description
Implements an approximate join of two polars dataframes based on string columns.
Right now, we use a fixed vectorization that is applied on the fly. The resulting vectors feed a sparse matrix multiplication combined with a top-n selection, which produces the cosine similarities of the individual string pairs.
The join_sim function is similar to join_asof, but for strings instead of timestamps.
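To make the vectorize, multiply, and select-top-n idea more concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the vectorization (lowercased character trigrams taken within each word) is an assumption, chosen because it happens to be consistent with the similarity values shown in the Usage section, and the helper names trigram_vector, normalize, and top_n_cosine are hypothetical. The dot product of two L2-normalized sparse vectors is their cosine similarity, which is what one entry of the sparse matrix product would hold.

```python
import math
from collections import Counter


def trigram_vector(text, n=3):
    """Sparse count vector of character n-grams (assumed vectorization, not the library's)."""
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def normalize(vec):
    """Scale a sparse vector to unit length; an empty vector stays empty."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def top_n_cosine(left, right, ntop):
    """For each left string, keep at most ntop right strings with non-zero cosine similarity."""
    right_vecs = [normalize(trigram_vector(s)) for s in right]
    matches = []
    for l in left:
        lv = normalize(trigram_vector(l))
        scored = []
        for r, rv in zip(right, right_vecs):
            # Dot product of unit vectors = cosine similarity.
            sim = sum(w * rv.get(gram, 0.0) for gram, w in lv.items())
            if sim > 0:
                scored.append((l, sim, r))
        scored.sort(key=lambda t: t[1], reverse=True)
        matches.extend(scored[:ntop])
    return matches


pairs = top_n_cosine(
    ["Alice", "Bob", "Charlie", "David"],
    ["Ali", "Alice in Wonderland", "Bobby", "Tom"],
    ntop=4,
)
for left_name, sim, right_name in pairs:
    print(f"{left_name:8s} {sim:.6f} {right_name}")
```

The sketch compares every pair of strings, which is quadratic in the number of rows; a sparse matrix multiplication avoids this by only touching pairs that share at least one feature.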
Installation
pip install polars_sim
Development
We use uv for Python package management. You also need Rust installed, see install rust. You never need to activate an environment yourself; uv handles this. To get started, run:
# create a virtual environment
uv venv --seed -p 3.11
# install dependencies
uv pip install -e .
# install dev dependencies
uv pip install -r requirements.txt
# compile rust code
make install
# run tests
make test
Usage
import polars as pl
import polars_sim as ps

df_left = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)
df_right = pl.DataFrame(
    {
        "name": ["Ali", "Alice in Wonderland", "Bobby", "Tom"],
    }
)
df = ps.join_sim(
    df_left,
    df_right,
    on="name",
    ntop=4,
)
shape: (3, 3)
┌───────┬──────────┬─────────────────────┐
│ name  ┆ sim      ┆ name_right          │
│ ---   ┆ ---      ┆ ---                 │
│ str   ┆ f64      ┆ str                 │
╞═══════╪══════════╪═════════════════════╡
│ Alice ┆ 0.57735  ┆ Ali                 │
│ Alice ┆ 0.522233 ┆ Alice in Wonderland │
│ Bob   ┆ 0.57735  ┆ Bobby               │
└───────┴──────────┴─────────────────────┘
References
The implementation is based on an algorithm used in sparse_dot_topn, which in turn improves on SciPy's sparse matrix multiplication.
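As a rough illustration of why combining the multiplication with the top-n selection helps, the sketch below multiplies two sparse matrices row by row and keeps only the ntop largest entries of each result row, so the full product is never stored. It is a simplified stand-in, not the code of sparse_dot_topn or of this package: rows are plain dicts mapping a feature to a weight, and the function name dot_topn_sketch is hypothetical.

```python
import heapq
from collections import defaultdict


def dot_topn_sketch(a_rows, b_rows, ntop):
    """For each row of A, return the ntop largest entries of A @ B.T as (b_index, value)."""
    # Inverted index over B: feature -> [(row index in B, weight)].
    index = defaultdict(list)
    for j, row in enumerate(b_rows):
        for feature, weight in row.items():
            index[feature].append((j, weight))

    result = []
    for row in a_rows:
        # Accumulate products only for rows of B that share a feature with this row.
        acc = defaultdict(float)
        for feature, weight in row.items():
            for j, other in index.get(feature, ()):
                acc[j] += weight * other
        # Select the top entries immediately instead of keeping the whole result row.
        result.append(heapq.nlargest(ntop, acc.items(), key=lambda kv: kv[1]))
    return result


a = [{"x": 1.0, "y": 2.0}, {"z": 1.0}]
b = [{"x": 3.0}, {"y": 1.0, "z": 4.0}, {"x": 1.0, "y": 2.0}]
top = dot_topn_sketch(a, b, ntop=2)
print(top)
```

With unit-length rows, the retained values are cosine similarities, and memory use is bounded by ntop entries per row of the left matrix.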