Rust-based Python library that provides text vectorization and cosine distance calculation, optimized for use with Polars DataFrames.
polars_countvectorizer
polars_countvectorizer is a Rust-based Python library that provides text vectorization and cosine distance calculation, optimized for use with Polars DataFrames. The library offers a function to compute the cosine distance between two text columns in a Polars DataFrame, making it useful for text similarity and document analysis tasks.
Features
- Cosine Distance Calculation: Computes the cosine distance between two text columns in a Polars DataFrame.
- CountVectorizer Integration: Uses the CountVectorizer from the linfa_preprocessing crate to vectorize text data.
- Optimized for Performance: Leverages parallel processing via rayon for faster computations.
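To make the two core features concrete, here is a minimal pure-Python sketch of count vectorization followed by cosine distance. It is not the library's implementation: the lowercased whitespace tokenization, the helper names `count_vector` and `cosine_distance`, and the handling of empty documents are all assumptions made for illustration, and the Rust code may tokenize or normalize differently.

```python
import math
from collections import Counter


def count_vector(text: str) -> Counter:
    """Count lowercased, whitespace-separated tokens (an assumed tokenization)."""
    return Counter(text.lower().split())


def cosine_distance(a: str, b: str) -> float:
    """Return 1 - cosine similarity of the two texts' token-count vectors."""
    va, vb = count_vector(a), count_vector(b)
    # Only tokens present in both documents contribute to the dot product.
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an empty document is maximally distant.
        return 1.0
    # Clamp at zero so floating-point rounding never yields a tiny negative distance.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))
```

Under these assumptions, two documents with the same words in a different order have a distance of 0, and documents with no words in common have a distance of 1.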
Installation
To install the package, you will need to have Python and Rust installed. Then, use pip to install the Python bindings:
pip install polars_countvectorizer
Alternatively, if you're developing locally, you can build the package using maturin. First, ensure maturin is installed:
pip install maturin
Then, run the following command to build and install the package:
maturin develop
Usage
import polars as pl
import polars_countvectorizer

df = pl.DataFrame({
    "doc1": ["apple orange banana", "dog cat mouse", "car bike bus"],
    "doc2": ["apple banana", "cat dog", "bus car bike"],
})

result = df.select(
    polars_countvectorizer.process_cosine_distances_py("doc1", "doc2")
)
print(result)
File details
Details for the file polars_countvectorizer-0.1.0-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: polars_countvectorizer-0.1.0-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 10.6 MB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b2ee0369984bb6362e24b486115a8b3f53e727b8e611ee7725d3842b3f5aac29 |
| MD5 | 5b9bb8781ba651f1e62751f3ea0a7f98 |
| BLAKE2b-256 | c272afb38cd3d0d60b33eb2c11e81ea940da891112306794bdb018d5e4dd7900 |
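The digests listed above can be used to check a downloaded wheel before installing it. The following is a minimal sketch using only Python's standard `hashlib`; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local path to the wheel is whatever location you downloaded it to.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

To use it, call `matches_published_digest` with the path of the downloaded wheel and the SHA256 value from the table; a result of `False` suggests the file is corrupted or is not the published artifact.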