


polars_countvectorizer

polars_countvectorizer is a Rust-based Python library that provides text vectorization and cosine distance calculation, optimized for use with Polars DataFrames. The library offers a function to compute the cosine distance between two text columns in a Polars DataFrame, making it useful for text similarity and document analysis tasks.

Features

  • Cosine Distance Calculation: Computes the cosine distance between two text columns in a Polars DataFrame.
  • CountVectorizer Integration: Uses the CountVectorizer from the linfa_preprocessing crate to vectorize text data.
  • Optimized for Performance: Leverages parallel processing via rayon for faster computations.
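
To make the two ideas in the feature list concrete, here is a minimal pure-Python sketch of count vectorization followed by cosine distance. It is not the library's implementation: the function names `count_vector` and `cosine_distance`, the lowercase whitespace tokenization, and the convention of returning 1.0 for empty text are all assumptions made for illustration.

```python
import math
from collections import Counter


def count_vector(text):
    """Map each token to its count (assumed: lowercase, whitespace-split)."""
    return Counter(text.lower().split())


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the two token-count vectors."""
    vec_a, vec_b = count_vector(text_a), count_vector(text_b)
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an empty text is maximally distant.
        return 1.0
    # Clamp to absorb floating-point rounding slightly below zero.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


pairs = [
    ("apple orange banana", "apple banana"),
    ("dog cat mouse", "cat dog"),
    ("car bike bus", "bus car bike"),
]
for a, b in pairs:
    print(a, "|", b, "->", round(cosine_distance(a, b), 4))
```

Under these assumptions, the first two pairs come out at roughly 0.18 (two shared words out of three), and the third at 0 because both texts contain the same words and word order does not affect the counts. The library's own tokenization rules may differ, so its results are not guaranteed to match this sketch.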

Installation

To install the package, you will need to have Python and Rust installed. Then, use pip to install the Python bindings:

pip install polars_countvectorizer

Alternatively, if you're developing locally, you can build the package using maturin. First, ensure maturin is installed:

pip install maturin

Then, run the following command to build and install the package:

maturin develop
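
After either installation route, a quick check that the module is importable can save some debugging. The following is a small sketch that uses only the standard library; the helper name `is_installed` is hypothetical and not part of the package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once the package has been installed into the active environment.
print(is_installed("polars_countvectorizer"))
```

Run this with the same interpreter (or virtual environment) that pip or maturin installed into, since each environment has its own set of packages.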

Usage

import polars as pl
import polars_countvectorizer

df = pl.DataFrame({
    "doc1": ["apple orange banana", "dog cat mouse", "car bike bus"],
    "doc2": ["apple banana", "cat dog", "bus car bike"]
})

result = df.select(
    polars_countvectorizer.process_cosine_distances_py("doc1", "doc2")
)

print(result)

Built Distribution

polars_countvectorizer-0.1.0-cp311-cp311-macosx_11_0_arm64.whl (10.6 MB), uploaded for CPython 3.11 on macOS 11.0+ ARM64.

No source distribution files are available for this release.

