Multi-threaded matrix multiplication and cosine similarity calculations.

Project description

ChunkDot

Multi-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices. Appropriate for calculating the K most similar items for a large number of items by chunking the item matrix representation (embeddings) and using Numba to accelerate the calculations.
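The chunking idea can be sketched in plain NumPy (a single-threaded illustration only, not ChunkDot's actual Numba implementation): normalize the rows, multiply one chunk of rows against all items at a time, and keep just the top-k scores per row so the full N x N similarity matrix is never materialized.

```python
import numpy as np

def chunked_topk_cosine(embeddings, top_k, chunk_size=1000):
    """Single-threaded sketch of chunked top-k cosine similarity.

    Normalizes all rows once, then processes the rows chunk by chunk,
    keeping only the top_k most similar items per row.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = normed.shape[0]
    indices = np.empty((n, top_k), dtype=np.int64)
    scores = np.empty((n, top_k))
    for start in range(0, n, chunk_size):
        chunk = normed[start:start + chunk_size]
        sims = chunk @ normed.T  # (chunk_size, n) cosine similarities
        # argpartition keeps the top_k columns per row without a full sort
        top = np.argpartition(sims, -top_k, axis=1)[:, -top_k:]
        indices[start:start + chunk_size] = top
        scores[start:start + chunk_size] = np.take_along_axis(sims, top, axis=1)
    return indices, scores
```

Because every row's best match is itself, each row's top-k set contains its own index with score 1.0; ChunkDot parallelizes this loop over chunks and sizes the chunks from the `max_memory` budget.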

Usage

pip install -U chunkdot

Dense embeddings

Calculate the 50 most similar and dissimilar items for 100K items.

import numpy as np
from chunkdot import cosine_similarity_top_k

embeddings = np.random.randn(100000, 256)
# using all your system's memory
cosine_similarity_top_k(embeddings, top_k=50)
# most dissimilar items using 20GB
cosine_similarity_top_k(embeddings, top_k=-50, max_memory=20E9)
<100000x100000 sparse matrix of type '<class 'numpy.float64'>'
 with 5000000 stored elements in Compressed Sparse Row format>
# with progress bar
cosine_similarity_top_k(embeddings, top_k=50, show_progress=True)
100%|███████████████████████████████████████████████████████████████| 129.0/129 [01:04<00:00,  1.80it/s]
<100000x100000 sparse matrix of type '<class 'numpy.float64'>'
  with 5000000 stored elements in Compressed Sparse Row format>
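The returned object is a standard SciPy CSR matrix, so per-item neighbors can be read with ordinary sparse indexing. A small hand-built matrix stands in for the real result here:

```python
import numpy as np
from scipy import sparse

# Tiny stand-in for the CSR matrix returned by cosine_similarity_top_k:
# row i holds the stored similarity scores for item i's kept neighbors.
similarities = sparse.csr_matrix(
    np.array([
        [0.0, 0.9, 0.1],
        [0.9, 0.0, 0.4],
        [0.1, 0.4, 0.0],
    ])
)
row = similarities[1]                      # neighbors of item 1
neighbor_ids = row.indices                 # column indices of stored values
neighbor_scores = row.data                 # matching similarity values
order = np.argsort(neighbor_scores)[::-1]  # sort descending by similarity
print(neighbor_ids[order])                 # [0 2]
```

Only the `top_k` stored values per row are present, which is why the 100000x100000 result above holds just 5,000,000 elements.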

Execution time

from timeit import timeit
import numpy as np
from chunkdot import cosine_similarity_top_k

embeddings = np.random.randn(100000, 256)
timeit(lambda: cosine_similarity_top_k(embeddings, top_k=50, max_memory=20E9), number=1)
58.611996899999994

Sparse embeddings

Calculate the 50 most similar and dissimilar items for 100K items. Each item is represented by a 10K-dimensional vector, and the embeddings matrix has a density of 0.005.

from scipy import sparse
from chunkdot import cosine_similarity_top_k

embeddings = sparse.rand(100000, 10000, density=0.005)
# using all your system's memory
cosine_similarity_top_k(embeddings, top_k=50)
# most dissimilar items using 20GB
cosine_similarity_top_k(embeddings, top_k=-50, max_memory=20E9)
<100000x100000 sparse matrix of type '<class 'numpy.float64'>'
 with 5000000 stored elements in Compressed Sparse Row format>
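What the call above computes can be checked by hand at a tiny scale: the cosine similarity between sparse rows is just their dot product after L2-normalizing each row. A plain SciPy sketch (not ChunkDot's chunked implementation):

```python
import numpy as np
from scipy import sparse

# Three sparse 4-dimensional items.
emb = sparse.csr_matrix(np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
]))
# L2 norm of each row, then scale every row to unit length.
norms = np.asarray(np.sqrt(emb.multiply(emb).sum(axis=1))).ravel()
normed = sparse.diags(1.0 / norms) @ emb
# All pairwise cosine similarities; the diagonal is 1 (self-similarity).
sims = (normed @ normed.T).toarray()
print(np.round(sims, 3))
```

The sparse-times-sparse-transpose product stays sparse throughout; only the final `.toarray()` densifies it, which ChunkDot avoids by keeping just the top-k entries per chunk.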

Execution time

from timeit import timeit
from scipy import sparse
from chunkdot import cosine_similarity_top_k

embeddings = sparse.rand(100000, 10000, density=0.005)
timeit(lambda: cosine_similarity_top_k(embeddings, top_k=50, max_memory=20E9), number=1)
51.87472256699999

Similarity calculation versus other embeddings

Given 20K items, find for each item the 10 most similar items in a separate collection of 10K items.

import numpy as np
from chunkdot import cosine_similarity_top_k

embeddings = np.random.randn(20000, 256)
other_embeddings = np.random.randn(10000, 256)

cosine_similarity_top_k(embeddings, embeddings_right=other_embeddings, top_k=10)
<20000x10000 sparse matrix of type '<class 'numpy.float64'>'
 with 200000 stored elements in Compressed Sparse Row format>

CosineSimilarityTopK scikit-learn transformer

Given a pandas DataFrame with 100K rows and

  • 2 numerical columns
  • 2 categorical columns with 500 categories each

use scikit-learn transformers, a StandardScaler for the numerical columns and a OneHotEncoder for the categorical columns, to build an embeddings matrix of dimensions 100K x 1002 and then calculate the 50 most similar rows for each row.

import numpy as np
import pandas as pd

n_rows = 100000
n_categories = 500
df = pd.DataFrame(
    {
        "A_numeric": np.random.rand(n_rows),
        "B_numeric": np.random.rand(n_rows),
        "C_categorical": np.random.randint(n_categories, size=n_rows),
        "D_categorical": np.random.randint(n_categories, size=n_rows),
    }
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from chunkdot import CosineSimilarityTopK

numeric_features = ["A_numeric", "B_numeric"]
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

categorical_features = ["C_categorical", "D_categorical"]
categorical_transformer = Pipeline(steps=[("encoder", OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

cos_sim = CosineSimilarityTopK(top_k=50)

pipe = Pipeline(steps=[("preprocessor", preprocessor), ("cos_sim", cos_sim)])
pipe.fit_transform(df)
<100000x100000 sparse matrix of type '<class 'numpy.float64'>'
	with 5000000 stored elements in Compressed Sparse Row format>
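The width of the combined embeddings matrix follows directly from the transformers: 2 scaled numeric columns plus one one-hot column per category, i.e. 2 + 500 + 500 = 1002. A smaller frame makes this easy to verify (hypothetical sizes, with deterministic category values so every category appears):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

n_rows, n_categories = 1000, 10
df_small = pd.DataFrame(
    {
        "A_numeric": np.random.rand(n_rows),
        "B_numeric": np.random.rand(n_rows),
        # tile so all n_categories values are guaranteed to occur
        "C_categorical": np.tile(np.arange(n_categories), n_rows // n_categories),
        "D_categorical": np.tile(np.arange(n_categories), n_rows // n_categories),
    }
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("scaler", StandardScaler())]), ["A_numeric", "B_numeric"]),
        ("cat", Pipeline([("encoder", OneHotEncoder())]), ["C_categorical", "D_categorical"]),
    ]
)
embeddings = preprocessor.fit_transform(df_small)
print(embeddings.shape)  # (1000, 22): 2 numeric + 10 + 10 one-hot columns
```

With the full 100K x 500-category frame, the same arithmetic gives the 100K x 1002 matrix that CosineSimilarityTopK consumes.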

Execution time

from timeit import timeit

timeit(lambda: pipe.fit_transform(df), number=1)
24.45172154181637

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chunkdot-0.6.0.tar.gz (10.9 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chunkdot-0.6.0-py3-none-any.whl (12.2 kB)

Uploaded Python 3

File details

Details for the file chunkdot-0.6.0.tar.gz.

File metadata

  • Download URL: chunkdot-0.6.0.tar.gz
  • Upload date:
  • Size: 10.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.10 Darwin/23.1.0

File hashes

Hashes for chunkdot-0.6.0.tar.gz
Algorithm Hash digest
SHA256 62556ecc66642f6062f17703d434a871522e439941a69c2e09edd519d81195d2
MD5 1fe3b8ca5b5532eddfed413d6ef4f885
BLAKE2b-256 a4d1f8abf405e92b6639cb74d5837dd7c6ee9bf5676c239d9ec6ff0e9df72255

See more details on using hashes here.

File details

Details for the file chunkdot-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: chunkdot-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 12.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.11.10 Darwin/23.1.0

File hashes

Hashes for chunkdot-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 dff6e04cc61a238ac23f64d6777135f38281efb7750d3560e05a405f3c2fde47
MD5 3e550514ff240741b18028dc4fbd4fd3
BLAKE2b-256 e85c7dcb8c90805cb78f0595ddcf1fbddf1859b1836e5a58f3da26436ef9c431

See more details on using hashes here.
