Efficient Pairwise Cosine Similarity Computation
The pairwise_cosine_similarity function returns a matrix whose (i, j)-entry is the cosine similarity between the i-th row of A and the j-th row of B. The function is only a wrapper: it uses the implementation of cosine_similarity from scikit-learn and the implementation of awesome_cossim_topn from sparse_dot_topn. For more details, please check:
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
- https://github.com/ing-bank/sparse_dot_topn
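To make the definition of the output matrix concrete, here is a minimal NumPy sketch of the same quantity for dense inputs. The helper name cosine_similarity_matrix is hypothetical and is not part of effcossim, scikit-learn, or sparse_dot_topn; it only illustrates what each entry means and assumes no row is all zeros.

```python
import numpy as np


def cosine_similarity_matrix(A, B):
    """Return M with M[i, j] = cosine similarity of row A[i] and row B[j].

    Hypothetical helper for illustration only; assumes dense 2-D inputs
    with no all-zero rows.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Scale every row to unit Euclidean length.
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    # Dot products of unit vectors are cosine similarities.
    return A_unit @ B_unit.T


A = np.array([[1, 2, 3], [0, 1, 2], [5, 1, 1]])
B = np.array([[1, 1, 2], [0, 1, 2], [5, 0, 1], [0, 0, 4]])
M = cosine_similarity_matrix(A, B)
print(M.shape)  # (3, 4)
```

Row 1 of A and row 1 of B are identical, so M[1, 1] is 1 up to floating-point error.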
To install this package:
pip install effcossim
Sample code:
from numpy import array
from effcossim.pcs import pairwise_cosine_similarity, pp_pcs

A = array([
    [1, 2, 3],
    [0, 1, 2],
    [5, 1, 1]
])
B = array([
    [1, 1, 2],
    [0, 1, 2],
    [5, 0, 1],
    [0, 0, 4]
])

# scikit-learn implementation
M1 = pairwise_cosine_similarity(
    A=A, B=B,
    efficient=False,
    dense_output=True
)

# sparse_dot_topn implementation
M2 = pairwise_cosine_similarity(
    A=A, B=B,
    efficient=True,
    n_top=4,
    lower_bound=0.5,
    n_jobs=2,
    dense_output=True
)
When efficient=True, only the top n_top entries above lower_bound are retained in each row of the output matrix, which reduces memory usage. Furthermore, if n_jobs is larger than 1, the computation runs in parallel, which increases speed.
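The row-wise filtering described above can be sketched on a small dense matrix. The helper keep_top_n below is hypothetical and written in plain NumPy; it is not how sparse_dot_topn is implemented, and treating lower_bound as a strict bound is an assumption made for this illustration.

```python
import numpy as np


def keep_top_n(S, n_top, lower_bound):
    """Keep, per row, only the n_top largest entries that exceed lower_bound.

    Hypothetical illustration; every other entry is set to zero.
    """
    S = np.asarray(S, dtype=float)
    out = np.zeros_like(S)
    for i, row in enumerate(S):
        # Indices of the n_top largest values in this row, largest first.
        top = np.argsort(row)[::-1][:n_top]
        # Among those, keep only the ones strictly above the bound (assumed strict).
        keep = top[row[top] > lower_bound]
        out[i, keep] = row[keep]
    return out


S = np.array([
    [0.9, 0.2, 0.7, 0.6],
    [0.1, 0.4, 0.3, 0.8],
])
T = keep_top_n(S, n_top=2, lower_bound=0.5)
```

In the first row, 0.6 is above the bound but is dropped because it is not among the top two entries; in the second row, 0.4 is among the top two but is dropped because it is below the bound. Storing only the surviving entries in a sparse matrix is what keeps memory usage low on large inputs.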
If several pairs of matrices need to be compared, the parallel implementation pp_pcs can be used:
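# assumption: random is scipy.sparse.random, which matches the m, n, density arguments
from scipy.sparse import random
from effcossim.pcs import pp_pcs

# six pairs of random sparse matrices
l1 = [random(m=10000, n=1000, density=0.3) for _ in range(6)]
l2 = [random(m=10000, n=1000, density=0.3) for _ in range(6)]

L = pp_pcs(
    l1=l1,
    l2=l2,
    n_workers=2,
    efficient=True,
    n_top=10,
    lower_bound=0.3,
    n_jobs=2,
    dense_output=False
)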
The output is a list where the k-th element is the output of pairwise_cosine_similarity(l1[k], l2[k]).
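The pairing of list elements can be illustrated with a minimal sketch that maps a similarity function over two lists using a thread pool. Both compare_pairs and cosine_similarity_matrix are hypothetical names used only here; this is not the actual implementation of pp_pcs, whose internals are not described in this document.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def cosine_similarity_matrix(A, B):
    """Dense cosine similarity between rows of A and rows of B (no zero rows)."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T


def compare_pairs(l1, l2, n_workers=2):
    """Return a list whose k-th element compares l1[k] with l2[k]."""
    if len(l1) != len(l2):
        raise ValueError("l1 and l2 must have the same length")
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in input order, so index k matches pair k.
        return list(pool.map(cosine_similarity_matrix, l1, l2))


rng = np.random.default_rng(0)
l1 = [rng.random((5, 3)) for _ in range(4)]
l2 = [rng.random((7, 3)) for _ in range(4)]
results = compare_pairs(l1, l2, n_workers=2)
```

Each element of results is a 5 by 7 matrix, and results[k] equals the result of comparing l1[k] with l2[k] directly.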
For further examples, check the notebook.
Source Distribution
File details
Details for the file effcossim-1.0.4.tar.gz.
File metadata
- Download URL: effcossim-1.0.4.tar.gz
- Upload date:
- Size: 4.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/52.0.0.post20210125 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.6.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6d25b9a2ab1d42f2d9e8e41417c749b7df327c5289a2b359ff53f80a5ca53d0c
MD5 | 022283ac680b122258eab51cb52e040b
BLAKE2b-256 | b95e8e9d91f9ea4b9f8a5bf19cfd69ca48161e85db10da50b9d25dbac6ff208a