Efficient Pairwise Cosine Similarity Computation

The (i, j)-entry of the output matrix is the cosine similarity between the i-th row of A and the j-th row of B. This function is only a wrapper: it uses the implementation of cosine_similarity from scikit-learn and the implementation of awesome_cossim_topn from sparse_dot_topn. For more details, please check the documentation of those two packages.

To install this package:

pip install effcossim

Sample code:

from numpy import array
from effcossim.pcs import pairwise_cosine_similarity, pp_pcs

A = array([
    [1, 2, 3], 
    [0, 1, 2],
    [5, 1, 1]
])

B = array([
    [1, 1, 2], 
    [0, 1, 2],
    [5, 0, 1], 
    [0, 0, 4]
])

# scikit-learn implementation
M1 = pairwise_cosine_similarity(
    A=A, B=B, 
    efficient=False, 
    dense_output=True
)

# sparse_dot_topn implementation
M2 = pairwise_cosine_similarity(
    A=A, B=B, 
    efficient=True, 
    n_top=4, 
    lower_bound=0.5, 
    n_jobs=2, 
    dense_output=True
)
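As a reference for what each entry of the dense result contains, here is a minimal pure-Python sketch of pairwise cosine similarity on the same A and B. The helper name cosine_similarity_matrix is hypothetical and is not part of effcossim; treating zero vectors as having similarity 0 is an assumption made for this sketch.

```python
import math


def cosine_similarity_matrix(A, B):
    """Return M where M[i][j] is the cosine similarity of row A[i] and row B[j]."""

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    norms_a = [norm(a) for a in A]
    norms_b = [norm(b) for b in B]

    M = []
    for a, na in zip(A, norms_a):
        row = []
        for b, nb in zip(B, norms_b):
            dot = sum(x * y for x, y in zip(a, b))
            # Assumption for this sketch: a zero vector has similarity 0 with everything.
            row.append(dot / (na * nb) if na and nb else 0.0)
        M.append(row)
    return M


A = [[1, 2, 3], [0, 1, 2], [5, 1, 1]]
B = [[1, 1, 2], [0, 1, 2], [5, 0, 1], [0, 0, 4]]

# M has one row per row of A and one column per row of B (3 x 4 here).
M = cosine_similarity_matrix(A, B)
```

The rows [0, 1, 2] of A and B are identical, so the entry M[1][1] equals 1 up to floating-point rounding.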

When efficient=True, each row of the output matrix retains only its top n_top entries above lower_bound, which reduces memory usage. Furthermore, if n_jobs is larger than 1, the computation runs in parallel, which increases speed.
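The filtering described above can be illustrated on a small dense matrix. This is only a sketch of the semantics, not of how sparse_dot_topn computes the result (it works on sparse matrices and avoids building the full dense product). The function name keep_top_n is hypothetical, and the strict comparison against lower_bound is an assumption based on the wording "above lower_bound".

```python
def keep_top_n(M, n_top, lower_bound):
    """Zero out every entry that is not among the n_top largest entries
    of its row that are strictly greater than lower_bound."""
    filtered = []
    for row in M:
        # Column indices whose value passes the threshold (assumed strict).
        candidates = [j for j, v in enumerate(row) if v > lower_bound]
        # Keep at most n_top of them, largest values first.
        top = sorted(candidates, key=lambda j: row[j], reverse=True)[:n_top]
        keep = set(top)
        filtered.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return filtered


# Approximate similarities for the sample A and B, rounded to two decimals.
M = [
    [0.98, 0.96, 0.42, 0.80],
    [0.91, 1.00, 0.18, 0.89],
    [0.63, 0.26, 0.98, 0.19],
]

F = keep_top_n(M, n_top=2, lower_bound=0.5)
```

With n_top=2, the first row drops 0.80 even though it passes the threshold, while the last row keeps only the two entries above 0.5. In a sparse output the zeroed entries would simply not be stored.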

If multiple comparisons are required, the parallel implementation pp_pcs can be used.

from scipy.sparse import random

l1 = [random(m=10000, n=1000, density=0.3) for _ in range(6)]
l2 = [random(m=10000, n=1000, density=0.3) for _ in range(6)]

L = pp_pcs(
    l1=l1, 
    l2=l2, 
    n_workers=2, 
    efficient=True, 
    n_top=10, 
    lower_bound=0.3, 
    n_jobs=2, 
    dense_output=False
)

The output is a list where the k-th element is the output of

pairwise_cosine_similarity(l1[k], l2[k])
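The element-wise pairing described above follows a common pattern: map a two-argument function over two lists with a pool of workers, keeping the results in input order. The sketch below shows that pattern only. How pp_pcs is implemented internally is not stated here, so the use of a thread pool, the helper name map_pairs, and the stand-in function dot are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def map_pairs(func, l1, l2, n_workers=2):
    """Return [func(l1[0], l2[0]), func(l1[1], l2[1]), ...] using a worker pool."""
    if len(l1) != len(l2):
        raise ValueError("l1 and l2 must have the same length")
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map preserves input order, so result k belongs to (l1[k], l2[k]).
        return list(pool.map(func, l1, l2))


def dot(u, v):
    """Stand-in for the per-pair computation (hypothetical, for illustration)."""
    return sum(x * y for x, y in zip(u, v))


l1 = [[1, 2], [3, 4], [5, 6]]
l2 = [[1, 0], [0, 1], [1, 1]]

results = map_pairs(dot, l1, l2, n_workers=2)
```

Here results[k] is dot(l1[k], l2[k]), which mirrors how the k-th element of the pp_pcs output corresponds to the k-th pair of inputs.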

For further examples, check the notebook.

Source Distribution

effcossim-1.0.4.tar.gz (4.3 kB)