Skip to main content

Fast pairwise cosine distance calculation and numba accelerated evolutionary matrix subset extraction

Project description

Logo

/sʌb.pɛɹ/

SubPair CI

"All you need is love and evolutionary matrix subset extraction." - J. Lennon

Pairwise cosine distance is great to easily compare many vectors. However, you can end up with a very sizeable distance matrix. What if you would like to find a small subset of that matrix? Let's search it by evolution.

Given N elements and their (N,N) pairwise distance matrix we would like to get the subset of S elements such that the sum of elements in the corresponding (S,S) submatrix is minimal. See example below.

  [0  1  2  3  4] indeces 
      i  j     k    
      │  │     │          i j k   = [1, 2, 4]
   0  1  6  4  1                   
i──1  0  3  1  7       i  0 3 7     
j──6  3  0  2  3  -->  j  3 0 3  -->  7 + 3 + 3 = 13 👎
   4  1  2  0  1       k  7 3 0
k──1  7  3  1  0

         i  j  k    
         │  │  │          i j k  = [2, 3, 4]   
   0  1  6  4  1                   
   1  0  3  1  7       i  0 2 3     
i──6  3  0  2  3  -->  j  2 0 1  -->  2 + 1 + 3 = 6 👍
j──4  1  2  0  1       k  3 1 0
k──1  7  3  1  0

All the possible subsets are ${N}\choose{S}$ and for N = 1024, S = 20 (like in the tests) we would have to check ${1024}\choose{20}$ $= 5.479 \times 10^{41}$ of them.

A few too many. Instead we are going to use an evolutionary approach to search for it.

Example usage

The usage is quite straight forward since there are only a couple of functions exported pairwise_cosine and extract.

>>> import matplotlib.pyplot as plt
>>> from subpair import pairwise_cosine
>>>
>>> X = np.random.rand(N, K).astype(np.float32)
>>> distances = pairwise_cosine(X) # (N,N)
>>> ...
>>> best, stats = extract(distances, P=200, S=S, K=50, M=3, O=2, its=3_000)
100%|█████████████████████████████████| 3000/3000 [00:03<00:00, 817.42it/s]
>>> plt.plot(stats["fits"]); plt.show()

Logo

(We have sprinkled a few negative numbers to see if the algorithm can find them)

Where the options of extract are parameters for the evolutionary algorithm:
distances (int, int) : N vectors of length L
        P (int)      : population size
        S (int)      : desired subset size <- determines size of output
        K (int)      : number of parents (P-K children)
        M (int)      : number of mutations
        O (int)      : fraction of crossovers e.g. O=2 -> 1/2, O=10 -> 1/10, (bigger=faster)

Note

This repo contains both numpy and numba/CUDA versions of the pairwise cosine distance matrix calculation. But numpy is already blazingly fast so the cuda version is provided mostly for inspiration. Our numpy version is very similar to sklearn's metrics.pairwise.cosine_distances but slightly faster. Sklearn's one has some extra nicities that our simplified version does not have.

> python flops.py # On Macbook pro M1 Max
N=513 K=2304 GOPs=1
  sklearn: 0.01s - 109.4 GFLOPS
    numpy: 0.00s - 162.4 GFLOPS

N=1027 K=2304 GOPs=2
  sklearn: 0.02s - 135.9 GFLOPS
    numpy: 0.01s - 192.4 GFLOPS

N=2055 K=2304 GOPs=10
  sklearn: 0.07s - 142.9 GFLOPS
    numpy: 0.06s - 166.0 GFLOPS

N=4111 K=2304 GOPs=39
  sklearn: 0.20s - 195.8 GFLOPS
    numpy: 0.16s - 248.6 GFLOPS

N=8223 K=2304 GOPs=156
  sklearn: 0.61s - 255.3 GFLOPS
    numpy: 0.54s - 289.5 GFLOPS

N=16447 K=2304 GOPs=623
  sklearn: 2.11s - 295.4 GFLOPS
    numpy: 1.79s - 347.9 GFLOPS

Todo

  • Add type info to minimize.py to allow for AOT compilation.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

subpair-0.1.1.tar.gz (7.6 kB view hashes)

Uploaded Source

Built Distribution

subpair-0.1.1-py3-none-any.whl (7.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page