This package boosts a sparse matrix multiplication followed by selecting the top-n multiplication results.

Project description

sparse_dot_topn

sparse_dot_topn provides a fast way to perform a sparse matrix multiplication followed by top-n selection of the multiplication results.

Comparing very large feature vectors and picking the best matches in practice often comes down to performing a sparse matrix multiplication followed by selecting the top-n multiplication results.

sparse_dot_topn provides a (parallelised) sparse matrix multiplication implementation that integrates selecting the top-n values, resulting in a significantly lower memory footprint and improved performance. On an Apple M2 Pro, over two 20k x 193k TF-IDF matrices, sparse_dot_topn can be up to 6 times faster when retaining the top 10 values per row and utilising 8 cores. See the benchmark directory for details.

Usage

sp_matmul_topn supports {CSR, CSC, COO} matrices with {32, 64}bit {int, float} data. Note that COO and CSC inputs are converted to the CSR format and are therefore slower. Two options to further reduce memory requirements are threshold and density. Optionally, the values can be sorted such that the first column for a given row contains the largest value. Note that sp_matmul_topn(A, B, top_n=B.shape[1]) is equal to sp_matmul(A, B) and A.dot(B).

If you are migrating from v0.* please see the migration guide below for details.

import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul, sp_matmul_topn

A = sparse.random(1000, 100, density=0.1, format="csr")
B = sparse.random(100, 2000, density=0.1, format="csr")

# Compute C and retain the top 10 values per row
C = sp_matmul_topn(A, B, top_n=10)

# or parallelised matrix multiplication without top-n selection
C = sp_matmul(A, B, n_threads=2)
# or with top-n selection
C = sp_matmul_topn(A, B, top_n=10, n_threads=2)

# If you are only interested in values above a certain threshold
C = sp_matmul_topn(A, B, top_n=10, threshold=0.8)

# If you set the threshold, we cannot easily determine the number of non-zero
# entries beforehand. Therefore, we allocate memory for `ceil(top_n * A.shape[0] * density)`
# non-zero entries. You can set the expected density to reduce the number of pre-allocated
# entries. Note that if we allocate too little, one or more expensive copies will need to happen.
C = sp_matmul_topn(A, B, top_n=10, threshold=0.8, density=0.1)
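
As a minimal sketch of the sort option (reusing A and B from the snippet above): with sort=True the stored values of each CSR row are in descending order, so the first entry of a row is that row's largest value.

import numpy as np

C = sp_matmul_topn(A, B, top_n=10, sort=True)
for i in range(C.shape[0]):
    row = C.data[C.indptr[i]:C.indptr[i + 1]]
    assert np.all(row[:-1] >= row[1:])  # each row is sorted in descending order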

Installation

sparse_dot_topn provides wheels for CPython 3.8 to 3.12 for:

  • Windows (64bit)
  • Linux (64bit)
  • MacOS (x86 and ARM)
pip install sparse_dot_topn

sparse_dot_topn relies on a C++ extension for the computationally intensive multiplication routine. Note that the wheels vendor OpenMP with the extension to provide parallelisation out of the box. This may cause issues when used in combination with other libraries that ship OpenMP, such as PyTorch. If you run into any issues with OpenMP, see INSTALLATION.md for help or run the function without specifying the n_threads argument.

Installing from source requires a C++17 compatible compiler. If you have a compiler available, it is advised to install from source, as this enables architecture-specific optimisations.

You can install from source using:

pip install sparse_dot_topn --no-binary sparse_dot_topn

Build configuration

sparse_dot_topn provides some configuration options when building from source. Building from source can enable architecture-specific optimisations and is recommended for those who have a C++ compiler installed. See INSTALLATION.md for details.

Distributing the top-n multiplication of two large O(10M+) sparse matrices over a cluster

The top-n multiplication of two large O(10M+) sparse matrices can be broken down into smaller chunks. For example, one may want to split sparse matrices into matrices with just 1M rows, and do the (top-n) multiplication of all those matrix pairs. Reasons to do this are to reduce the memory footprint of each pair, and to employ available distributed computing power.

The pairs can be distributed and calculated over a cluster (e.g. we use a Spark cluster). The resulting matrix products are then zipped and stacked in order to reproduce the full matrix product.

Here's an example of how to do this, where we are matching 1000 rows in sparse matrix A against 600 rows in sparse matrix B, and both A and B are split into chunks.

import numpy as np
import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul_topn, zip_sp_matmul_topn

# 1a. Example matching 1000 rows in sparse matrix A against 600 rows in sparse matrix B.
rng = np.random.default_rng(42)  # random generator; seed chosen arbitrarily
A = sparse.random(1000, 2000, density=0.1, format="csr", dtype=np.float32, random_state=rng)
B = sparse.random(600, 2000, density=0.1, format="csr", dtype=np.float32, random_state=rng)

# 1b. Reference full matrix product with top-n
C_ref = sp_matmul_topn(A, B.T, top_n=10, threshold=0.01, sort=True)

# 2a. Split the sparse matrices. Here A is split into five parts, and B into three parts.
As = [A[i*200:(i+1)*200] for i in range(5)]
Bs = [B[:100], B[100:300], B[300:]]

# 2b. Perform the top-n multiplication of all sub-matrix pairs, here in a double loop.
# E.g. all sub-matrix pairs could be distributed over a cluster and multiplied there.
Cs = [[sp_matmul_topn(Aj, Bi.T, top_n=10, threshold=0.01, sort=True) for Bi in Bs] for Aj in As]

# 2c. top-n zipping of the C-matrices, done over the index of the B sub-matrices.
Czip = [zip_sp_matmul_topn(top_n=10, C_mats=Cis) for Cis in Cs]

# 2d. stacking over zipped C-matrices, done over the index of the A sub-matrices
# The resulting matrix C equals C_ref.
C = sparse.vstack(Czip, dtype=np.float32)
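
The double loop in step 2b is where the distribution happens; the zip and stack steps are unchanged regardless of where the sub-products are computed. As a minimal local sketch, using Python's multiprocessing purely as a stand-in for a real cluster and reusing As, Bs and the imports from the example above (the helper name _chunk_topn is illustrative):

from itertools import product
from multiprocessing import Pool

def _chunk_topn(pair):
    # Top-n product of one (A chunk, B chunk) pair; runs in a worker process.
    Aj, Bi = pair
    return sp_matmul_topn(Aj, Bi.T, top_n=10, threshold=0.01, sort=True)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        flat = pool.map(_chunk_topn, product(As, Bs))
    # Regroup the flat result list into one row of C-matrices per A chunk,
    # then zip over B chunks and stack over A chunks as in steps 2c and 2d.
    Cs = [flat[i * len(Bs):(i + 1) * len(Bs)] for i in range(len(As))]
    Czip = [zip_sp_matmul_topn(top_n=10, C_mats=Cis) for Cis in Cs]
    C = sparse.vstack(Czip, dtype=np.float32)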

Migrating to v1

sparse_dot_topn v1 is a significant change from v0.*, with new bindings and a new API. The new version adds support for CPython 3.12 and now supports both integer and floating-point data. Internally we switched to a max-heap to collect the top-n values, which significantly reduces the memory footprint. The former implementation had O(n_columns) complexity for the top-n selection, where we now have O(top_n) complexity. awesome_cossim_topn has been deprecated and will be removed in a future version.

Users should switch to sp_matmul_topn, which is largely compatible.

For example:

C = awesome_cossim_topn(A, B, ntop=10)

can be replicated using:

C = sp_matmul_topn(A, B, top_n=10, threshold=0.0, sort=True)

API changes

  1. ntop has been renamed to top_n
  2. lower_bound has been renamed to threshold
  3. use_threads and n_jobs have been combined into n_threads
  4. return_best_ntop option has been removed
  5. test_nnz_max option has been removed
  6. B is auto-transposed when its shape is not compatible but its transpose is (see the sketch after this list).
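
For example, item 6 means a call like the following works even though the shapes only line up after transposing B; a minimal sketch relying on the auto-transpose behaviour described above:

import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul_topn

A = sparse.random(1000, 100, density=0.1, format="csr")
B = sparse.random(2000, 100, density=0.1, format="csr")

# A (1000 x 100) is not compatible with B (2000 x 100), but it is with
# B.T (100 x 2000), so B is transposed automatically.
C = sp_matmul_topn(A, B, top_n=10)
assert C.shape == (1000, 2000)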

The output of return_best_ntop can be replicated with:

import numpy as np

C = sp_matmul_topn(A, B, top_n=10)
best_ntop = np.diff(C.indptr).max()

Default changes

  1. threshold is no longer 0.0 but disabled (None) by default

This enables proper functioning for matrices that contain negative values. Additionally, a different data structure is used internally when collecting the non-zero results, one with a much lower memory footprint than before. This means that the effect of the threshold parameter on performance and memory requirements is negligible. If the threshold is None, we pre-compute the number of non-zero entries; this can significantly reduce the required memory at a mild (~10%) performance penalty.
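
As a minimal sketch of why this matters, take inputs whose products can be negative (here data_rvs draws the non-zero values from [-1, 1]); with the default threshold=None the top-n values of each row are kept even when they are negative, whereas a 0.0 threshold would drop all of them:

import numpy as np
import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul_topn

rng = np.random.default_rng(42)  # seed chosen arbitrarily

def signed_uniform(n):
    # Non-zero values in [-1, 1], so the products can be negative.
    return rng.uniform(-1.0, 1.0, n)

A = sparse.random(100, 50, density=0.2, format="csr", random_state=rng, data_rvs=signed_uniform)
B = sparse.random(50, 80, density=0.2, format="csr", random_state=rng, data_rvs=signed_uniform)

# threshold is None by default: negative top-n values are retained.
C = sp_matmul_topn(A, B, top_n=10)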

  2. sort = False, the result matrix is no longer sorted by default

The matrix is returned with the same column order as if no filtering of the top-n results had taken place. This means that when you set top_n equal to the number of columns of B, you obtain the same result as a normal multiplication, i.e. sp_matmul_topn(A, B, top_n=B.shape[1]) is equal to A.dot(B).
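
A minimal check of that equivalence (the dense round-trip is only there for the comparison):

import numpy as np
import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul_topn

A = sparse.random(1000, 100, density=0.1, format="csr")
B = sparse.random(100, 2000, density=0.1, format="csr")

# top_n equal to B's column count filters nothing, so the result matches
# the plain sparse product.
C = sp_matmul_topn(A, B, top_n=B.shape[1])
assert np.allclose(C.toarray(), A.dot(B).toarray())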

Contributing

Contributions are very welcome, please see CONTRIBUTING for details.

Contributors

This package was developed and is maintained by authors (previously) affiliated with ING Analytics Wholesale Banking Advanced Analytics. The original implementation was based on a modified version of SciPy's CSR multiplication implementation. You can read about it in a blog (mirror) written by Zhe Sun.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sparse-dot-topn-1.1.5.tar.gz (45.4 kB)

Uploaded Source

Built Distributions

sparse_dot_topn-1.1.5-cp312-abi3-win_amd64.whl (420.4 kB)

Uploaded CPython 3.12+ Windows x86-64

sparse_dot_topn-1.1.5-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (264.9 kB)

Uploaded CPython 3.12+ manylinux: glibc 2.17+ x86-64

sparse_dot_topn-1.1.5-cp312-abi3-macosx_12_0_x86_64.whl (506.4 kB)

Uploaded CPython 3.12+ macOS 12.0+ x86-64

sparse_dot_topn-1.1.5-cp312-abi3-macosx_12_0_arm64.whl (443.4 kB)

Uploaded CPython 3.12+ macOS 12.0+ ARM64

sparse_dot_topn-1.1.5-cp311-cp311-win_amd64.whl (421.7 kB)

Uploaded CPython 3.11 Windows x86-64

sparse_dot_topn-1.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (266.7 kB)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

sparse_dot_topn-1.1.5-cp311-cp311-macosx_12_0_x86_64.whl (507.2 kB)

Uploaded CPython 3.11 macOS 12.0+ x86-64

sparse_dot_topn-1.1.5-cp311-cp311-macosx_12_0_arm64.whl (443.8 kB)

Uploaded CPython 3.11 macOS 12.0+ ARM64

sparse_dot_topn-1.1.5-cp310-cp310-win_amd64.whl (421.9 kB)

Uploaded CPython 3.10 Windows x86-64

sparse_dot_topn-1.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (266.8 kB)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

sparse_dot_topn-1.1.5-cp310-cp310-macosx_12_0_x86_64.whl (506.9 kB)

Uploaded CPython 3.10 macOS 12.0+ x86-64

sparse_dot_topn-1.1.5-cp310-cp310-macosx_12_0_arm64.whl (443.9 kB)

Uploaded CPython 3.10 macOS 12.0+ ARM64

sparse_dot_topn-1.1.5-cp39-cp39-win_amd64.whl (423.0 kB)

Uploaded CPython 3.9 Windows x86-64

sparse_dot_topn-1.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (266.8 kB)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

sparse_dot_topn-1.1.5-cp39-cp39-macosx_12_0_x86_64.whl (507.0 kB)

Uploaded CPython 3.9 macOS 12.0+ x86-64

sparse_dot_topn-1.1.5-cp39-cp39-macosx_12_0_arm64.whl (444.0 kB)

Uploaded CPython 3.9 macOS 12.0+ ARM64

sparse_dot_topn-1.1.5-cp38-cp38-win_amd64.whl (443.2 kB)

Uploaded CPython 3.8 Windows x86-64

sparse_dot_topn-1.1.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (266.7 kB)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

sparse_dot_topn-1.1.5-cp38-cp38-macosx_12_0_x86_64.whl (505.9 kB)

Uploaded CPython 3.8 macOS 12.0+ x86-64

sparse_dot_topn-1.1.5-cp38-cp38-macosx_12_0_arm64.whl (443.9 kB)

Uploaded CPython 3.8 macOS 12.0+ ARM64

File details

Details for the file sparse-dot-topn-1.1.5.tar.gz.

File metadata

  • Download URL: sparse-dot-topn-1.1.5.tar.gz
  • Upload date:
  • Size: 45.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for sparse-dot-topn-1.1.5.tar.gz
Algorithm Hash digest
SHA256 4cc100e1d26b5a675faf89267ea4629cbedebb70edef6154b374ffaf93f3f040
MD5 2f8629f3eb738431aa3db65e21ff29bd
BLAKE2b-256 b3084100f54cd89779064cf943540cd96fcfb31ea0b0cf4f1133d3e44b692c35

File details

Details for the file sparse_dot_topn-1.1.5-cp312-abi3-win_amd64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp312-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 1c08ab8c2c5c834118a2be1274ab116d4f3470d14665e8771e98e89478d8fadb
MD5 a80641f206d90f7d64bec2b67aed2321
BLAKE2b-256 aa36f5574123f89488022e91662775268bbd8691ef3beb66f5ad7e9ce525055b

File details

Details for the file sparse_dot_topn-1.1.5-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b5d57ed9948363be47dafe6925ed70254597671cbeb08d43e443a0f444b03601
MD5 48281c8131c54d6d975496fdea2474ad
BLAKE2b-256 1d8f724a144122fcd7276068e389ad900f7c9138d00df5686957bed3ff790dd3

File details

Details for the file sparse_dot_topn-1.1.5-cp312-abi3-macosx_12_0_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp312-abi3-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 f7db995582f07b5c9061c519027abbdc6984370bb6925b423a71f719d3553f0e
MD5 1c0cda3d0953a2259d90d80bc0c4d566
BLAKE2b-256 240590fcab6ef509f600e5f548f39753e713b631f6d2add040d09a4282a9367d

File details

Details for the file sparse_dot_topn-1.1.5-cp312-abi3-macosx_12_0_arm64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp312-abi3-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 a6cf5b0c47af4a909839dec20f2b51eebb95a2d01c0a41a480fcc9b2b8ea0b3c
MD5 55dc68b1609ffa558906639489f9b00b
BLAKE2b-256 06ffc50653a3226599de2691eb29b8908c51d0d83159e001f38e8c84c4ea57a6

File details

Details for the file sparse_dot_topn-1.1.5-cp311-cp311-win_amd64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 719b39d560a5690677f620257dde543f0418f92ea3627b13debbf9424e9b6cd3
MD5 d5c022b42a6e747d61232c0ff1dba318
BLAKE2b-256 3edc44fc87ec2829c31db16de5a4593eb3fe33441193bac893b7631695b83429

File details

Details for the file sparse_dot_topn-1.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 56e9b6ad5590b9f0253ad7258b3225a4d441f15646a68a6046bc73a29f969c2e
MD5 5c2fd0a311e1ce8a477abe4b35d1e421
BLAKE2b-256 ae44d27847c16ba5eac5607c2bb42b4a3fd685d084fd4d1293716c54aa965977

File details

Details for the file sparse_dot_topn-1.1.5-cp311-cp311-macosx_12_0_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp311-cp311-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 989c98a49be6fceaca5d1bf9a9c0f9ff8853a2c13bd8861c91f0340fa42078d0
MD5 28b54a2d07d23c39909059e4937cd5da
BLAKE2b-256 872365aa1ff3c8cbb6a25d22528e9aef7087f308f04723f50e62f0cdf44229b6

File details

Details for the file sparse_dot_topn-1.1.5-cp311-cp311-macosx_12_0_arm64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp311-cp311-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 d8f93312c5ba9ce881f7c7722e2debe123b2b3bcef3f808e54502841b589afc1
MD5 8ac2adde82ed0e41c3875bca26a73732
BLAKE2b-256 5838b33af35844836843d36e5b886c1ca2cb961e5510d058a258b1ff13b0fbbd

File details

Details for the file sparse_dot_topn-1.1.5-cp310-cp310-win_amd64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 97456c07cd134b228a945a95736f53bf47dc63209b9a7270c5dc65e60f0e8914
MD5 594a73c2f1e783a695b1186edb163492
BLAKE2b-256 55d02602375a0637cdbe22ec5b99dad79faf387ba140aa93f1941224d912ba74

File details

Details for the file sparse_dot_topn-1.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 47403b158b322c8c958b32ba19df8e9692d254d75cdea69beb86866272659fcd
MD5 2f0f2df1536e989534872d3faecca1c0
BLAKE2b-256 1f7e2c231843b1daf26f5cce83395dee261430637e69dd512615a62fc9daf12a

File details

Details for the file sparse_dot_topn-1.1.5-cp310-cp310-macosx_12_0_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp310-cp310-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 0115ec5058bd34eb6c5b23921cdb7c518e372978b210fb9b46b6e361b111cd59
MD5 af0a2b3592ed7fafd583db6b9412932b
BLAKE2b-256 39bb7245688cb43084fa2833b5a9e12cb67649cf6c3107891624f6d36a467d97

File details

Details for the file sparse_dot_topn-1.1.5-cp310-cp310-macosx_12_0_arm64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp310-cp310-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 57290295934daed47cb60766e6a90c39ccbda05a7ff07386f558c458eee85177
MD5 1646dec6f299dbd0d19590c33eb7d7e7
BLAKE2b-256 33fe09cd16f6420becd076adbaf26cc404fb02b100b91e5286e517c17c045711

File details

Details for the file sparse_dot_topn-1.1.5-cp39-cp39-win_amd64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 fdcf79726e4f410a2778fda172c0bec332c1d783db4668fa00e8de7898160117
MD5 00629206132712682cf7b3903de0b722
BLAKE2b-256 6ed504da5cf9a759f0b1da207ce1c8f914b6a3aad9c30ae8e6d9f5b457a82f14

File details

Details for the file sparse_dot_topn-1.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 812e4edb7ef1920770c8295d47b7fc4d692ecd2e3d82c756768dd177104f1066
MD5 884a32830307c29ac54f2def5e32d9cc
BLAKE2b-256 d3e4ecec0715ea1e3b81a41ff491c5d444b692827ce8b454931c1b892ca3fd22

File details

Details for the file sparse_dot_topn-1.1.5-cp39-cp39-macosx_12_0_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp39-cp39-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 7638e56ae6c8afda934306d051f209020a4fc505ab32159545781656f6cd9cc9
MD5 9aa1cb2d6729ab0f23bda53698ddb233
BLAKE2b-256 3a844d0b840c7fd99b52f03f371218d7e9ee8bcfc4341bd2afbcdb973befe791

File details

Details for the file sparse_dot_topn-1.1.5-cp39-cp39-macosx_12_0_arm64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp39-cp39-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 a5d9c31ba9849d7031e859fefdb54b8f9a9bacc3de09a4f721da633afae9aee9
MD5 38f34562c230425322be012b3e5dfa11
BLAKE2b-256 bd64a25b54f5b6f7185fec7e1627a6d821b702d68eb47d589070a5ebb96ad0be

File details

Details for the file sparse_dot_topn-1.1.5-cp38-cp38-win_amd64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 2f265dc6caf109a139f3662ba13b44b892d726c838c3c68071c70df76cc4544e
MD5 76573a124da61ed8ad8fc8c006064956
BLAKE2b-256 531bfc0fbb400b1e86d56a92bb444b56effce42b33028dca8d84b629c797f56a

File details

Details for the file sparse_dot_topn-1.1.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f0f94efe145282248c2b664b3d3d274761da6e2ff6bab6910a5816e709e9cdce
MD5 a65cf840d49b2f4c022ac688facdfd46
BLAKE2b-256 b09bda4a81803560ccb40245da704ba6dd78de9e9dc6c054f0140c15ff7676cd

File details

Details for the file sparse_dot_topn-1.1.5-cp38-cp38-macosx_12_0_x86_64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp38-cp38-macosx_12_0_x86_64.whl
Algorithm Hash digest
SHA256 95157ed7bb4c86f641beeae9efc1b76423f2765bdcc4b4408544e2039553e8c7
MD5 b694292a198734cfdd114f579b03b021
BLAKE2b-256 8a9b3d2a4fc0b75f56c1174a387d39b56407519f07b9de23ac69738bc4a2a086

File details

Details for the file sparse_dot_topn-1.1.5-cp38-cp38-macosx_12_0_arm64.whl.

File hashes

Hashes for sparse_dot_topn-1.1.5-cp38-cp38-macosx_12_0_arm64.whl
Algorithm Hash digest
SHA256 a42c122e320419e6ba1bebd2128beb769b9a4aac09abc9eadb177832cd937924
MD5 07c7b77a21e75a7f6ad939eee697535c
BLAKE2b-256 1dfe5878a023457b7a3c561d9ac7fc5641b35e713856657bf8f25e3dfb654f52
