Skip to main content

Fast all-pairs cosine-similarity search via random projection sketches

Project description

sketchsort

Fast all-pairs cosine-similarity search via random projection sketches.

sketchsort finds pairs of high-dimensional float vectors whose cosine distance is below a given threshold. Here, the cosine distance is defined as 1 - cosine_similarity, where cosine_similarity = ⟨x, y⟩ / (‖x‖ · ‖y‖). Input vectors do not need to be pre-normalized; norms are computed internally.

The algorithm sketches vectors into binary sequences by random projection, then enumerates near-duplicate sketches using the multiple-sorting technique of Tabei et al. (2010). The missing-edge-ratio parameter (missing_ratio) controls how exhaustive this enumeration is. See the Python API section below for details.

Install

pip install sketchsort

Wheels are provided for CPython 3.9–3.13 on Linux (x86_64) and macOS arm64 (Apple Silicon). Intel macOS and Windows are not yet provided as wheels; on those platforms pip install sketchsort will fall back to a source build (requires a C++17 compiler and CMake).

Python API

import numpy as np
import sketchsort

X = np.loadtxt("dat/sample.txt", dtype=np.float32)   # shape (N, D)

pairs = sketchsort.search(
    X,
    cos_dist=0.01,         # report pairs with cosine distance <= 0.01
    missing_ratio=0.0001,  # target bound on expected missed true-neighbor fraction
    seed=42,
)

for id1, id2, d in pairs:
    print(id1, id2, d)

X must be a 2D NumPy array of shape (N, D). float32 is recommended; other floating-point dtypes are converted internally. id1 and id2 in the result are row indices in X.

sketchsort.search(...) returns a NumPy structured array with fields id1 (uint32), id2 (uint32), and cos_dist (float32). The output contains each pair at most once, never the self-pair (i, i). For a fixed seed the output is deterministic, but the order is the algorithm's internal enumeration order — not sorted by distance, and id1 < id2 is not guaranteed.

Smaller missing_ratio makes the search more exhaustive, reducing the upper bound on the expected fraction of missed true neighbor pairs at the cost of more time and memory. The bound is derived from the random-projection model and applies to the expectation, not to every individual run.

seed (default 0) seeds the random-number generator used to draw the random projection vectors. For a fixed build, two calls with the same seed, parameters, and inputs produce the same output. To reproduce the non-deterministic behaviour of upstream 0.0.8, pass a time-based seed explicitly, e.g. seed=int(time.time()).

Two additional optional parameters:

  • centering (default False) — when set to True, the coordinate-wise mean of X is subtracted from every row before both sketching and distance computation. In this mode, the reported cos_dist is the cosine distance between the mean-shifted vectors, not the raw input vectors. Recommended when input vectors are non-negative and share a strong bias (e.g. raw bag-of-words counts, histograms, molecular fingerprints — the original SketchSort use case).
  • verbose (default False) — when set to True, the underlying C++ core prints algorithm progress to stdout/stderr. Default is quiet; turn on for diagnostics.

Manual parameter control

For full control of the sketch enumeration, pass all of ham_dist, num_blocks, and num_chunks. Providing any of these switches the call into manual mode, where missing_ratio is ignored. If you specify only some of the three, the rest fall back to defaults (ham_dist=1, num_blocks=4, num_chunks=3), so it is safer to set them together:

pairs = sketchsort.search(
    X,
    cos_dist=0.01,
    ham_dist=1, num_blocks=4, num_chunks=3,
    seed=42,
)

File I/O

If you have a file in the legacy text format and want output compatible with the CMake-built C++ CLI from the same source tree, call:

sketchsort.run_from_file("input.txt", "output.txt", cos_dist=0.01, seed=42)

The input file is whitespace-separated float vectors, one per line, no ID column. The output file is id1 id2 cos_dist triples.

Command line

Installing the package also installs a sketchsort console script. It uses the same defaults as the Python API: automatic parameter selection unless you pass any of -hamdist / -numblocks / -numchunks.

# Typical: cos_dist + missing_ratio
sketchsort -cosdist 0.01 -missingratio 0.0001 -seed 42 input.txt output.txt

# Manual parameter control
sketchsort -cosdist 0.01 -hamdist 1 -numblocks 4 -numchunks 3 -seed 42 input.txt output.txt

Flags: -cosdist, -missingratio, -hamdist, -numblocks, -numchunks, -auto, -centering, -seed, -quiet. -auto forces automatic parameter selection even if any of -hamdist / -numblocks / -numchunks is also given. -centering subtracts the coordinate-wise mean from the input vectors before both sketching and distance computation (reported cos_dist is then between mean-shifted vectors).

Memory note

sketchsort.search(...) collects every reported pair into memory before returning. For large inputs at loose thresholds the result can be very large (tens of millions of pairs are realistic). If memory is a concern, use sketchsort.run_from_file(...) instead — it streams pairs to disk while the algorithm runs.

0.1.0 release notes (based on upstream 0.0.8)

Breaking changes:

  • Deterministic by default. v0.0.8 seeded the projection RNG with time(0) on every run, so results were non-reproducible. This release exposes a seed parameter (default 0) and removes the time-based seeding on every code path — C++ CLI, Python search(), Python run_from_file(), and the sketchsort console script. To reproduce the old non-deterministic behaviour pass an explicit time-based seed, e.g. sketchsort -seed $(date +%s) ....

  • exit() calls in the C++ core were replaced with std::runtime_error, surfaced to Python as RuntimeError. Callers that previously relied on the process exiting on bad input must now catch the exception.

New:

  • sketchsort.search(X, ...) Python API returning a NumPy structured array.
  • sketchsort.run_from_file(...) Python API for file-based pipelines.
  • sketchsort console script (entry point from pip install).
  • -seed <int> CLI flag (default 0).
  • -quiet CLI flag (Python entry point only) — suppresses algorithm progress output. Python search() / run_from_file() default to quiet.

Internal:

  • The C++ standard requirement moved from C++98 (-ansi) to C++17.

Build from source

Requires CMake ≥ 3.20 and a C++17 compiler.

pip install .

pip install -e ".[test]" for an editable install with the test deps, then pytest tests/.

A legacy Makefile is preserved under src/ for users who only want the C++ CLI without a Python toolchain:

cd src && make
./sketchsort -cosdist 0.01 -seed 42 ../dat/sample.txt out.txt

The Makefile build uses -ffast-math, which lets the compiler reorder floating-point operations. For most inputs the reported pair set is unchanged, but cos_dist text values can differ in the 5th–6th significant digit compared to the CMake/Python build, and pairs whose true distance is right at the threshold may also differ between builds. If you need output that matches the wheel/Python build exactly, build the CLI via CMake instead:

cmake -B build -DSKETCHSORT_BUILD_CLI=ON -DSKETCHSORT_BUILD_PYTHON=OFF \
               -DCMAKE_BUILD_TYPE=Release
cmake --build build --target sketchsort_cli
./build/sketchsort -cosdist 0.01 -seed 42 dat/sample.txt out.txt

Citation

If you use SketchSort in published work, please cite the original papers:

Methodology (the multiple-sorting technique):

Tabei, Y., Uno, T., Sugiyama, M., and Tsuda, K. (2010). Single versus Multiple Sorting in All Pairs Similarity Search. In Proceedings of the 2nd Asian Conference on Machine Learning (ACML 2010), JMLR Workshop and Conference Proceedings, 13: 145–160. PDF

@inproceedings{tabei2010sketchsort,
  title     = {Single versus Multiple Sorting in All Pairs Similarity Search},
  author    = {Tabei, Yasuo and Uno, Takeaki and Sugiyama, Masashi and Tsuda, Koji},
  booktitle = {Proceedings of the 2nd Asian Conference on Machine Learning (ACML)},
  series    = {JMLR Workshop and Conference Proceedings},
  volume    = {13},
  pages     = {145--160},
  year      = {2010},
  address   = {Tokyo, Japan},
}

Application to molecular fingerprints:

Tabei, Y. and Tsuda, K. (2011). SketchSort: Fast All Pairs Similarity Search for Large Databases of Molecular Fingerprints. Molecular Informatics 30(9): 801–807. doi:10.1002/minf.201100050

@article{tabei2011sketchsort,
  title   = {SketchSort: Fast All Pairs Similarity Search for Large Databases of Molecular Fingerprints},
  author  = {Tabei, Yasuo and Tsuda, Koji},
  journal = {Molecular Informatics},
  volume  = {30},
  number  = {9},
  pages   = {801--807},
  year    = {2011},
  doi     = {10.1002/minf.201100050},
}

License

MIT for the SketchSort source (see LICENSE). The bundled Boost headers under src/boost/ are distributed under the Boost Software License 1.0 (see NOTICE).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sketchsort-0.1.0.tar.gz (9.0 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

sketchsort-0.1.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (150.9 kB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

sketchsort-0.1.0-cp313-cp313-macosx_11_0_arm64.whl (116.4 kB view details)

Uploaded CPython 3.13macOS 11.0+ ARM64

sketchsort-0.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (150.9 kB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

sketchsort-0.1.0-cp312-cp312-macosx_11_0_arm64.whl (116.4 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

sketchsort-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (149.1 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

sketchsort-0.1.0-cp311-cp311-macosx_11_0_arm64.whl (113.7 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

sketchsort-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (148.4 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

sketchsort-0.1.0-cp310-cp310-macosx_11_0_arm64.whl (112.3 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

sketchsort-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (148.2 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

sketchsort-0.1.0-cp39-cp39-macosx_11_0_arm64.whl (112.4 kB view details)

Uploaded CPython 3.9macOS 11.0+ ARM64

File details

Details for the file sketchsort-0.1.0.tar.gz.

File metadata

  • Download URL: sketchsort-0.1.0.tar.gz
  • Upload date:
  • Size: 9.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sketchsort-0.1.0.tar.gz
Algorithm Hash digest
SHA256 2c9dd5f70615461683f7954482778edc409a53ee47ba93d46180e50cfb1962cb
MD5 1bd6f2484f3d2c7ba2cabd2617df7e4e
BLAKE2b-256 cdc71312f47a0fde9be6a16e482fc9f8acb3eadfdc748e5a9b11e5e4a84a4dd3

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0.tar.gz:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 902ae89f394332f93c8f92fadf67853e3c3f8459b285c18251527049fd79ec00
MD5 72333529e0df839c858b3406c3e2b66f
BLAKE2b-256 585c83b2c76a97ef56eae2d05602e272954911f4fdcd8cfdf71fa3c021382744

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 1423b2f3e9c9c0922d22b5a683417aea94ccfa0915c3222be404c77f2a7aa2c7
MD5 501537afb267b7b9ed7952586aa333c2
BLAKE2b-256 25d253f875cc8af37f4aa14cf424868ae57205a013e4cfdbc17f547eacddb176

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp313-cp313-macosx_11_0_arm64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 4fdc20a586e64c0235e0813657ac881cd6fecbe906c405d6093003d1f4db3970
MD5 773417fddcfa944fa4eb4d4d6c10c86b
BLAKE2b-256 754eb7f19a54d5433c627344676b1ecd29f2c158f7a808de597790dbe5e7402e

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fcb53bb180a341dbe4745237bc07b12956b1c0b31d9576b4912a3362ca771c00
MD5 cdc9de1dc929329d8b70a59a47042981
BLAKE2b-256 91d451b6675a1a4555f6503e65907c8b30e6dc222b8853c4a3f12e5e53fc7a98

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp312-cp312-macosx_11_0_arm64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d6ec32b323e294bc95877281f4a9afecbf174e26174a63c6a1fd4f0d66d4086d
MD5 64bad826c967d907f80ee38003883939
BLAKE2b-256 9f7d18346a8fa902adacae8cf18fa5f2b75ea2d0458247b93d9315ab0e940e81

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6c29f5305a7d0164d24c84371ae1db427bfa5e26e6cda52753252dfd2f449926
MD5 7a7c30c8462c561dfff0a984ffca8fe2
BLAKE2b-256 5966de442008492759c8565ab832caf12d1ed987d59bb795f9aae763e1381869

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp311-cp311-macosx_11_0_arm64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e4bae4922baa90f00438cdcffb72af2ed7beb95cc7cf4b43438908c467932fbc
MD5 77026da26269cfcdca9c199fd504ca21
BLAKE2b-256 506d9a0641d0f484bbfed5b214503503ab6cf8d9673769e4fa5c3c0ee6deb429

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 749806a50a8d1464d87453cfb91261731b5c509df499904a01f2479d98394e56
MD5 950bd03458409528e111489b94d1b5a9
BLAKE2b-256 823620c4260f7b4deb201ea67045f1b46e5e88cd7b58676570606015953a0c4f

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp310-cp310-macosx_11_0_arm64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ffa197bd3b7b994cd71e64962f1f603695e394cc78ed4f5d24245e37a3c64792
MD5 2514bb6e7bf4c404b5bac96bcfc38843
BLAKE2b-256 ac0955a101d1a47343bcfc20993400801033dd31a87d545f2e94b3317d67c02c

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sketchsort-0.1.0-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for sketchsort-0.1.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d9aadfe0718a7bdeb5b5a0d03bdf557e9df69a462b63a5ccf68a3f1c2c2e4785
MD5 59f6b7710e891d110a02e6e32de75479
BLAKE2b-256 594e28a1f32883a427969445fbdd94b66be46c1175f834952293a58ed5d3dd2f

See more details on using hashes here.

Provenance

The following attestation bundles were made for sketchsort-0.1.0-cp39-cp39-macosx_11_0_arm64.whl:

Publisher: wheels.yml on tb-yasu/sketchsort

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page