High-performance FastSketch with SIMD acceleration to deduplicate large-scale data
Project description
FastSketchLSH
Introduction
FastSketchLSH delivers a Python-first package that wraps a high-performance C++/SIMD implementation of Fast Similarity Sketch (see Dahlgaard et al., FOCS'17 arXiv:1704.04370 for the underlying algorithm). The goal is to make Jaccard estimation and locality-sensitive hashing (LSH) practical for large dataset deduplication.
| Dataset | Engine | Sketch (s) | Build (s) | Query (s) | Total (s) | FastSketchLSH Sketch Speedup | FastSketchLSH Total Speedup |
|---|---|---|---|---|---|---|---|
| BOOKCORPUSOPEN | rensa | 198.545 | 0.026 | 0.018 | 198.589 | - | - |
| BOOKCORPUSOPEN | fastsketchlsh | 55.280 | 0.039 | 0.031 | 55.350 | 3.59× | 3.59× |
| BOOKS3 | rensa | 95.915 | 0.005 | 0.003 | 95.923 | - | - |
| BOOKS3 | fastsketchlsh | 28.440 | 0.008 | 0.007 | 28.455 | 3.37× | 3.37× |
| PINECONE | rensa | 3.929 | 0.141 | 0.153 | 4.223 | - | - |
| PINECONE | fastsketchlsh | 1.521 | 0.249 | 0.396 | 2.166 | 2.58× | 1.95× |
| SHUYUEJ | rensa | 3.749 | 0.037 | 0.044 | 3.830 | - | - |
| SHUYUEJ | fastsketchlsh | 1.132 | 0.093 | 0.121 | 1.346 | 3.31× | 2.85× |
Headline Results
FastSimilaritySketchmaintains sub-millisecond sketch times even when each set holds 1 600 tokens, keeping the absolute Jaccard error around 0.03–0.06.- At the sketch level, FastSimilaritySketch stays 200×–990× faster than
datasketchMinHash and still 8×–23× faster than Rensa’sCMinHash/RMinHash, while matching their accuracy—these gains matter most for large documents. - End-to-end deduplication experiments show FastSketchLSH is typically ~2×–3.6× faster than Rensa in single-thread runs.
- Ground-truth comparisons confirm FastSketchLSH matches or slightly exceeds the deduplication accuracy of both Rensa and datasketch.
How It Works
- Fast Similarity Sketching: SIMD-accelerated permutations compress a set into a fixed-length signature, expected time
O(n + k log k)withO(k)space. - Banded LSH: Signature rows are grouped into bands; items colliding in any band become candidates for deduplication.
- Python ergonomics: Thin wrappers expose the C++ core, plus reference implementations of competing sketches for fair comparisons.
Installation
Prerequisite: Python 3.11 or newer. Support for Python 3.8 and older interpreters is on the roadmap.
PyPI (recommended)
pip install fastsketchlsh
Build from source
- Build the native extension:
cd fastsketchlsh_ext pip install .
This installs theFastSketchLSHPython module with SIMD kernels. - Install benchmark utilities (optional for reproducing experiments):
pip install -r requirements.txt
- Activate your environment (e.g.
source .venv/bin/activate) before running scripts.
Quick Start
Sketch two sets and estimate their Jaccard similarity
from FastSketchLSH import FastSimilaritySketch, estimate_jaccard
# Build list_a with 16,000 tokens labeled "a-0" to "a-15999"
# Build list_b with 8,000 overlapping + 8,000 new tokens (true Jaccard = 1/3)
list_a = [f"a-{i}" for i in range(16_000)]
list_b = [f"a-{i}" for i in range(8_000)] + [f"b-{i}" for i in range(8_000)]
sketcher = FastSimilaritySketch(sketch_size=256)
sig_a = sketcher.sketch(list_a)
sig_b = sketcher.sketch(list_b)
estimated = estimate_jaccard(sig_a, sig_b)
print(f"Estimated Jaccard similarity: {estimated:.4f}")
Deduplication with LSH
This end-to-end sample downloads a small slice of Hugging Face’s lucadiliello/bookcorpusopen corpus, sketches every document with k=128, and groups the signatures into 16 bands. Sketching each document costs O(n + k log k) time with O(k) space, while an LSH probe runs in O(k + c) where c is the number of retrieved candidates.
from __future__ import annotations
from datasets import load_dataset
from FastSketchLSH import FastSimilaritySketch, LSH
def tokenize(text: str) -> list[str]:
return sorted({token for token in text.lower().split() if token})
# Here, 'train[:2048]' tells Hugging Face Datasets to select only the first 2048 rows from the 'train' split.
dataset = load_dataset(
"lucadiliello/bookcorpusopen",
split="train[:2048]")
texts = [row["text"] for row in dataset if row.get("text")]
token_sets = [tokenize(text) for text in texts]
sketcher = FastSimilaritySketch(sketch_size=128, seed=42)
# Use batch mode for faster sketching (much faster than one-by-one)
sketch_matrix = sketcher.sketch_batch(token_sets)
lsh = LSH(num_perm=128, num_bands=16)
lsh.build_from_batch(sketch_matrix)
doc_idx = 0
candidates = lsh.query_candidates(sketch_matrix[doc_idx])
print(f"Candidates for {doc_idx}:", candidates)
dup_flags = [1 if len(lsh.query_candidates(row)) > 1 else 0 for row in sketch_matrix]
print("Duplicate flags:", dup_flags)
print("Total duplicates detected:", sum(dup_flags))
Experiment Summaries
- Sketch microbenchmarks (
exps/sketch/): Full write-up, CSVs, and plotting helpers demonstrating latency and accuracy versusdatasketchand Rensa baselines. Reproduction steps live inexps/sketch/README.md. - Ground-truth accuracy (
exps/accuracy/): Jaccard estimation and dedup quality measured against labelled datasets. Seeexps/accuracy/README.mdfor reproduction commands. - End-to-end pipelines (
exps/end2end/): Thread-scaled deduplication sweeps on large corpora, plus scripts for batch comparisons. Details inexps/end2end/README.md.
Each experiment directory includes figures, CSV outputs, and exact command lines so you can replicate every result.
Key Points
- FastSketchLSH packages a SIMD-backed sketch with Python convenience wrappers.
- Headline benchmarks show up to 990× throughput gains over classic MinHash at comparable accuracy.
- Ready-to-run examples cover sketching, LSH-based deduplication, and full dataset experiments.
- For deeper reproduction details, consult the README in each experiment subdirectory.
Future Work
- A MapReduce/Spark demo to deduplicate large datasets in distributed systems.
- A friendlier Python interface aligned with
datasketchergonomics.
License
MIT. Research and educational use welcome.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fastsketchlsh-0.1.2.tar.gz.
File metadata
- Download URL: fastsketchlsh-0.1.2.tar.gz
- Upload date:
- Size: 53.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
00c16e13d4b6fe0d32152bb532e825c401b7a2764dddc44954ed8a346c043ae0
|
|
| MD5 |
686da37ec55d81837f98f2c28acc8355
|
|
| BLAKE2b-256 |
5815709e2ff4630d571eb6c7f78ba3ec3b0faaa100a8fcd7cde23d3c036ff7ec
|
File details
Details for the file fastsketchlsh-0.1.2-cp312-cp312-win_amd64.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp312-cp312-win_amd64.whl
- Upload date:
- Size: 142.5 kB
- Tags: CPython 3.12, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ed387764c73aa56633d726407d2c6f6bd12cefef05b3870b3cbc6afae928a585
|
|
| MD5 |
c83d98da930d80ec03e9bbe8778ee687
|
|
| BLAKE2b-256 |
d8f09d6c28276f1a94083bdbf8c64fb8cbba3659985578c078edb5e7dfabb949
|
File details
Details for the file fastsketchlsh-0.1.2-cp312-cp312-win32.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp312-cp312-win32.whl
- Upload date:
- Size: 131.8 kB
- Tags: CPython 3.12, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
890f58d9819a804b17e74a31ce4064602b1d8ac906d001966741d71670d492d7
|
|
| MD5 |
0bf572c3446c09df9f2ab42098b61e03
|
|
| BLAKE2b-256 |
26a99df25bc9bb93115e2cb6a2b63b441ec51d213adb22d1046870b5dc334e0e
|
File details
Details for the file fastsketchlsh-0.1.2-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
- Upload date:
- Size: 3.2 MB
- Tags: CPython 3.12, manylinux: glibc 2.26+ x86-64, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
38b5856e8c32a08293bd10cfc7474b345699697569e04dc017b31973fe4d2be3
|
|
| MD5 |
097b1a2b76f8dadc5720fe414722fd2c
|
|
| BLAKE2b-256 |
c6c5c06cb4689ec6ceee8d44f05776e733d986e3879945afe5715344ba961fd5
|
File details
Details for the file fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_x86_64.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_x86_64.whl
- Upload date:
- Size: 488.1 kB
- Tags: CPython 3.12, macOS 14.0+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
fa354d0d30a289464886caa513e3c50c8b89ace71fa9b813e4862cdb40d91c20
|
|
| MD5 |
68ff14422fad4826d054f24e165e8ada
|
|
| BLAKE2b-256 |
b7ff0a342806936f168fd8839fa157e88cda52d06b736062cbebe96fe56b29a8
|
File details
Details for the file fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_arm64.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_arm64.whl
- Upload date:
- Size: 194.2 kB
- Tags: CPython 3.12, macOS 14.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
76a6249b7be5f050647621fcb94424520eceedda7598e4ecde9dbe884076fd00
|
|
| MD5 |
0e1e71f246bbdab2e029fd0004f22840
|
|
| BLAKE2b-256 |
fce926ed4fa034a0cc69e6472b15094bb71414120eb7ba1b620f55691903d4dc
|
File details
Details for the file fastsketchlsh-0.1.2-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 141.6 kB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
65df43a1c1a626f4ba29a35862069d0e86beee29d63a29130749673b05105240
|
|
| MD5 |
6b2e59164463b920bf3347ddad2ab8c8
|
|
| BLAKE2b-256 |
5a0c3b484a58603dbb6568d377d6d25c57b562839a6b05f9a3804faed3cd8476
|
File details
Details for the file fastsketchlsh-0.1.2-cp311-cp311-win32.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp311-cp311-win32.whl
- Upload date:
- Size: 130.8 kB
- Tags: CPython 3.11, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0506d75ddcd6df23ebaa938e9d7c8498d6fb94ee6517d651566fc3e86e034632
|
|
| MD5 |
5bb7b4da55e007fb3117241f9c9ccd72
|
|
| BLAKE2b-256 |
b7a11252097ee7c1b306fb02a0e135a359335cd45a6bf1673d3f9b919b80c14c
|
File details
Details for the file fastsketchlsh-0.1.2-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
- Upload date:
- Size: 3.2 MB
- Tags: CPython 3.11, manylinux: glibc 2.26+ x86-64, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9fb147c22a09cdc3022a490bfce7f260befcea75decec7de89df86848a42e5e7
|
|
| MD5 |
0cadd4563d4be04f8bc1b7f154a2e827
|
|
| BLAKE2b-256 |
5a01a8ca5349696dcde923da1c640ff71d849dbad85082a0065535ee48c05210
|
File details
Details for the file fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_x86_64.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_x86_64.whl
- Upload date:
- Size: 486.0 kB
- Tags: CPython 3.11, macOS 14.0+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ee23b5670356dc5db55b2e6dc32ce0c9c70157b070d08ef2b1afacfbe38095f0
|
|
| MD5 |
a2f6e6a06f24915f08f7e14ed33de469
|
|
| BLAKE2b-256 |
57bc726039399de816d51b9b52d0d64ee303e7a4d3a9f28b270c0dc897cc35f9
|
File details
Details for the file fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_arm64.whl.
File metadata
- Download URL: fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_arm64.whl
- Upload date:
- Size: 193.6 kB
- Tags: CPython 3.11, macOS 14.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a3ae24ed664c9feae2d41b5901aef5dbbfe467cadb92d47dc6c774455b8d5c4e
|
|
| MD5 |
1fe686167f1ac22f9543091dde40989d
|
|
| BLAKE2b-256 |
c93d8f0c5f2075ea32aacd03f4ad7b51aa1f8676eecee229b5701d6f01874fd2
|