High-performance FastSketch with SIMD acceleration to deduplicate large-scale data
Project description
FastSketchLSH
Introduction
FastSketchLSH delivers a Python-first package that wraps a high-performance C++/SIMD implementation of Fast Similarity Sketch (see Dahlgaard et al., FOCS'17 arXiv:1704.04370 for the underlying algorithm). The goal is to make Jaccard estimation and locality-sensitive hashing (LSH) practical for large dataset deduplication.
Headline Results
FastSimilaritySketchmaintains sub-millisecond sketch times even when each set holds 1 600 tokens, keeping the absolute Jaccard error around 0.03–0.06.- At the sketch level, FastSimilaritySketch stays 200×–990× faster than
datasketchMinHash while matching its accuracy on the included microbenchmarks. - Ground-truth comparisons confirm FastSketchLSH remains competitive on deduplication quality.
Current package version: 1.0.1.
This patch release keeps the v1.0.0 API surface unchanged and adds pytest collection isolation so default test runs stay on the local extension under active development.
From v0.2.0 to v1.0.0
v1.0.0 keeps the same native kernels but simplifies the public Python API and adds a true LSH one-shot duplicate path. The main user-visible changes are:
FastSimilaritySketch(sketch_size=...)->FastSimilaritySketch(size=...)sketch(...)->__call__(...)sketch_batch(...)->batch(...)sketch_batch_flat_csr_prehashed(...)->batch_csr(..., prehashed=True)build_from_batch(...)->insert(...)query_candidates(...)/batch_query_csr(...)->query(...)- New:
insert_and_query_duplicates(...)for fused LSH build + duplicate flagging
The full upgrade record lives in docs/release-notes-v0.2.0-to-v1.0.0.md.
Pre-hashed inputs are still supported; they are now exposed through the unified prehashed=True flag:
import numpy as np
from FastSketchLSH import FastSimilaritySketch
sketcher = FastSimilaritySketch(size=256, seed=42)
# Single sketch from pre-hashed values (zero-copy from NumPy)
hashes = np.array([0xDEAD, 0xBEEF, 0xCAFE, ...], dtype=np.uint64)
digest = sketcher(hashes, prehashed=True)
# Batch of pre-hashed arrays
batch = [np.array([...], dtype=np.uint64) for _ in range(1000)]
digests = sketcher.batch(batch, prehashed=True, num_threads=8)
# CSR layout for maximum throughput
data = np.array([...], dtype=np.uint64)
indptr = np.array([0, 120, 250, 500], dtype=np.uint64)
digests = sketcher.batch_csr(data, indptr, prehashed=True, num_threads=8)
All prehashed paths share the same SIMD-accelerated Round 1 / Round 2 bucket-fill logic and OpenMP batch parallelism as the token-based paths.
How It Works
- Fast Similarity Sketching: SIMD-accelerated permutations compress a set into a fixed-length signature, expected time
O(n + k log k)withO(k)space. - Banded LSH: Signature rows are grouped into bands; items colliding in any band become candidates for deduplication.
- Python ergonomics: Thin wrappers expose the C++ core, plus reference implementations of competing sketches for fair comparisons.
Installation
Prerequisite: Python 3.11 or newer. Support for Python 3.8 and older interpreters is on the roadmap.
PyPI (recommended)
pip install fastsketchlsh
Build from source
- Build the native extension:
cd fastsketchlsh_ext pip install .
This installs theFastSketchLSHPython module with SIMD kernels. - Install benchmark utilities (optional for reproducing experiments):
pip install -r requirements.txt
- Activate your environment (e.g.
source .venv/bin/activate) before running scripts.
Quick Start
Sketch two sets and estimate their Jaccard similarity
from FastSketchLSH import FastSimilaritySketch, estimate_jaccard
# Build list_a with 16,000 tokens labeled "a-0" to "a-15999"
# Build list_b with 8,000 overlapping + 8,000 new tokens (true Jaccard = 1/3)
list_a = [f"a-{i}" for i in range(16_000)]
list_b = [f"a-{i}" for i in range(8_000)] + [f"b-{i}" for i in range(8_000)]
sketcher = FastSimilaritySketch(size=256)
sig_a = sketcher(list_a)
sig_b = sketcher(list_b)
estimated = estimate_jaccard(sig_a, sig_b)
print(f"Estimated Jaccard similarity: {estimated:.4f}")
Deduplication with LSH
This end-to-end sample downloads a small slice of Hugging Face’s lucadiliello/bookcorpusopen corpus, sketches every document with k=128, and groups the signatures into 16 bands. Sketching each document costs O(n + k log k) time with O(k) space, while an LSH probe runs in O(k + c) where c is the number of retrieved candidates. For pure duplicate-flagging, insert_and_query_duplicates(...) fuses LSH build and duplicate detection into one pass.
from __future__ import annotations
from datasets import load_dataset
from FastSketchLSH import FastSimilaritySketch, LSH
def tokenize(text: str) -> list[str]:
return sorted({token for token in text.lower().split() if token})
# Here, 'train[:2048]' tells Hugging Face Datasets to select only the first 2048 rows from the 'train' split.
dataset = load_dataset(
"lucadiliello/bookcorpusopen",
split="train[:2048]")
texts = [row["text"] for row in dataset if row.get("text")]
token_sets = [tokenize(text) for text in texts]
sketcher = FastSimilaritySketch(size=128, seed=42)
sketch_matrix = sketcher.batch(token_sets)
lsh = LSH(num_perm=128, num_bands=16)
# One-shot path: insert the batch and return duplicate flags immediately.
dup_flags = lsh.insert_and_query_duplicates(sketch_matrix).tolist()
doc_idx = 0
candidates = lsh.query(sketch_matrix[doc_idx])
print(f"Candidates for {doc_idx}:", candidates)
print("Duplicate flags:", dup_flags)
print("Total duplicates detected:", sum(dup_flags))
Experiment Summaries
- Sketch microbenchmarks (
exps/sketch/): Full write-up, CSVs, and plotting helpers demonstrating latency and accuracy versus standard MinHash baselines. Reproduction steps live inexps/sketch/README.md. - Ground-truth accuracy (
exps/accuracy/): Jaccard estimation and dedup quality measured against labelled datasets. Seeexps/accuracy/README.mdfor reproduction commands. - End-to-end pipelines (
exps/end2end/): Thread-scaled deduplication sweeps on large corpora, plus scripts for batch comparisons. The current single-thread FastSketch path uses true LSH one-shot duplicate flagging. Details inexps/end2end/README.md.
Each experiment directory includes figures, CSV outputs, and exact command lines so you can replicate every result.
Key Points
- FastSketchLSH packages a SIMD-backed sketch with Python convenience wrappers.
- Headline benchmarks show up to 990× throughput gains over classic MinHash at comparable accuracy.
- Ready-to-run examples cover sketching, LSH-based deduplication, and full dataset experiments.
- For deeper reproduction details, consult the README in each experiment subdirectory.
Future Work
- A MapReduce/Spark demo to deduplicate large datasets in distributed systems.
- A friendlier Python interface aligned with
datasketchergonomics.
License
MIT. Research and educational use welcome.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file fastsketchlsh-1.0.1.tar.gz.
File metadata
- Download URL: fastsketchlsh-1.0.1.tar.gz
- Upload date:
- Size: 55.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
19841ca74853ec0c4fa8bdcdb3dadec3613c524c5fbab05c6aef99017a0f6bf7
|
|
| MD5 |
db59d389f5d240cbc4a8d075102d2fb5
|
|
| BLAKE2b-256 |
8497e12ebd043525836d29dd72218be78b1c113fdf4bdb3245259b39bf99a1cb
|
File details
Details for the file fastsketchlsh-1.0.1-cp312-cp312-win_amd64.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp312-cp312-win_amd64.whl
- Upload date:
- Size: 140.0 kB
- Tags: CPython 3.12, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
936f7adf902f5ee2372d58d22270368b17d4b61b0c886169d8be4ad62fd6a1ca
|
|
| MD5 |
a60367dad59a26f235a006c1fa9d5940
|
|
| BLAKE2b-256 |
bcbff13ac3b133708c50175f39c7a6570783772880cc589805942b4bd6d14e9b
|
File details
Details for the file fastsketchlsh-1.0.1-cp312-cp312-win32.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp312-cp312-win32.whl
- Upload date:
- Size: 128.2 kB
- Tags: CPython 3.12, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
553570d220d32ca1abc2239ceccd3d82d3e35d771b3cd57be480e51f174fad59
|
|
| MD5 |
4bd15323b273f3b5a4dd542f6ec68dde
|
|
| BLAKE2b-256 |
4ca0571038fbfd95fb8f59b8bc9cb9c7d50b61693895492ca35e89f502b57e1c
|
File details
Details for the file fastsketchlsh-1.0.1-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
- Upload date:
- Size: 3.2 MB
- Tags: CPython 3.12, manylinux: glibc 2.26+ x86-64, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0328cf36fed22302449a258595142ea98f5c9cf12d061553551fe11248c6dbff
|
|
| MD5 |
112a6bf561502c6214e9991052801925
|
|
| BLAKE2b-256 |
255b56f7e085b9ab0ea9970cf18268c63bc504725089f6d4bb603dd7ca95fb94
|
File details
Details for the file fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_x86_64.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_x86_64.whl
- Upload date:
- Size: 472.6 kB
- Tags: CPython 3.12, macOS 14.0+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
73e4349beb62ffe1b43af73229ec9b09790ea1b1a0c43a2f5e65f94e7bcae69d
|
|
| MD5 |
bb87cb20bdcfa76668b58dd6c5fc6240
|
|
| BLAKE2b-256 |
747dc4d741e59c132f3b5ce9dcb24bf46b7430d0f7598e6e154a30e54a095d7f
|
File details
Details for the file fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_arm64.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_arm64.whl
- Upload date:
- Size: 176.8 kB
- Tags: CPython 3.12, macOS 14.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
93cc190a95010bb14c8852d4a92d730961fa876ec2e040d095ea0564d3115915
|
|
| MD5 |
3b41de64052f784a270650cd10fc23c1
|
|
| BLAKE2b-256 |
534e5af64ab17f85347259a88af3abc11ced0adeb73654b6d4c29c3107a1563b
|
File details
Details for the file fastsketchlsh-1.0.1-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 138.5 kB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ae392b78d1e30450f4eed15fcdf05733b22eff21d4356a7117fc8712855c2d4f
|
|
| MD5 |
cfbaf1a23e3da79f1cca1b43770527f8
|
|
| BLAKE2b-256 |
66b76f7f362a80fd863571d5023312eaf3b714b46edf7550a03058f32212342f
|
File details
Details for the file fastsketchlsh-1.0.1-cp311-cp311-win32.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp311-cp311-win32.whl
- Upload date:
- Size: 127.3 kB
- Tags: CPython 3.11, Windows x86
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e467ccff7302875b067534b447fabf54fe8c4b6770a13f84be08a57d330bd4c3
|
|
| MD5 |
1e1fa16d9acdc6cd5bad35987dda4ca8
|
|
| BLAKE2b-256 |
63cd2b86133e658730f176ab8c736f94128e7449e358bcc3a2e72641776c269f
|
File details
Details for the file fastsketchlsh-1.0.1-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
- Upload date:
- Size: 3.1 MB
- Tags: CPython 3.11, manylinux: glibc 2.26+ x86-64, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0b369cbcd8b6b4cce9b082eae560371234360e7d875ca7a5cc187c84e8b197a3
|
|
| MD5 |
d18be3f2531df945922bbe61cbec1428
|
|
| BLAKE2b-256 |
dcea2ff92db31dd79936ff4dc03c52e6831effe381780a8c324ab4ea739f8c78
|
File details
Details for the file fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_x86_64.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_x86_64.whl
- Upload date:
- Size: 470.4 kB
- Tags: CPython 3.11, macOS 14.0+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4d5e655563cca60f79d50ed9830fa60f62b92914a9a632057d7bc37c5a4e1afb
|
|
| MD5 |
c352a90113bd4913aea52273c9bb1437
|
|
| BLAKE2b-256 |
8084ad90fe2ff1c6d89e4344687dc1b88a8dc18260e6191cbcc55af1bbeeb8b2
|
File details
Details for the file fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_arm64.whl.
File metadata
- Download URL: fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_arm64.whl
- Upload date:
- Size: 175.9 kB
- Tags: CPython 3.11, macOS 14.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3e42b222c47b4a789594eb7b5196f5a7b065a379366b7f831118761506a4eb3c
|
|
| MD5 |
c6a07fc85b561282b4e5732fe678decf
|
|
| BLAKE2b-256 |
0c99beab0aac79103b8b8dc441fab27a4c81cc6899f106d7c08ecd0476eb0f51
|