High-performance FastSketch with SIMD acceleration to deduplicate large-scale data

These details have not been verified by PyPI

Project links

Project description

FastSketchLSH

Introduction

FastSketchLSH delivers a Python-first package that wraps a high-performance C++/SIMD implementation of Fast Similarity Sketch (see Dahlgaard et al., FOCS'17 arXiv:1704.04370 for the underlying algorithm). The goal is to make Jaccard estimation and locality-sensitive hashing (LSH) practical for large dataset deduplication.

FastSimilaritySketch throughput advantage

Dataset	Engine	Sketch (s)	Build (s)	Query (s)	Total (s)	FastSketchLSH Sketch Speedup	FastSketchLSH Total Speedup
BOOKCORPUSOPEN	rensa	198.545	0.026	0.018	198.589	-	-
BOOKCORPUSOPEN	fastsketchlsh	55.280	0.039	0.031	55.350	3.59×	3.59×
BOOKS3	rensa	95.915	0.005	0.003	95.923	-	-
BOOKS3	fastsketchlsh	28.440	0.008	0.007	28.455	3.37×	3.37×
PINECONE	rensa	3.929	0.141	0.153	4.223	-	-
PINECONE	fastsketchlsh	1.521	0.249	0.396	2.166	2.58×	1.95×
SHUYUEJ	rensa	3.749	0.037	0.044	3.830	-	-
SHUYUEJ	fastsketchlsh	1.132	0.093	0.121	1.346	3.31×	2.85×

Headline Results

FastSimilaritySketch maintains sub-millisecond sketch times even when each set holds 1 600 tokens, keeping the absolute Jaccard error around 0.03–0.06.
At the sketch level, FastSimilaritySketch stays 200×–990× faster than datasketch MinHash and still 8×–23× faster than Rensa’s CMinHash/RMinHash, while matching their accuracy—these gains matter most for large documents.
End-to-end deduplication experiments show FastSketchLSH is typically ~2×–3.6× faster than Rensa in single-thread runs.
Ground-truth comparisons confirm FastSketchLSH matches or slightly exceeds the deduplication accuracy of both Rensa and datasketch.

What's New in v0.2.0

Pre-hashed input support -- sketch_prehashed, sketch_batch_prehashed, and sketch_batch_flat_csr_prehashed methods now accept np.uint64 or np.int64 arrays of user-provided hash values directly. This skips the internal prehash phase entirely (no hash_int32, no fnv1a64), which is useful when you hash tokens yourself or reuse hash values across different sketch configurations.

import numpy as np
from FastSketchLSH import FastSimilaritySketch

sketcher = FastSimilaritySketch(sketch_size=256, seed=42)

# Single sketch from pre-hashed values (zero-copy from NumPy)
hashes = np.array([0xDEAD, 0xBEEF, 0xCAFE, ...], dtype=np.uint64)
digest = sketcher.sketch_prehashed(hashes)

# Batch of pre-hashed arrays
batch = [np.array([...], dtype=np.uint64) for _ in range(1000)]
digests = sketcher.sketch_batch_prehashed(batch, num_threads=8)

# CSR layout for maximum throughput
data = np.array([...], dtype=np.uint64)
indptr = np.array([0, 120, 250, 500], dtype=np.uint64)
digests = sketcher.sketch_batch_flat_csr_prehashed(data, indptr, num_threads=8)

All prehashed paths share the same SIMD-accelerated Round 1 / Round 2 bucket-fill logic and OpenMP batch parallelism as the existing sketch methods.

How It Works

Fast Similarity Sketching: SIMD-accelerated permutations compress a set into a fixed-length signature, expected time O(n + k log k) with O(k) space.
Banded LSH: Signature rows are grouped into bands; items colliding in any band become candidates for deduplication.
Python ergonomics: Thin wrappers expose the C++ core, plus reference implementations of competing sketches for fair comparisons.

Installation

Prerequisite: Python 3.11 or newer. Support for Python 3.8 and older interpreters is on the roadmap.

PyPI (recommended)

pip install fastsketchlsh

Build from source

Build the native extension:
```
cd fastsketchlsh_ext
pip install .
```
This installs the FastSketchLSH Python module with SIMD kernels.
Install benchmark utilities (optional for reproducing experiments):
```
pip install -r requirements.txt
```
Activate your environment (e.g. source .venv/bin/activate) before running scripts.

Quick Start

Sketch two sets and estimate their Jaccard similarity

from FastSketchLSH import FastSimilaritySketch, estimate_jaccard

# Build list_a with 16,000 tokens labeled "a-0" to "a-15999"
# Build list_b with 8,000 overlapping + 8,000 new tokens (true Jaccard = 1/3)
list_a = [f"a-{i}" for i in range(16_000)]
list_b = [f"a-{i}" for i in range(8_000)] + [f"b-{i}" for i in range(8_000)]

sketcher = FastSimilaritySketch(sketch_size=256)
sig_a = sketcher.sketch(list_a)
sig_b = sketcher.sketch(list_b)

estimated = estimate_jaccard(sig_a, sig_b)
print(f"Estimated Jaccard similarity: {estimated:.4f}")

Deduplication with LSH

This end-to-end sample downloads a small slice of Hugging Face’s lucadiliello/bookcorpusopen corpus, sketches every document with k=128, and groups the signatures into 16 bands. Sketching each document costs O(n + k log k) time with O(k) space, while an LSH probe runs in O(k + c) where c is the number of retrieved candidates.

from __future__ import annotations

from datasets import load_dataset
from FastSketchLSH import FastSimilaritySketch, LSH


def tokenize(text: str) -> list[str]:
    return sorted({token for token in text.lower().split() if token})


# Here, 'train[:2048]' tells Hugging Face Datasets to select only the first 2048 rows from the 'train' split.
dataset = load_dataset(
    "lucadiliello/bookcorpusopen",
    split="train[:2048]")

texts = [row["text"] for row in dataset if row.get("text")]
token_sets = [tokenize(text) for text in texts]

sketcher = FastSimilaritySketch(sketch_size=128, seed=42)
# Use batch mode for faster sketching (much faster than one-by-one)
sketch_matrix = sketcher.sketch_batch(token_sets)

lsh = LSH(num_perm=128, num_bands=16)
lsh.build_from_batch(sketch_matrix)

doc_idx = 0
candidates = lsh.query_candidates(sketch_matrix[doc_idx])
print(f"Candidates for {doc_idx}:", candidates)

dup_flags = [1 if len(lsh.query_candidates(row)) > 1 else 0 for row in sketch_matrix]
print("Duplicate flags:", dup_flags)
print("Total duplicates detected:", sum(dup_flags))

Experiment Summaries

Sketch microbenchmarks (exps/sketch/): Full write-up, CSVs, and plotting helpers demonstrating latency and accuracy versus datasketch and Rensa baselines. Reproduction steps live in exps/sketch/README.md.
Ground-truth accuracy (exps/accuracy/): Jaccard estimation and dedup quality measured against labelled datasets. See exps/accuracy/README.md for reproduction commands.
End-to-end pipelines (exps/end2end/): Thread-scaled deduplication sweeps on large corpora, plus scripts for batch comparisons. Details in exps/end2end/README.md.

Each experiment directory includes figures, CSV outputs, and exact command lines so you can replicate every result.

Key Points

FastSketchLSH packages a SIMD-backed sketch with Python convenience wrappers.
Headline benchmarks show up to 990× throughput gains over classic MinHash at comparable accuracy.
Ready-to-run examples cover sketching, LSH-based deduplication, and full dataset experiments.
For deeper reproduction details, consult the README in each experiment subdirectory.

Future Work

A MapReduce/Spark demo to deduplicate large datasets in distributed systems.
A friendlier Python interface aligned with datasketch ergonomics.

License

MIT. Research and educational use welcome.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.0.1

Mar 17, 2026

This version

0.2.0

Feb 19, 2026

0.1.2

Feb 12, 2026

0.1.1

Nov 23, 2025

0.1.0

Nov 10, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastsketchlsh-0.2.0.tar.gz (55.1 kB view details)

Uploaded Feb 19, 2026 Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

fastsketchlsh-0.2.0-cp312-cp312-win_amd64.whl (151.4 kB view details)

Uploaded Feb 19, 2026 CPython 3.12Windows x86-64

fastsketchlsh-0.2.0-cp312-cp312-win32.whl (137.8 kB view details)

Uploaded Feb 19, 2026 CPython 3.12Windows x86

fastsketchlsh-0.2.0-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (3.5 MB view details)

Uploaded Feb 19, 2026 CPython 3.12manylinux: glibc 2.26+ x86-64manylinux: glibc 2.28+ x86-64

fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_x86_64.whl (500.3 kB view details)

Uploaded Feb 19, 2026 CPython 3.12macOS 14.0+ x86-64

fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_arm64.whl (204.9 kB view details)

Uploaded Feb 19, 2026 CPython 3.12macOS 14.0+ ARM64

fastsketchlsh-0.2.0-cp311-cp311-win_amd64.whl (150.1 kB view details)

Uploaded Feb 19, 2026 CPython 3.11Windows x86-64

fastsketchlsh-0.2.0-cp311-cp311-win32.whl (136.5 kB view details)

Uploaded Feb 19, 2026 CPython 3.11Windows x86

fastsketchlsh-0.2.0-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (3.5 MB view details)

Uploaded Feb 19, 2026 CPython 3.11manylinux: glibc 2.26+ x86-64manylinux: glibc 2.28+ x86-64

fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_x86_64.whl (497.9 kB view details)

Uploaded Feb 19, 2026 CPython 3.11macOS 14.0+ x86-64

fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_arm64.whl (204.2 kB view details)

Uploaded Feb 19, 2026 CPython 3.11macOS 14.0+ ARM64

File details

Details for the file fastsketchlsh-0.2.0.tar.gz.

File metadata

Download URL: fastsketchlsh-0.2.0.tar.gz
Upload date: Feb 19, 2026
Size: 55.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`6e772cf512ca21c4e8ecc4faf42e36cb5902465899651fcdf836953098633332`
MD5	`2326367ae608440bedc24ec0cca7d68e`
BLAKE2b-256	`2143c574277552d9b9062e7d29f4236b2b901beab1aaa4b5194c7a562c553a23`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-win_amd64.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp312-cp312-win_amd64.whl
Upload date: Feb 19, 2026
Size: 151.4 kB
Tags: CPython 3.12, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-win_amd64.whl
Algorithm	Hash digest
SHA256	`6d118c7c9e11932767e5dbbe84131319ddc6dd12d4d4388ca879b9c91f42126c`
MD5	`9ebbfcbcd686d96a07f65f02ace3db0a`
BLAKE2b-256	`7d0975f02dfc670126fff6a444ac981f95b9a3970ba7b71b3353fcb5b6c0ee1f`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-win32.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp312-cp312-win32.whl
Upload date: Feb 19, 2026
Size: 137.8 kB
Tags: CPython 3.12, Windows x86
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-win32.whl
Algorithm	Hash digest
SHA256	`f76217580114dd3c21629a7c2e2af6478a150ac98e2f7ea1c9c7e72dd2da7f57`
MD5	`b599b9df329cf52b375d7b710933b220`
BLAKE2b-256	`e7131e0a3f600c91b51914d24a2161b8b586badcb9c18578ba2e6b3229d10c87`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Upload date: Feb 19, 2026
Size: 3.5 MB
Tags: CPython 3.12, manylinux: glibc 2.26+ x86-64, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`2ff5886193ad930b045912060ad391fdbd673dd70761f5677c8d7a50722ebf4a`
MD5	`d340834e65a537e397a7ae2b14171bd9`
BLAKE2b-256	`0628c453cee0af0f690c229ff0bc24b9614db91273ee0935225006a6b3ba290a`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_x86_64.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_x86_64.whl
Upload date: Feb 19, 2026
Size: 500.3 kB
Tags: CPython 3.12, macOS 14.0+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_x86_64.whl
Algorithm	Hash digest
SHA256	`955f2e43e740f37923beb47a7e1245671ded26f71c15cdbd87a35b5b67d50008`
MD5	`f39c0966aed9c4cb2d978d13e35adad1`
BLAKE2b-256	`f22cbd0fbf76fd59b5ce3cf0efd15dea9cdc0ba9e1ffb602a715bdae2e8bc82a`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_arm64.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_arm64.whl
Upload date: Feb 19, 2026
Size: 204.9 kB
Tags: CPython 3.12, macOS 14.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_arm64.whl
Algorithm	Hash digest
SHA256	`99955ad343b74c5689b7d9d5a0a23834077988d06030856864f7311bdeb3e0d7`
MD5	`856b77e8986e90a63bc80f8d4a5da738`
BLAKE2b-256	`a552a0488df428b93cb2de90d4a35e8d22e0b2926ee876a010ada624d6b11ac1`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-win_amd64.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp311-cp311-win_amd64.whl
Upload date: Feb 19, 2026
Size: 150.1 kB
Tags: CPython 3.11, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-win_amd64.whl
Algorithm	Hash digest
SHA256	`beaa8f6902e4d6cd7c283a3092673ed2513c678976cef87ec6839b76d2d58340`
MD5	`02ad72ad539312b98886ac29fe601e74`
BLAKE2b-256	`090ec5c761523cf085d4956c20f43b7f0e9a3a867b16c3d8bdafc48ec3a935f1`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-win32.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp311-cp311-win32.whl
Upload date: Feb 19, 2026
Size: 136.5 kB
Tags: CPython 3.11, Windows x86
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-win32.whl
Algorithm	Hash digest
SHA256	`00799bd9d4b6361dc52dd67203325e34e6dad411a56bec3de54abb6c7c4bff05`
MD5	`ae102e0c7aff3228c0ef77202a319fe4`
BLAKE2b-256	`99dfb0a5b1265770a9b5e8705a47b6b1bce140572f1b8dc7fa466dad56ce4028`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Upload date: Feb 19, 2026
Size: 3.5 MB
Tags: CPython 3.11, manylinux: glibc 2.26+ x86-64, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`54b1a3d189cb768a4fe98317146623ef362c35736af6c190193f73d5d37f80c2`
MD5	`f756bd6798eafb74be4a5b4ced988f0f`
BLAKE2b-256	`b1893baefe13091657d0e9b7b019f56d8ea361c6495e0eddd79f5042d1e1a94c`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_x86_64.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_x86_64.whl
Upload date: Feb 19, 2026
Size: 497.9 kB
Tags: CPython 3.11, macOS 14.0+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_x86_64.whl
Algorithm	Hash digest
SHA256	`62f2082de4d73e0121b969727dab7be342a2a273541c33945b78bf18944521c2`
MD5	`f190132cd7071b1a47960e0b970f18ff`
BLAKE2b-256	`0347382eb880ab9c25f1945704b8e499e2dc32c964c3ca16770cd9c2bbd40c68`

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_arm64.whl.

File metadata

Download URL: fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_arm64.whl
Upload date: Feb 19, 2026
Size: 204.2 kB
Tags: CPython 3.11, macOS 14.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_arm64.whl
Algorithm	Hash digest
SHA256	`5a67746ff27374b1fba19d2b89d89566f5f10e33dc57d513000db8beec36bace`
MD5	`90631872c873e0bb05fcbcae5e110f52`
BLAKE2b-256	`ee91e23c50db9e30a386ff855020eaa95231be9ffe18499290ee117ae4f0f71d`

See more details on using hashes here.

FastSketchLSH 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

FastSketchLSH

Introduction

Headline Results

What's New in v0.2.0

How It Works

Installation

PyPI (recommended)

Build from source

Quick Start

Sketch two sets and estimate their Jaccard similarity

Deduplication with LSH

Experiment Summaries

Key Points

Future Work

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distributions

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes