Skip to main content

High-performance FastSketch with SIMD acceleration to deduplicate large-scale data

Project description

FastSketchLSH

Introduction

FastSketchLSH delivers a Python-first package that wraps a high-performance C++/SIMD implementation of Fast Similarity Sketch (see Dahlgaard et al., FOCS'17 arXiv:1704.04370 for the underlying algorithm). The goal is to make Jaccard estimation and locality-sensitive hashing (LSH) practical for large dataset deduplication.

FastSimilaritySketch throughput advantage

Dataset Engine Sketch (s) Build (s) Query (s) Total (s) FastSketchLSH Sketch Speedup FastSketchLSH Total Speedup
BOOKCORPUSOPEN rensa 198.545 0.026 0.018 198.589 - -
BOOKCORPUSOPEN fastsketchlsh 55.280 0.039 0.031 55.350 3.59× 3.59×
BOOKS3 rensa 95.915 0.005 0.003 95.923 - -
BOOKS3 fastsketchlsh 28.440 0.008 0.007 28.455 3.37× 3.37×
PINECONE rensa 3.929 0.141 0.153 4.223 - -
PINECONE fastsketchlsh 1.521 0.249 0.396 2.166 2.58× 1.95×
SHUYUEJ rensa 3.749 0.037 0.044 3.830 - -
SHUYUEJ fastsketchlsh 1.132 0.093 0.121 1.346 3.31× 2.85×

Headline Results

  • FastSimilaritySketch maintains sub-millisecond sketch times even when each set holds 1 600 tokens, keeping the absolute Jaccard error around 0.03–0.06.
  • At the sketch level, FastSimilaritySketch stays 200×–990× faster than datasketch MinHash and still 8×–23× faster than Rensa’s CMinHash/RMinHash, while matching their accuracy—these gains matter most for large documents.
  • End-to-end deduplication experiments show FastSketchLSH is typically ~2×–3.6× faster than Rensa in single-thread runs.
  • Ground-truth comparisons confirm FastSketchLSH matches or slightly exceeds the deduplication accuracy of both Rensa and datasketch.

What's New in v0.2.0

Pre-hashed input support -- sketch_prehashed, sketch_batch_prehashed, and sketch_batch_flat_csr_prehashed methods now accept np.uint64 or np.int64 arrays of user-provided hash values directly. This skips the internal prehash phase entirely (no hash_int32, no fnv1a64), which is useful when you hash tokens yourself or reuse hash values across different sketch configurations.

import numpy as np
from FastSketchLSH import FastSimilaritySketch

sketcher = FastSimilaritySketch(sketch_size=256, seed=42)

# Single sketch from pre-hashed values (zero-copy from NumPy)
hashes = np.array([0xDEAD, 0xBEEF, 0xCAFE, ...], dtype=np.uint64)
digest = sketcher.sketch_prehashed(hashes)

# Batch of pre-hashed arrays
batch = [np.array([...], dtype=np.uint64) for _ in range(1000)]
digests = sketcher.sketch_batch_prehashed(batch, num_threads=8)

# CSR layout for maximum throughput
data = np.array([...], dtype=np.uint64)
indptr = np.array([0, 120, 250, 500], dtype=np.uint64)
digests = sketcher.sketch_batch_flat_csr_prehashed(data, indptr, num_threads=8)

All prehashed paths share the same SIMD-accelerated Round 1 / Round 2 bucket-fill logic and OpenMP batch parallelism as the existing sketch methods.

How It Works

  • Fast Similarity Sketching: SIMD-accelerated permutations compress a set into a fixed-length signature, expected time O(n + k log k) with O(k) space.
  • Banded LSH: Signature rows are grouped into bands; items colliding in any band become candidates for deduplication.
  • Python ergonomics: Thin wrappers expose the C++ core, plus reference implementations of competing sketches for fair comparisons.

Installation

Prerequisite: Python 3.11 or newer. Support for Python 3.8 and older interpreters is on the roadmap.

PyPI (recommended)

pip install fastsketchlsh

Build from source

  1. Build the native extension:
    cd fastsketchlsh_ext
    pip install .
    
    This installs the FastSketchLSH Python module with SIMD kernels.
  2. Install benchmark utilities (optional for reproducing experiments):
    pip install -r requirements.txt
    
  3. Activate your environment (e.g. source .venv/bin/activate) before running scripts.

Quick Start

Sketch two sets and estimate their Jaccard similarity

from FastSketchLSH import FastSimilaritySketch, estimate_jaccard

# Build list_a with 16,000 tokens labeled "a-0" to "a-15999"
# Build list_b with 8,000 overlapping + 8,000 new tokens (true Jaccard = 1/3)
list_a = [f"a-{i}" for i in range(16_000)]
list_b = [f"a-{i}" for i in range(8_000)] + [f"b-{i}" for i in range(8_000)]

sketcher = FastSimilaritySketch(sketch_size=256)
sig_a = sketcher.sketch(list_a)
sig_b = sketcher.sketch(list_b)

estimated = estimate_jaccard(sig_a, sig_b)
print(f"Estimated Jaccard similarity: {estimated:.4f}")

Deduplication with LSH

This end-to-end sample downloads a small slice of Hugging Face’s lucadiliello/bookcorpusopen corpus, sketches every document with k=128, and groups the signatures into 16 bands. Sketching each document costs O(n + k log k) time with O(k) space, while an LSH probe runs in O(k + c) where c is the number of retrieved candidates.

from __future__ import annotations

from datasets import load_dataset
from FastSketchLSH import FastSimilaritySketch, LSH


def tokenize(text: str) -> list[str]:
    return sorted({token for token in text.lower().split() if token})


# Here, 'train[:2048]' tells Hugging Face Datasets to select only the first 2048 rows from the 'train' split.
dataset = load_dataset(
    "lucadiliello/bookcorpusopen",
    split="train[:2048]")

texts = [row["text"] for row in dataset if row.get("text")]
token_sets = [tokenize(text) for text in texts]

sketcher = FastSimilaritySketch(sketch_size=128, seed=42)
# Use batch mode for faster sketching (much faster than one-by-one)
sketch_matrix = sketcher.sketch_batch(token_sets)

lsh = LSH(num_perm=128, num_bands=16)
lsh.build_from_batch(sketch_matrix)

doc_idx = 0
candidates = lsh.query_candidates(sketch_matrix[doc_idx])
print(f"Candidates for {doc_idx}:", candidates)

dup_flags = [1 if len(lsh.query_candidates(row)) > 1 else 0 for row in sketch_matrix]
print("Duplicate flags:", dup_flags)
print("Total duplicates detected:", sum(dup_flags))

Experiment Summaries

  • Sketch microbenchmarks (exps/sketch/): Full write-up, CSVs, and plotting helpers demonstrating latency and accuracy versus datasketch and Rensa baselines. Reproduction steps live in exps/sketch/README.md.
  • Ground-truth accuracy (exps/accuracy/): Jaccard estimation and dedup quality measured against labelled datasets. See exps/accuracy/README.md for reproduction commands.
  • End-to-end pipelines (exps/end2end/): Thread-scaled deduplication sweeps on large corpora, plus scripts for batch comparisons. Details in exps/end2end/README.md.

Each experiment directory includes figures, CSV outputs, and exact command lines so you can replicate every result.

Key Points

  • FastSketchLSH packages a SIMD-backed sketch with Python convenience wrappers.
  • Headline benchmarks show up to 990× throughput gains over classic MinHash at comparable accuracy.
  • Ready-to-run examples cover sketching, LSH-based deduplication, and full dataset experiments.
  • For deeper reproduction details, consult the README in each experiment subdirectory.

Future Work

  • A MapReduce/Spark demo to deduplicate large datasets in distributed systems.
  • A friendlier Python interface aligned with datasketch ergonomics.

License

MIT. Research and educational use welcome.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastsketchlsh-0.2.0.tar.gz (55.1 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

fastsketchlsh-0.2.0-cp312-cp312-win_amd64.whl (151.4 kB view details)

Uploaded CPython 3.12Windows x86-64

fastsketchlsh-0.2.0-cp312-cp312-win32.whl (137.8 kB view details)

Uploaded CPython 3.12Windows x86

fastsketchlsh-0.2.0-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (3.5 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.26+ x86-64manylinux: glibc 2.28+ x86-64

fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_x86_64.whl (500.3 kB view details)

Uploaded CPython 3.12macOS 14.0+ x86-64

fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_arm64.whl (204.9 kB view details)

Uploaded CPython 3.12macOS 14.0+ ARM64

fastsketchlsh-0.2.0-cp311-cp311-win_amd64.whl (150.1 kB view details)

Uploaded CPython 3.11Windows x86-64

fastsketchlsh-0.2.0-cp311-cp311-win32.whl (136.5 kB view details)

Uploaded CPython 3.11Windows x86

fastsketchlsh-0.2.0-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (3.5 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.26+ x86-64manylinux: glibc 2.28+ x86-64

fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_x86_64.whl (497.9 kB view details)

Uploaded CPython 3.11macOS 14.0+ x86-64

fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_arm64.whl (204.2 kB view details)

Uploaded CPython 3.11macOS 14.0+ ARM64

File details

Details for the file fastsketchlsh-0.2.0.tar.gz.

File metadata

  • Download URL: fastsketchlsh-0.2.0.tar.gz
  • Upload date:
  • Size: 55.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0.tar.gz
Algorithm Hash digest
SHA256 6e772cf512ca21c4e8ecc4faf42e36cb5902465899651fcdf836953098633332
MD5 2326367ae608440bedc24ec0cca7d68e
BLAKE2b-256 2143c574277552d9b9062e7d29f4236b2b901beab1aaa4b5194c7a562c553a23

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 6d118c7c9e11932767e5dbbe84131319ddc6dd12d4d4388ca879b9c91f42126c
MD5 9ebbfcbcd686d96a07f65f02ace3db0a
BLAKE2b-256 7d0975f02dfc670126fff6a444ac981f95b9a3970ba7b71b3353fcb5b6c0ee1f

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-win32.whl.

File metadata

  • Download URL: fastsketchlsh-0.2.0-cp312-cp312-win32.whl
  • Upload date:
  • Size: 137.8 kB
  • Tags: CPython 3.12, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-win32.whl
Algorithm Hash digest
SHA256 f76217580114dd3c21629a7c2e2af6478a150ac98e2f7ea1c9c7e72dd2da7f57
MD5 b599b9df329cf52b375d7b710933b220
BLAKE2b-256 e7131e0a3f600c91b51914d24a2161b8b586badcb9c18578ba2e6b3229d10c87

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 2ff5886193ad930b045912060ad391fdbd673dd70761f5677c8d7a50722ebf4a
MD5 d340834e65a537e397a7ae2b14171bd9
BLAKE2b-256 0628c453cee0af0f690c229ff0bc24b9614db91273ee0935225006a6b3ba290a

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_x86_64.whl
Algorithm Hash digest
SHA256 955f2e43e740f37923beb47a7e1245671ded26f71c15cdbd87a35b5b67d50008
MD5 f39c0966aed9c4cb2d978d13e35adad1
BLAKE2b-256 f22cbd0fbf76fd59b5ce3cf0efd15dea9cdc0ba9e1ffb602a715bdae2e8bc82a

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_arm64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.2.0-cp312-cp312-macosx_14_0_arm64.whl
Algorithm Hash digest
SHA256 99955ad343b74c5689b7d9d5a0a23834077988d06030856864f7311bdeb3e0d7
MD5 856b77e8986e90a63bc80f8d4a5da738
BLAKE2b-256 a552a0488df428b93cb2de90d4a35e8d22e0b2926ee876a010ada624d6b11ac1

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 beaa8f6902e4d6cd7c283a3092673ed2513c678976cef87ec6839b76d2d58340
MD5 02ad72ad539312b98886ac29fe601e74
BLAKE2b-256 090ec5c761523cf085d4956c20f43b7f0e9a3a867b16c3d8bdafc48ec3a935f1

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-win32.whl.

File metadata

  • Download URL: fastsketchlsh-0.2.0-cp311-cp311-win32.whl
  • Upload date:
  • Size: 136.5 kB
  • Tags: CPython 3.11, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-win32.whl
Algorithm Hash digest
SHA256 00799bd9d4b6361dc52dd67203325e34e6dad411a56bec3de54abb6c7c4bff05
MD5 ae102e0c7aff3228c0ef77202a319fe4
BLAKE2b-256 99dfb0a5b1265770a9b5e8705a47b6b1bce140572f1b8dc7fa466dad56ce4028

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 54b1a3d189cb768a4fe98317146623ef362c35736af6c190193f73d5d37f80c2
MD5 f756bd6798eafb74be4a5b4ced988f0f
BLAKE2b-256 b1893baefe13091657d0e9b7b019f56d8ea361c6495e0eddd79f5042d1e1a94c

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_x86_64.whl
Algorithm Hash digest
SHA256 62f2082de4d73e0121b969727dab7be342a2a273541c33945b78bf18944521c2
MD5 f190132cd7071b1a47960e0b970f18ff
BLAKE2b-256 0347382eb880ab9c25f1945704b8e499e2dc32c964c3ca16770cd9c2bbd40c68

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_arm64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.2.0-cp311-cp311-macosx_14_0_arm64.whl
Algorithm Hash digest
SHA256 5a67746ff27374b1fba19d2b89d89566f5f10e33dc57d513000db8beec36bace
MD5 90631872c873e0bb05fcbcae5e110f52
BLAKE2b-256 ee91e23c50db9e30a386ff855020eaa95231be9ffe18499290ee117ae4f0f71d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page