Skip to main content

High-performance FastSketch with SIMD acceleration to deduplicate large-scale data

Project description

FastSketchLSH

Introduction

FastSketchLSH delivers a Python-first package that wraps a high-performance C++/SIMD implementation of Fast Similarity Sketch (see Dahlgaard et al., FOCS'17 arXiv:1704.04370 for the underlying algorithm). The goal is to make Jaccard estimation and locality-sensitive hashing (LSH) practical for large dataset deduplication.

FastSimilaritySketch throughput advantage

Headline Results

  • FastSimilaritySketch maintains sub-millisecond sketch times even when each set holds 1 600 tokens, keeping the absolute Jaccard error around 0.03–0.06.
  • At the sketch level, FastSimilaritySketch stays 200×–990× faster than datasketch MinHash while matching its accuracy on the included microbenchmarks.
  • Ground-truth comparisons confirm FastSketchLSH remains competitive on deduplication quality.

Current package version: 1.0.1. This patch release keeps the v1.0.0 API surface unchanged and adds pytest collection isolation so default test runs stay on the local extension under active development.

From v0.2.0 to v1.0.0

v1.0.0 keeps the same native kernels but simplifies the public Python API and adds a true LSH one-shot duplicate path. The main user-visible changes are:

  • FastSimilaritySketch(sketch_size=...) -> FastSimilaritySketch(size=...)
  • sketch(...) -> __call__(...)
  • sketch_batch(...) -> batch(...)
  • sketch_batch_flat_csr_prehashed(...) -> batch_csr(..., prehashed=True)
  • build_from_batch(...) -> insert(...)
  • query_candidates(...) / batch_query_csr(...) -> query(...)
  • New: insert_and_query_duplicates(...) for fused LSH build + duplicate flagging

The full upgrade record lives in docs/release-notes-v0.2.0-to-v1.0.0.md.

Pre-hashed inputs are still supported; they are now exposed through the unified prehashed=True flag:

import numpy as np
from FastSketchLSH import FastSimilaritySketch

sketcher = FastSimilaritySketch(size=256, seed=42)

# Single sketch from pre-hashed values (zero-copy from NumPy)
hashes = np.array([0xDEAD, 0xBEEF, 0xCAFE, ...], dtype=np.uint64)
digest = sketcher(hashes, prehashed=True)

# Batch of pre-hashed arrays
batch = [np.array([...], dtype=np.uint64) for _ in range(1000)]
digests = sketcher.batch(batch, prehashed=True, num_threads=8)

# CSR layout for maximum throughput
data = np.array([...], dtype=np.uint64)
indptr = np.array([0, 120, 250, 500], dtype=np.uint64)
digests = sketcher.batch_csr(data, indptr, prehashed=True, num_threads=8)

All prehashed paths share the same SIMD-accelerated Round 1 / Round 2 bucket-fill logic and OpenMP batch parallelism as the token-based paths.

How It Works

  • Fast Similarity Sketching: SIMD-accelerated permutations compress a set into a fixed-length signature, expected time O(n + k log k) with O(k) space.
  • Banded LSH: Signature rows are grouped into bands; items colliding in any band become candidates for deduplication.
  • Python ergonomics: Thin wrappers expose the C++ core, plus reference implementations of competing sketches for fair comparisons.

Installation

Prerequisite: Python 3.11 or newer. Support for Python 3.8 and older interpreters is on the roadmap.

PyPI (recommended)

pip install fastsketchlsh

Build from source

  1. Build the native extension:
    cd fastsketchlsh_ext
    pip install .
    
    This installs the FastSketchLSH Python module with SIMD kernels.
  2. Install benchmark utilities (optional for reproducing experiments):
    pip install -r requirements.txt
    
  3. Activate your environment (e.g. source .venv/bin/activate) before running scripts.

Quick Start

Sketch two sets and estimate their Jaccard similarity

from FastSketchLSH import FastSimilaritySketch, estimate_jaccard

# Build list_a with 16,000 tokens labeled "a-0" to "a-15999"
# Build list_b with 8,000 overlapping + 8,000 new tokens (true Jaccard = 1/3)
list_a = [f"a-{i}" for i in range(16_000)]
list_b = [f"a-{i}" for i in range(8_000)] + [f"b-{i}" for i in range(8_000)]

sketcher = FastSimilaritySketch(size=256)
sig_a = sketcher(list_a)
sig_b = sketcher(list_b)

estimated = estimate_jaccard(sig_a, sig_b)
print(f"Estimated Jaccard similarity: {estimated:.4f}")

Deduplication with LSH

This end-to-end sample downloads a small slice of Hugging Face’s lucadiliello/bookcorpusopen corpus, sketches every document with k=128, and groups the signatures into 16 bands. Sketching each document costs O(n + k log k) time with O(k) space, while an LSH probe runs in O(k + c) where c is the number of retrieved candidates. For pure duplicate-flagging, insert_and_query_duplicates(...) fuses LSH build and duplicate detection into one pass.

from __future__ import annotations

from datasets import load_dataset
from FastSketchLSH import FastSimilaritySketch, LSH


def tokenize(text: str) -> list[str]:
    return sorted({token for token in text.lower().split() if token})


# Here, 'train[:2048]' tells Hugging Face Datasets to select only the first 2048 rows from the 'train' split.
dataset = load_dataset(
    "lucadiliello/bookcorpusopen",
    split="train[:2048]")

texts = [row["text"] for row in dataset if row.get("text")]
token_sets = [tokenize(text) for text in texts]

sketcher = FastSimilaritySketch(size=128, seed=42)
sketch_matrix = sketcher.batch(token_sets)

lsh = LSH(num_perm=128, num_bands=16)
# One-shot path: insert the batch and return duplicate flags immediately.
dup_flags = lsh.insert_and_query_duplicates(sketch_matrix).tolist()

doc_idx = 0
candidates = lsh.query(sketch_matrix[doc_idx])
print(f"Candidates for {doc_idx}:", candidates)

print("Duplicate flags:", dup_flags)
print("Total duplicates detected:", sum(dup_flags))

Experiment Summaries

  • Sketch microbenchmarks (exps/sketch/): Full write-up, CSVs, and plotting helpers demonstrating latency and accuracy versus standard MinHash baselines. Reproduction steps live in exps/sketch/README.md.
  • Ground-truth accuracy (exps/accuracy/): Jaccard estimation and dedup quality measured against labelled datasets. See exps/accuracy/README.md for reproduction commands.
  • End-to-end pipelines (exps/end2end/): Thread-scaled deduplication sweeps on large corpora, plus scripts for batch comparisons. The current single-thread FastSketch path uses true LSH one-shot duplicate flagging. Details in exps/end2end/README.md.

Each experiment directory includes figures, CSV outputs, and exact command lines so you can replicate every result.

Key Points

  • FastSketchLSH packages a SIMD-backed sketch with Python convenience wrappers.
  • Headline benchmarks show up to 990× throughput gains over classic MinHash at comparable accuracy.
  • Ready-to-run examples cover sketching, LSH-based deduplication, and full dataset experiments.
  • For deeper reproduction details, consult the README in each experiment subdirectory.

Future Work

  • A MapReduce/Spark demo to deduplicate large datasets in distributed systems.
  • A friendlier Python interface aligned with datasketch ergonomics.

License

MIT. Research and educational use welcome.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastsketchlsh-1.0.1.tar.gz (55.7 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

fastsketchlsh-1.0.1-cp312-cp312-win_amd64.whl (140.0 kB view details)

Uploaded CPython 3.12Windows x86-64

fastsketchlsh-1.0.1-cp312-cp312-win32.whl (128.2 kB view details)

Uploaded CPython 3.12Windows x86

fastsketchlsh-1.0.1-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (3.2 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.26+ x86-64manylinux: glibc 2.28+ x86-64

fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_x86_64.whl (472.6 kB view details)

Uploaded CPython 3.12macOS 14.0+ x86-64

fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_arm64.whl (176.8 kB view details)

Uploaded CPython 3.12macOS 14.0+ ARM64

fastsketchlsh-1.0.1-cp311-cp311-win_amd64.whl (138.5 kB view details)

Uploaded CPython 3.11Windows x86-64

fastsketchlsh-1.0.1-cp311-cp311-win32.whl (127.3 kB view details)

Uploaded CPython 3.11Windows x86

fastsketchlsh-1.0.1-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (3.1 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.26+ x86-64manylinux: glibc 2.28+ x86-64

fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_x86_64.whl (470.4 kB view details)

Uploaded CPython 3.11macOS 14.0+ x86-64

fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_arm64.whl (175.9 kB view details)

Uploaded CPython 3.11macOS 14.0+ ARM64

File details

Details for the file fastsketchlsh-1.0.1.tar.gz.

File metadata

  • Download URL: fastsketchlsh-1.0.1.tar.gz
  • Upload date:
  • Size: 55.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-1.0.1.tar.gz
Algorithm Hash digest
SHA256 19841ca74853ec0c4fa8bdcdb3dadec3613c524c5fbab05c6aef99017a0f6bf7
MD5 db59d389f5d240cbc4a8d075102d2fb5
BLAKE2b-256 8497e12ebd043525836d29dd72218be78b1c113fdf4bdb3245259b39bf99a1cb

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-1.0.1-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 936f7adf902f5ee2372d58d22270368b17d4b61b0c886169d8be4ad62fd6a1ca
MD5 a60367dad59a26f235a006c1fa9d5940
BLAKE2b-256 bcbff13ac3b133708c50175f39c7a6570783772880cc589805942b4bd6d14e9b

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp312-cp312-win32.whl.

File metadata

  • Download URL: fastsketchlsh-1.0.1-cp312-cp312-win32.whl
  • Upload date:
  • Size: 128.2 kB
  • Tags: CPython 3.12, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-1.0.1-cp312-cp312-win32.whl
Algorithm Hash digest
SHA256 553570d220d32ca1abc2239ceccd3d82d3e35d771b3cd57be480e51f174fad59
MD5 4bd15323b273f3b5a4dd542f6ec68dde
BLAKE2b-256 4ca0571038fbfd95fb8f59b8bc9cb9c7d50b61693895492ca35e89f502b57e1c

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-1.0.1-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 0328cf36fed22302449a258595142ea98f5c9cf12d061553551fe11248c6dbff
MD5 112a6bf561502c6214e9991052801925
BLAKE2b-256 255b56f7e085b9ab0ea9970cf18268c63bc504725089f6d4bb603dd7ca95fb94

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_x86_64.whl
Algorithm Hash digest
SHA256 73e4349beb62ffe1b43af73229ec9b09790ea1b1a0c43a2f5e65f94e7bcae69d
MD5 bb87cb20bdcfa76668b58dd6c5fc6240
BLAKE2b-256 747dc4d741e59c132f3b5ce9dcb24bf46b7430d0f7598e6e154a30e54a095d7f

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_arm64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-1.0.1-cp312-cp312-macosx_14_0_arm64.whl
Algorithm Hash digest
SHA256 93cc190a95010bb14c8852d4a92d730961fa876ec2e040d095ea0564d3115915
MD5 3b41de64052f784a270650cd10fc23c1
BLAKE2b-256 534e5af64ab17f85347259a88af3abc11ced0adeb73654b6d4c29c3107a1563b

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-1.0.1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 ae392b78d1e30450f4eed15fcdf05733b22eff21d4356a7117fc8712855c2d4f
MD5 cfbaf1a23e3da79f1cca1b43770527f8
BLAKE2b-256 66b76f7f362a80fd863571d5023312eaf3b714b46edf7550a03058f32212342f

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp311-cp311-win32.whl.

File metadata

  • Download URL: fastsketchlsh-1.0.1-cp311-cp311-win32.whl
  • Upload date:
  • Size: 127.3 kB
  • Tags: CPython 3.11, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-1.0.1-cp311-cp311-win32.whl
Algorithm Hash digest
SHA256 e467ccff7302875b067534b447fabf54fe8c4b6770a13f84be08a57d330bd4c3
MD5 1e1fa16d9acdc6cd5bad35987dda4ca8
BLAKE2b-256 63cd2b86133e658730f176ab8c736f94128e7449e358bcc3a2e72641776c269f

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-1.0.1-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 0b369cbcd8b6b4cce9b082eae560371234360e7d875ca7a5cc187c84e8b197a3
MD5 d18be3f2531df945922bbe61cbec1428
BLAKE2b-256 dcea2ff92db31dd79936ff4dc03c52e6831effe381780a8c324ab4ea739f8c78

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_x86_64.whl
Algorithm Hash digest
SHA256 4d5e655563cca60f79d50ed9830fa60f62b92914a9a632057d7bc37c5a4e1afb
MD5 c352a90113bd4913aea52273c9bb1437
BLAKE2b-256 8084ad90fe2ff1c6d89e4344687dc1b88a8dc18260e6191cbcc55af1bbeeb8b2

See more details on using hashes here.

File details

Details for the file fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_arm64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-1.0.1-cp311-cp311-macosx_14_0_arm64.whl
Algorithm Hash digest
SHA256 3e42b222c47b4a789594eb7b5196f5a7b065a379366b7f831118761506a4eb3c
MD5 c6a07fc85b561282b4e5732fe678decf
BLAKE2b-256 0c99beab0aac79103b8b8dc441fab27a4c81cc6899f106d7c08ecd0476eb0f51

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page