Skip to main content

High-performance FastSketch with SIMD acceleration to deduplicate large-scale data

Project description

FastSketchLSH

Introduction

FastSketchLSH delivers a Python-first package that wraps a high-performance C++/SIMD implementation of Fast Similarity Sketch (see Dahlgaard et al., FOCS'17 arXiv:1704.04370 for the underlying algorithm). The goal is to make Jaccard estimation and locality-sensitive hashing (LSH) practical for large dataset deduplication.

FastSimilaritySketch throughput advantage

Dataset Engine Sketch (s) Build (s) Query (s) Total (s) FastSketchLSH Sketch Speedup FastSketchLSH Total Speedup
BOOKCORPUSOPEN rensa 198.545 0.026 0.018 198.589 - -
BOOKCORPUSOPEN fastsketchlsh 55.280 0.039 0.031 55.350 3.59× 3.59×
BOOKS3 rensa 95.915 0.005 0.003 95.923 - -
BOOKS3 fastsketchlsh 28.440 0.008 0.007 28.455 3.37× 3.37×
PINECONE rensa 3.929 0.141 0.153 4.223 - -
PINECONE fastsketchlsh 1.521 0.249 0.396 2.166 2.58× 1.95×
SHUYUEJ rensa 3.749 0.037 0.044 3.830 - -
SHUYUEJ fastsketchlsh 1.132 0.093 0.121 1.346 3.31× 2.85×

Headline Results

  • FastSimilaritySketch maintains sub-millisecond sketch times even when each set holds 1 600 tokens, keeping the absolute Jaccard error around 0.03–0.06.
  • At the sketch level, FastSimilaritySketch stays 200×–990× faster than datasketch MinHash and still 8×–23× faster than Rensa’s CMinHash/RMinHash, while matching their accuracy—these gains matter most for large documents.
  • End-to-end deduplication experiments show FastSketchLSH is typically ~2×–3.6× faster than Rensa in single-thread runs.
  • Ground-truth comparisons confirm FastSketchLSH matches or slightly exceeds the deduplication accuracy of both Rensa and datasketch.

How It Works

  • Fast Similarity Sketching: SIMD-accelerated permutations compress a set into a fixed-length signature, expected time O(n + k log k) with O(k) space.
  • Banded LSH: Signature rows are grouped into bands; items colliding in any band become candidates for deduplication.
  • Python ergonomics: Thin wrappers expose the C++ core, plus reference implementations of competing sketches for fair comparisons.

Installation

Prerequisite: Python 3.11 or newer. Support for Python 3.8 and older interpreters is on the roadmap.

PyPI (recommended)

pip install fastsketchlsh

Build from source

  1. Build the native extension:
    cd fastsketchlsh_ext
    pip install .
    
    This installs the FastSketchLSH Python module with SIMD kernels.
  2. Install benchmark utilities (optional for reproducing experiments):
    pip install -r requirements.txt
    
  3. Activate your environment (e.g. source .venv/bin/activate) before running scripts.

Quick Start

Sketch two sets and estimate their Jaccard similarity

from FastSketchLSH import FastSimilaritySketch, estimate_jaccard

# Build list_a with 16,000 tokens labeled "a-0" to "a-15999"
# Build list_b with 8,000 overlapping + 8,000 new tokens (true Jaccard = 1/3)
list_a = [f"a-{i}" for i in range(16_000)]
list_b = [f"a-{i}" for i in range(8_000)] + [f"b-{i}" for i in range(8_000)]

sketcher = FastSimilaritySketch(sketch_size=256)
sig_a = sketcher.sketch(list_a)
sig_b = sketcher.sketch(list_b)

estimated = estimate_jaccard(sig_a, sig_b)
print(f"Estimated Jaccard similarity: {estimated:.4f}")

Deduplication with LSH

This end-to-end sample downloads a small slice of Hugging Face’s lucadiliello/bookcorpusopen corpus, sketches every document with k=128, and groups the signatures into 16 bands. Sketching each document costs O(n + k log k) time with O(k) space, while an LSH probe runs in O(k + c) where c is the number of retrieved candidates.

from __future__ import annotations

from datasets import load_dataset
from FastSketchLSH import FastSimilaritySketch, LSH


def tokenize(text: str) -> list[str]:
    return sorted({token for token in text.lower().split() if token})


# Here, 'train[:2048]' tells Hugging Face Datasets to select only the first 2048 rows from the 'train' split.
dataset = load_dataset(
    "lucadiliello/bookcorpusopen",
    split="train[:2048]")

texts = [row["text"] for row in dataset if row.get("text")]
token_sets = [tokenize(text) for text in texts]

sketcher = FastSimilaritySketch(sketch_size=128, seed=42)
# Use batch mode for faster sketching (much faster than one-by-one)
sketch_matrix = sketcher.sketch_batch(token_sets)

lsh = LSH(num_perm=128, num_bands=16)
lsh.build_from_batch(sketch_matrix)

doc_idx = 0
candidates = lsh.query_candidates(sketch_matrix[doc_idx])
print(f"Candidates for {doc_idx}:", candidates)

dup_flags = [1 if len(lsh.query_candidates(row)) > 1 else 0 for row in sketch_matrix]
print("Duplicate flags:", dup_flags)
print("Total duplicates detected:", sum(dup_flags))

Experiment Summaries

  • Sketch microbenchmarks (exps/sketch/): Full write-up, CSVs, and plotting helpers demonstrating latency and accuracy versus datasketch and Rensa baselines. Reproduction steps live in exps/sketch/README.md.
  • Ground-truth accuracy (exps/accuracy/): Jaccard estimation and dedup quality measured against labelled datasets. See exps/accuracy/README.md for reproduction commands.
  • End-to-end pipelines (exps/end2end/): Thread-scaled deduplication sweeps on large corpora, plus scripts for batch comparisons. Details in exps/end2end/README.md.

Each experiment directory includes figures, CSV outputs, and exact command lines so you can replicate every result.

Key Points

  • FastSketchLSH packages a SIMD-backed sketch with Python convenience wrappers.
  • Headline benchmarks show up to 990× throughput gains over classic MinHash at comparable accuracy.
  • Ready-to-run examples cover sketching, LSH-based deduplication, and full dataset experiments.
  • For deeper reproduction details, consult the README in each experiment subdirectory.

Future Work

  • A MapReduce/Spark demo to deduplicate large datasets in distributed systems.
  • A friendlier Python interface aligned with datasketch ergonomics.

License

MIT. Research and educational use welcome.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastsketchlsh-0.1.2.tar.gz (53.5 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

fastsketchlsh-0.1.2-cp312-cp312-win_amd64.whl (142.5 kB view details)

Uploaded CPython 3.12Windows x86-64

fastsketchlsh-0.1.2-cp312-cp312-win32.whl (131.8 kB view details)

Uploaded CPython 3.12Windows x86

fastsketchlsh-0.1.2-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (3.2 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.26+ x86-64manylinux: glibc 2.28+ x86-64

fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_x86_64.whl (488.1 kB view details)

Uploaded CPython 3.12macOS 14.0+ x86-64

fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_arm64.whl (194.2 kB view details)

Uploaded CPython 3.12macOS 14.0+ ARM64

fastsketchlsh-0.1.2-cp311-cp311-win_amd64.whl (141.6 kB view details)

Uploaded CPython 3.11Windows x86-64

fastsketchlsh-0.1.2-cp311-cp311-win32.whl (130.8 kB view details)

Uploaded CPython 3.11Windows x86

fastsketchlsh-0.1.2-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (3.2 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.26+ x86-64manylinux: glibc 2.28+ x86-64

fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_x86_64.whl (486.0 kB view details)

Uploaded CPython 3.11macOS 14.0+ x86-64

fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_arm64.whl (193.6 kB view details)

Uploaded CPython 3.11macOS 14.0+ ARM64

File details

Details for the file fastsketchlsh-0.1.2.tar.gz.

File metadata

  • Download URL: fastsketchlsh-0.1.2.tar.gz
  • Upload date:
  • Size: 53.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.1.2.tar.gz
Algorithm Hash digest
SHA256 00c16e13d4b6fe0d32152bb532e825c401b7a2764dddc44954ed8a346c043ae0
MD5 686da37ec55d81837f98f2c28acc8355
BLAKE2b-256 5815709e2ff4630d571eb6c7f78ba3ec3b0faaa100a8fcd7cde23d3c036ff7ec

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp312-cp312-win_amd64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.1.2-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 ed387764c73aa56633d726407d2c6f6bd12cefef05b3870b3cbc6afae928a585
MD5 c83d98da930d80ec03e9bbe8778ee687
BLAKE2b-256 d8f09d6c28276f1a94083bdbf8c64fb8cbba3659985578c078edb5e7dfabb949

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp312-cp312-win32.whl.

File metadata

  • Download URL: fastsketchlsh-0.1.2-cp312-cp312-win32.whl
  • Upload date:
  • Size: 131.8 kB
  • Tags: CPython 3.12, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.1.2-cp312-cp312-win32.whl
Algorithm Hash digest
SHA256 890f58d9819a804b17e74a31ce4064602b1d8ac906d001966741d71670d492d7
MD5 0bf572c3446c09df9f2ab42098b61e03
BLAKE2b-256 26a99df25bc9bb93115e2cb6a2b63b441ec51d213adb22d1046870b5dc334e0e

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.1.2-cp312-cp312-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 38b5856e8c32a08293bd10cfc7474b345699697569e04dc017b31973fe4d2be3
MD5 097b1a2b76f8dadc5720fe414722fd2c
BLAKE2b-256 c6c5c06cb4689ec6ceee8d44f05776e733d986e3879945afe5715344ba961fd5

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_x86_64.whl
Algorithm Hash digest
SHA256 fa354d0d30a289464886caa513e3c50c8b89ace71fa9b813e4862cdb40d91c20
MD5 68ff14422fad4826d054f24e165e8ada
BLAKE2b-256 b7ff0a342806936f168fd8839fa157e88cda52d06b736062cbebe96fe56b29a8

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_arm64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.1.2-cp312-cp312-macosx_14_0_arm64.whl
Algorithm Hash digest
SHA256 76a6249b7be5f050647621fcb94424520eceedda7598e4ecde9dbe884076fd00
MD5 0e1e71f246bbdab2e029fd0004f22840
BLAKE2b-256 fce926ed4fa034a0cc69e6472b15094bb71414120eb7ba1b620f55691903d4dc

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.1.2-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 65df43a1c1a626f4ba29a35862069d0e86beee29d63a29130749673b05105240
MD5 6b2e59164463b920bf3347ddad2ab8c8
BLAKE2b-256 5a0c3b484a58603dbb6568d377d6d25c57b562839a6b05f9a3804faed3cd8476

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp311-cp311-win32.whl.

File metadata

  • Download URL: fastsketchlsh-0.1.2-cp311-cp311-win32.whl
  • Upload date:
  • Size: 130.8 kB
  • Tags: CPython 3.11, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastsketchlsh-0.1.2-cp311-cp311-win32.whl
Algorithm Hash digest
SHA256 0506d75ddcd6df23ebaa938e9d7c8498d6fb94ee6517d651566fc3e86e034632
MD5 5bb7b4da55e007fb3117241f9c9ccd72
BLAKE2b-256 b7a11252097ee7c1b306fb02a0e135a359335cd45a6bf1673d3f9b919b80c14c

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.1.2-cp311-cp311-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 9fb147c22a09cdc3022a490bfce7f260befcea75decec7de89df86848a42e5e7
MD5 0cadd4563d4be04f8bc1b7f154a2e827
BLAKE2b-256 5a01a8ca5349696dcde923da1c640ff71d849dbad85082a0065535ee48c05210

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_x86_64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_x86_64.whl
Algorithm Hash digest
SHA256 ee23b5670356dc5db55b2e6dc32ce0c9c70157b070d08ef2b1afacfbe38095f0
MD5 a2f6e6a06f24915f08f7e14ed33de469
BLAKE2b-256 57bc726039399de816d51b9b52d0d64ee303e7a4d3a9f28b270c0dc897cc35f9

See more details on using hashes here.

File details

Details for the file fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_arm64.whl.

File metadata

File hashes

Hashes for fastsketchlsh-0.1.2-cp311-cp311-macosx_14_0_arm64.whl
Algorithm Hash digest
SHA256 a3ae24ed664c9feae2d41b5901aef5dbbfe467cadb92d47dc6c774455b8d5c4e
MD5 1fe686167f1ac22f9543091dde40989d
BLAKE2b-256 c93d8f0c5f2075ea32aacd03f4ad7b51aa1f8676eecee229b5701d6f01874fd2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page