High-performance FastSketch with SIMD acceleration to deduplicate large-scale data

Project description

Installation

You can install FastSketchLSH with pip on all platforms. From the root of a source checkout, run:

pip install .

TODO

  • Return NumPy ndarray when input is NumPy ndarray for single-set sketch overloads (np.uint32/np.int32 inputs).

Usage Example

from FastSketchLSH import FastSimilaritySketch

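def estimate_jaccard(sketch1, sketch2):
    """Estimate Jaccard similarity as the fraction of matching sketch slots."""
    if len(sketch1) != len(sketch2):
        raise ValueError("Sketches must have the same length to compare.")
    # Compare the two sketches position by position and count agreements.
    matches = sum(1 for x, y in zip(sketch1, sketch2) if x == y)
    return matches / len(sketch1)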

if __name__ == '__main__':
    t = 256
    A = set(range(0, 1000))
    B = set(range(500, 1500))
    sketcher = FastSimilaritySketch(sketch_size=t)
    S_A = sketcher.sketch(A)
    S_B = sketcher.sketch(B)
    est_j = estimate_jaccard(S_A, S_B)
    print(f"Estimated Jaccard: {est_j:.4f}")
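To judge whether the estimate printed above is reasonable, it helps to compare it with the exact Jaccard similarity of the same two sets. The sketch below is plain Python and does not use the library; `exact_jaccard` and `approx_std_error` are hypothetical helper names, not part of FastSketchLSH. The error figure assumes each sketch slot matches independently with probability equal to the true similarity, which is a MinHash-style approximation and not a documented guarantee of this package.

```python
import math

def exact_jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = len(a | b)
    if union == 0:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / union

def approx_std_error(j, sketch_size):
    """Rough standard error of a slot-matching estimate.

    ASSUMPTION: every slot matches independently with probability j,
    so the match count is binomial. Treat the result as a rule of thumb.
    """
    return math.sqrt(j * (1.0 - j) / sketch_size)

# Same sets and sketch size as the usage example.
A = set(range(0, 1000))
B = set(range(500, 1500))

true_j = exact_jaccard(A, B)             # 500 shared items out of 1500 total
std_err = approx_std_error(true_j, 256)

print(f"Exact Jaccard: {true_j:.4f}")
print(f"Approximate standard error at t=256: {std_err:.4f}")
```

Under that assumption, an estimate within a few hundredths of one third is consistent with a sketch size of 256, and a larger `sketch_size` should tighten the estimate at the cost of larger sketches.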

Download files

Source Distribution

fastsketchlsh-0.1.0.tar.gz (50.2 kB)

Uploaded Source

File details

Details for the file fastsketchlsh-0.1.0.tar.gz.

File metadata

  • Download URL: fastsketchlsh-0.1.0.tar.gz
  • Upload date:
  • Size: 50.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for fastsketchlsh-0.1.0.tar.gz
Algorithm Hash digest
SHA256 57b98ab2641ec8f0cbe5d93bd8cead34df82c0f626da82ff6d1f8800d6c55836
MD5 07c59e404ac1b24cd0c1a5901430a88c
BLAKE2b-256 77b65227def4c1a1a271dd8d4b4b4983acb0552f64f09b087d30a29eeb9da9ce
