Probabilistic data structures for processing and searching very large datasets

Project description

DOI: 10.5281/zenodo.290602

datasketch gives you probabilistic data structures that can process and search very large amounts of data quickly, with little loss of accuracy.

This package contains the following data sketches:

Data Sketch        Usage
MinHash            estimate Jaccard similarity and cardinality
Weighted MinHash   estimate weighted Jaccard similarity
HyperLogLog        estimate cardinality
HyperLogLog++      estimate cardinality
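
To make the idea behind the MinHash row concrete, here is a minimal, standard-library-only sketch of Jaccard estimation. It is an illustration of the general technique, not the package's implementation: the helper names (`minhash_signature`, `estimate_jaccard`), the SHA-1 based seeded hashing, and the choice of 256 hash functions are all assumptions made for this example. In practice you would use the package's own MinHash class instead.

```python
import hashlib


def _hash(seed: int, token: str) -> int:
    """Hash a token under a given seed to a 64-bit integer (illustrative choice)."""
    data = seed.to_bytes(4, "little") + token.encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "little")


def minhash_signature(tokens, num_perm=256):
    """For each seeded hash function, keep the minimum hash value over the set.

    Assumes a non-empty set of string tokens.
    """
    tokens = set(tokens)
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison on small inputs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


set_a = {f"item{i}" for i in range(0, 150)}
set_b = {f"item{i}" for i in range(50, 200)}

estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
exact = exact_jaccard(set_a, set_b)  # 100 shared items out of 200 distinct items
print(f"estimated={estimate:.3f} exact={exact:.3f}")
```

The signature has a fixed length regardless of how large the input set is, which is what makes the estimate cheap to store and compare. More hash functions give a tighter estimate at the cost of more computation.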

The following indexes for data sketches are provided to support sub-linear query time:

Index                  For Data Sketch             Supported Query Type
MinHash LSH            MinHash, Weighted MinHash   Jaccard Threshold
MinHash LSH Forest     MinHash, Weighted MinHash   Jaccard Top-K
MinHash LSH Ensemble   MinHash                     Containment Threshold
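
The sub-linear query time of an LSH index comes from bucketing signatures so that a query only looks at a few buckets instead of every stored sketch. The toy index below sketches the common banding approach using only the standard library. It is an assumption-laden illustration, not the package's MinHash LSH: `BandedIndex`, its `bands` parameter, and the SHA-1 based `minhash_signature` helper are hypothetical names invented for this example.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_perm=32):
    """Minimum seeded hash value per hash function (assumes a non-empty set)."""
    signature = []
    for seed in range(num_perm):
        prefix = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(hashlib.sha1(prefix + t.encode("utf-8")).digest()[:8], "little")
            for t in tokens
        ))
    return signature


class BandedIndex:
    """Toy LSH index: each signature is split into bands, and each band is a bucket key."""

    def __init__(self, num_perm=32, bands=8):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.rows = num_perm // bands
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        # Pair each hash table with the slice of the signature it is responsible for.
        for i, table in enumerate(self.tables):
            start = i * self.rows
            yield table, tuple(signature[start:start + self.rows])

    def insert(self, key, signature):
        for table, band in self._band_keys(signature):
            table[band].add(key)

    def query(self, signature):
        """Return keys that share at least one full band with the query signature."""
        candidates = set()
        for table, band in self._band_keys(signature):
            candidates |= table.get(band, set())
        return candidates


docs = {
    "a": {"minhash", "lsh", "jaccard", "index"},
    "b": {"redis", "cassandra", "storage"},
}
index = BandedIndex()
for key, tokens in docs.items():
    index.insert(key, minhash_signature(tokens))

result = index.query(minhash_signature({"minhash", "lsh", "jaccard", "index"}))
print(result)
```

Using more rows per band makes a match stricter, while using more bands gives each pair more chances to collide, so the two settings together control the effective similarity threshold. Candidates returned this way are approximate and are usually re-checked against the sketches themselves.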

datasketch requires Python 2.7 or above and NumPy 1.11 or above. SciPy is optional, but with it LSH initialization can be much faster.

Note that MinHash LSH and MinHash LSH Ensemble also support Redis and Cassandra as storage layers (see MinHash LSH at Scale).
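
As a rough sketch of what selecting the Redis storage layer can look like, the snippet below passes a `storage_config` dictionary to `MinHashLSH`. The host, port, threshold, and number of permutations are placeholder assumptions, the helper name `build_redis_backed_lsh` is hypothetical, and the exact configuration keys should be checked against the MinHash LSH at Scale documentation for your installed version. Calling the helper needs both the Redis extra and a reachable Redis server.

```python
# Placeholder connection settings; adjust to your own Redis deployment.
REDIS_STORAGE = {
    "type": "redis",
    "redis": {"host": "localhost", "port": 6379},
}


def build_redis_backed_lsh(threshold=0.5, num_perm=128):
    """Create a MinHash LSH index that keeps its hash tables in Redis.

    The import is done lazily so this module can be loaded even when
    datasketch is not installed.
    """
    from datasketch import MinHashLSH

    return MinHashLSH(
        threshold=threshold,
        num_perm=num_perm,
        storage_config=REDIS_STORAGE,
    )
```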

Install

To install datasketch using pip:

pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

pip install datasketch[redis]

To install with the Cassandra dependency:

pip install datasketch[cassandra]

To install with SciPy for faster MinHashLSH initialization:

pip install datasketch[scipy]
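
After installing, one quick way to see which of the optional pieces are available is to check whether the corresponding modules can be found. This is a small sketch using only the standard library. The mapping of extras to import names (`redis`, `cassandra`, `scipy`) is an assumption based on the usual module names of those packages, and `installed_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Label -> top-level module name to look for (assumed import names).
MODULES = {
    "datasketch": "datasketch",
    "numpy": "numpy",
    "scipy (optional)": "scipy",
    "redis (optional)": "redis",
    "cassandra (optional)": "cassandra",
}


def installed_modules():
    """Report, for each label, whether its module can be located for import."""
    return {label: find_spec(module) is not None for label, module in MODULES.items()}


for label, available in installed_modules().items():
    print(f"{label}: {'found' if available else 'missing'}")
```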

Project details


This version

1.5.4

Download files

Source Distribution

datasketch-1.5.4.tar.gz (76.2 kB view details)

Uploaded Source

Built Distribution

datasketch-1.5.4-py2.py3-none-any.whl (74.1 kB view details)

Uploaded Python 2, Python 3

File details

Details for the file datasketch-1.5.4.tar.gz.

File metadata

  • Download URL: datasketch-1.5.4.tar.gz
  • Upload date:
  • Size: 76.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.6

File hashes

Hashes for datasketch-1.5.4.tar.gz

Algorithm     Hash digest
SHA256        b38294abd66ba97c533a5bfadab32a8ed7b32df5b34a867e8f422081abcb33ed
MD5           6bd1dc04bf51706492ed3fdb581eaea1
BLAKE2b-256   5d9901fae87541fca05f14ce71d3b2a09f3263d3e27fc93055a907c43ea990db

File details

Details for the file datasketch-1.5.4-py2.py3-none-any.whl.

File metadata

  • Download URL: datasketch-1.5.4-py2.py3-none-any.whl
  • Upload date:
  • Size: 74.1 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.0 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.8.6

File hashes

Hashes for datasketch-1.5.4-py2.py3-none-any.whl

Algorithm     Hash digest
SHA256        b2c6df2a880f41c04293051694f20d96f59dad6406363ef97bd403b673d7464e
MD5           9fdd1cb428af1782237056710ac1cef9
BLAKE2b-256   80b8f0eff4ca56fa32a3736c69594e7a01adf2fa0487ce33147a50e7077e991c
