Skip to main content

Probabilistic data structures for processing and searching very large datasets

Project description

https://github.com/ekzhu/datasketch/workflows/Python%20package/badge.svg https://zenodo.org/badge/DOI/10.5281/zenodo.290602.svg

datasketch gives you probabilistic data structures that can process and search very large amount of data super fast, with little loss of accuracy.

This package contains the following data sketches:

Data Sketch Usage
MinHash estimate Jaccard similarity and cardinality
Weighted MinHash estimate weighted Jaccard similarity
HyperLogLog estimate cardinality
HyperLogLog++ estimate cardinality

The following indexes for data sketches are provided to support sub-linear query time:

Index For Data Sketch Supported Query Type
MinHash LSH MinHash, Weighted MinHash Jaccard Threshold
MinHash LSH Forest MinHash, Weighted MinHash Jaccard Top-K
MinHash LSH Ensemble MinHash Containment Threshold

datasketch must be used with Python 2.7 or above, NumPy 1.11 or above, and Scipy.

Note that MinHash LSH and MinHash LSH Ensemble also support Redis and Cassandra storage layer (see MinHash LSH at Scale).

Install

To install datasketch using pip:

pip install datasketch

This will also install NumPy as dependency.

To install with Redis dependency:

pip install datasketch[redis]

To install with Cassandra dependency:

pip install datasketch[cassandra]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for datasketch, version 1.5.6
Filename, size File type Python version Upload date Hashes
Filename, size datasketch-1.5.6.tar.gz (77.9 kB) File type Source Python version None Upload date Hashes View
Filename, size datasketch-1.5.6-py2.py3-none-any.whl (75.6 kB) File type Wheel Python version py2.py3 Upload date Hashes View

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Huawei Huawei PSF Sponsor Microsoft Microsoft PSF Sponsor NVIDIA NVIDIA PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page