
Probabilistic data structures for processing and searching very large datasets

Project description


datasketch gives you probabilistic data structures that can process and search very large amounts of data quickly, with little loss of accuracy.

This package contains the following data sketches:

Data Sketch         Usage
MinHash             estimate Jaccard similarity and cardinality
Weighted MinHash    estimate weighted Jaccard similarity
HyperLogLog         estimate cardinality
HyperLogLog++       estimate cardinality

The following indexes for data sketches are provided to support sub-linear query time:

Index                   For Data Sketch              Supported Query Type
MinHash LSH             MinHash, Weighted MinHash    Jaccard Threshold
MinHash LSH Forest      MinHash, Weighted MinHash    Jaccard Top-K
MinHash LSH Ensemble    MinHash                      Containment Threshold
HNSW                    Any                          Custom Metric Top-K

datasketch requires Python 3.7 or above, NumPy 1.11 or above, and SciPy.

Note that MinHash LSH and MinHash LSH Ensemble also support Redis and Cassandra as storage layers (see MinHash LSH at Scale).

Install

To install datasketch using pip:

pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

pip install datasketch[redis]

To install with the Cassandra dependency:

pip install datasketch[cassandra]


This version: 1.6.5

