Skip to main content

Distributed quantile sketches

Project description

ddsketch

This repo contains the Python implementation of the distributed quantile sketch algorithm DDSketch [1]. DDSketch has relative-error guarantees for any quantile q in [0, 1]. That is if the true value of the qth-quantile is x then DDSketch returns a value y such that |x-y| / x < e where e is the relative error parameter. (The default here is set to 0.01.) DDSketch is also fully mergeable, meaning that multiple sketches from distributed systems can be combined in a central node.

Our default implementation, DDSketch, is guaranteed [1] to not grow too large in size for any data that can be described by a distribution whose tails are sub-exponential.

We also provide implementations (LogCollapsingLowestDenseDDSketch and LogCollapsingHighestDenseDDSketch) where the q-quantile will be accurate up to the specified relative error for q that is not too small (or large). Concretely, the q-quantile will be accurate up to the specified relative error as long as it belongs to one of the m bins kept by the sketch. If the data is time in seconds, the default of m = 2048 covers 80 microseconds to 1 year.

Installation

To install this package, run pip install ddsketch, or clone the repo and run python setup.py install. This package depends on numpy and protobuf. (The protobuf dependency can be removed if it's not applicable.)

Usage

from ddsketch import DDSketch

sketch = DDSketch()

Add values to the sketch

import numpy as np

values = np.random.normal(size=500)
for v in values:
  sketch.add(v)

Find the quantiles of values to within the relative error.

quantiles = [sketch.get_quantile_value(q) for q in [0.5, 0.75, 0.9, 1]]

Merge another DDSketch into sketch.

another_sketch = DDSketch()
other_values = np.random.normal(size=500)
for v in other_values:
  another_sketch.add(v)
sketch.merge(another_sketch)

The quantiles of values concatenated with other_values are still accurate to within the relative error.

Development

To work on ddsketch a Python interpreter must be installed. It is recommended to use the provided development container (requires docker) which includes all the required Python interpreters.

docker-compose run dev

Or, if developing outside of docker then it is recommended to use a virtual environment:

pip install virtualenv
virtualenv --python=3 .venv
source .venv/bin/activate

Testing

To run the tests install riot:

pip install riot

Replace the Python version with the interpreter(s) available.

# Run tests with Python 3.9
riot run -p3.9 test

Release notes

New features, bug fixes, deprecations and other breaking changes must have release notes included.

To generate a release note for the change:

riot run reno new <short-description-of-change-no-spaces>

Edit the generated file to include notes on the changes made in the commit/PR and add commit it.

Formatting

Format code with

riot run fmt

Type-checking

Type checking is done with mypy:

riot run mypy

Type-checking

Lint the code with flake8:

riot run flake8

Protobuf

The protobuf is stored in the go repository: https://github.com/DataDog/sketches-go/blob/master/ddsketch/pb/ddsketch.proto

Install the minimum required protoc and generate the Python code:

docker run -v $PWD:/code -it ubuntu:18.04 /bin/bash
apt update && apt install protobuf-compiler  # default is 3.0.0
protoc --proto_path=ddsketch/pb/ --python_out=ddsketch/pb/ ddsketch/pb/ddsketch.proto

Releasing

  1. Generate the release notes and use pandoc to format them for Github:
    git checkout master && git pull
    riot run -s reno report --no-show-source | pandoc -f rst -t gfm --wrap=none

Copy the output into a new release: https://github.com/DataDog/sketches-py/releases/new.

  1. Enter a tag for the release (following semver) (eg. v1.1.3, v1.0.3, v1.2.0).
  2. Use the tag without the v as the title.
  3. Save the release as a draft and pass the link to someone else to give a quick review.
  4. If all looks good hit publish

References

[1] Charles Masson and Jee E Rim and Homin K. Lee. DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. PVLDB, 12(12): 2195-2205, 2019. (The code referenced in the paper, including our implementation of the the Greenwald-Khanna (GK) algorithm, can be found at: https://github.com/DataDog/sketches-py/releases/tag/v0.1 )

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ddsketch-3.0.1.tar.gz (30.0 kB view details)

Uploaded Source

Built Distribution

ddsketch-3.0.1-py3-none-any.whl (19.1 kB view details)

Uploaded Python 3

File details

Details for the file ddsketch-3.0.1.tar.gz.

File metadata

  • Download URL: ddsketch-3.0.1.tar.gz
  • Upload date:
  • Size: 30.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.19

File hashes

Hashes for ddsketch-3.0.1.tar.gz
Algorithm Hash digest
SHA256 aa8f20b2965e61731ca4fee2ca9c209f397f5bbb23f9d192ec8bd7a2f5bd9824
MD5 29a5915967ceb6a80fcc79f5d4e8553b
BLAKE2b-256 b8c725f300ba359c7e723180ce962a30e1f820c3990e3f3e8bbed16ae9387cab

See more details on using hashes here.

File details

Details for the file ddsketch-3.0.1-py3-none-any.whl.

File metadata

  • Download URL: ddsketch-3.0.1-py3-none-any.whl
  • Upload date:
  • Size: 19.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.9.19

File hashes

Hashes for ddsketch-3.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 6d047b455fe2837c43d366ff1ae6ba0c3166e15499de8688437a75cea914224e
MD5 608dd612d2be0deac714a664610772a7
BLAKE2b-256 acdac821e4958c8df43ded1a92aaca678d89ec8b7a4df5bb561ef25354be1912

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page