
A library to compute histograms on distributed environments, on streaming data



DistoGram is a library that computes histograms on streaming data, in distributed environments. The implementation follows the algorithms described in Ben-Haim’s Streaming Parallel Decision Trees.
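To give an intuition of the approach, here is a minimal pure-Python sketch of a fixed-size streaming histogram in the spirit of that paper. It is not DistoGram's actual implementation: the names `update`, `merge`, `_shrink` and `max_bins` are chosen for illustration only, and details such as quantile estimation are left out. Each bin is a `[centroid, count]` pair; when there are too many bins, the two closest ones are fused into their weighted average.

```python
import random
from bisect import insort


def _shrink(bins, max_bins):
    """Fuse the closest adjacent bins until at most max_bins remain."""
    while len(bins) > max_bins:
        # Index of the left bin of the closest adjacent pair.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (c1, n1), (c2, n2) = bins[i], bins[i + 1]
        total = n1 + n2
        # The fused centroid is the count-weighted mean of the two centroids.
        bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / total, total]]


def update(bins, value, max_bins=100):
    """Add one value to a sorted list of [centroid, count] bins, in place."""
    insort(bins, [float(value), 1])
    _shrink(bins, max_bins)
    return bins


def merge(bins_a, bins_b, max_bins=100):
    """Combine two histograms, e.g. built on two different nodes."""
    bins = sorted([c, n] for c, n in bins_a + bins_b)
    _shrink(bins, max_bins)
    return bins


# Two workers each summarize their own part of the stream...
rng = random.Random(0)
node_a, node_b = [], []
for _ in range(5000):
    update(node_a, rng.gauss(0, 1), max_bins=32)
    update(node_b, rng.gauss(0, 1), max_bins=32)

# ...and the summaries are combined without revisiting the raw data.
combined = merge(node_a, node_b, max_bins=32)
total = sum(n for _, n in combined)
mean = sum(c * n for c, n in combined) / total
print("count: {}".format(total))
print("mean: {}".format(mean))
```

Memory stays bounded by `max_bins` regardless of the stream length. Since two summaries can be merged into one of the same size, this kind of structure fits both streaming and distributed settings.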

Get Started

First create a compressed representation of a distribution:

import numpy as np
import distogram

distribution = np.random.normal(size=10000)

# Create a distogram and feed it from the distribution.
# In real usage, the data comes from an event stream.
h = distogram.Distogram()
for i in distribution:
    h = distogram.update(h, i)

Compute statistics on the distribution:

nmin, nmax = distogram.bounds(h)
print("count: {}".format(distogram.count(h)))
print("mean: {}".format(distogram.mean(h)))
print("stddev: {}".format(distogram.stddev(h)))
print("min: {}".format(nmin))
print("5%: {}".format(distogram.quantile(h, 0.05)))
print("25%: {}".format(distogram.quantile(h, 0.25)))
print("50%: {}".format(distogram.quantile(h, 0.50)))
print("75%: {}".format(distogram.quantile(h, 0.75)))
print("95%: {}".format(distogram.quantile(h, 0.95)))
print("max: {}".format(nmax))
count: 10000
mean: -0.005082954640481095
stddev: 1.0028524290149186
min: -3.5691130319855047
5%: -1.6597242392338374
25%: -0.6785107421744653
50%: -0.008672960012168916
75%: 0.6720718926935414
95%: 1.6476822301131866
max: 3.8800560034877427

Compute and display the histogram of the distribution:

import pandas as pd
import plotly.express as px

hist = distogram.histogram(h)
df_hist = pd.DataFrame(np.array(hist), columns=["bin", "count"])
fig = px.bar(df_hist, x="bin", y="count", title="distogram")
fig.update_layout(height=300)
fig.show()
[Figure: histogram of the normal distribution (docs/normal_histogram.png)]

Install

DistoGram is available on PyPI and can be installed with pip:

pip install distogram

Play With Me

You can test this library directly on this live notebook.

Performances

Distogram is designed for fast updates when using native Python types. The following numbers show the results of the benchmark program located in the examples. It was run on a modest 2014 13” MacBook Pro.

Interpreter   Operation   Numpy   Req/s
pypy 7.3      update      no      1290971
pypy 7.3      update      yes     27775
CPython 3.7   update      no      78809
CPython 3.7   update      yes     56906

As you can see, you are encouraged to use PyPy with native Python types. PyPy’s JIT is penalised by numpy native types, causing a huge performance hit. Moreover, even with CPython, the streaming philosophy of Distogram is better suited to native Python types, whereas numpy is optimized for batch computations.
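If you want to take a comparable measurement on your own machine, a throughput figure in requests per second can be obtained with a small timing loop such as the one below. This is only a sketch and not the benchmark program from the examples: `updates_per_second` and `running_sum` are hypothetical names, and `running_sum` is a stand-in so that the snippet runs without any dependency.

```python
import time


def updates_per_second(update, state, values):
    """Return how many update calls per second the given callable sustains."""
    start = time.perf_counter()
    for v in values:
        state = update(state, v)
    elapsed = time.perf_counter() - start
    return len(values) / elapsed


# Stand-in update function (hypothetical). With the library installed,
# pass distogram.update and distogram.Distogram() instead.
def running_sum(state, value):
    return state + value


# Native Python floats, not numpy scalars.
values = [float(i) for i in range(100000)]
rate = updates_per_second(running_sum, 0.0, values)
print("{:.0f} req/s".format(rate))
```

When the data starts out as a numpy array, calling `tolist()` on it yields native Python floats, which is the kind of input the numbers above favour.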

Credits

Although this code has been written by following the aforementioned research paper, some parts are also inspired by the implementation from Carson Farmer.

