A library to compute histograms on distributed environments, on streaming data

==========
DistoGram
==========


.. image:: https://badge.fury.io/py/distogram.svg
    :target: https://badge.fury.io/py/distogram

.. image:: https://github.com/maki-nage/distogram/workflows/Python%20package/badge.svg
    :target: https://github.com/maki-nage/distogram/actions?query=workflow%3A%22Python+package%22
    :alt: Github WorkFlows

.. image:: https://img.shields.io/codecov/c/github/maki-nage/distogram?style=plastic&color=brightgreen&logo=codecov&style=for-the-badge
    :target: https://codecov.io/gh/maki-nage/distogram
    :alt: Coverage

.. image:: https://readthedocs.org/projects/distogram/badge/?version=latest
    :target: https://distogram.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://mybinder.org/badge_logo.svg
    :target: https://mybinder.org/v2/gh/maki-nage/distogram/master?urlpath=notebooks%2Fexamples%2Fdistogram.ipynb


DistoGram is a library that computes histograms on streaming data, in
distributed environments. The implementation follows the algorithms described in
Ben-Haim and Tom-Tov's `Streaming Parallel Decision Trees
<http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf>`__.
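
To give an intuition of the approach, here is a minimal pure-Python sketch of
the update and merge procedures from the paper. It is not DistoGram's actual
implementation: the ``sketch_update``, ``sketch_merge`` and ``_trim`` names, the
``max_bins`` parameter and the list-of-tuples representation are assumptions
made for illustration only.

.. code:: python

    from bisect import insort


    def _trim(bins, max_bins):
        """Merge the two closest adjacent bins until at most max_bins remain."""
        while len(bins) > max_bins:
            # Index of the adjacent pair with the smallest centroid gap.
            i = min(range(len(bins) - 1),
                    key=lambda k: bins[k + 1][0] - bins[k][0])
            (p1, m1), (p2, m2) = bins[i], bins[i + 1]
            # Replace the pair with one bin at their weighted average.
            bins[i:i + 2] = [((p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2)]


    def sketch_update(bins, value, max_bins=100):
        """Add one value to a sorted list of (centroid, count) bins."""
        insort(bins, (float(value), 1))
        _trim(bins, max_bins)
        return bins


    def sketch_merge(bins_a, bins_b, max_bins=100):
        """Combine two summaries, e.g. ones computed on different workers."""
        merged = sorted(bins_a + bins_b)
        _trim(merged, max_bins)
        return merged


    bins = []
    for v in [1, 2, 3, 10, 11]:
        sketch_update(bins, v, max_bins=3)
    print(bins)

Because merging two bins keeps their combined count and weighted centroid, the
total count and the mean of the data are preserved however many bins are
merged. Only the shape of the distribution is approximated.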

Get Started
============

First create a compressed representation of a distribution:

.. code:: python

    import numpy as np
    import distogram

    distribution = np.random.normal(size=10000)

    # Create a distogram and feed it from the distribution.
    # In real usage, the data would come from an event stream.
    h = distogram.Distogram()
    for i in distribution:
        h = distogram.update(h, i)


Compute statistics on the distribution:

.. code:: python

    nmin, nmax = distogram.bounds(h)
    print("count: {}".format(distogram.count(h)))
    print("mean: {}".format(distogram.mean(h)))
    print("stddev: {}".format(distogram.stddev(h)))
    print("min: {}".format(nmin))
    print("5%: {}".format(distogram.quantile(h, 0.05)))
    print("25%: {}".format(distogram.quantile(h, 0.25)))
    print("50%: {}".format(distogram.quantile(h, 0.50)))
    print("75%: {}".format(distogram.quantile(h, 0.75)))
    print("95%: {}".format(distogram.quantile(h, 0.95)))
    print("max: {}".format(nmax))


.. code:: console

    count: 10000
    mean: -0.005082954640481095
    stddev: 1.0028524290149186
    min: -3.5691130319855047
    5%: -1.6597242392338374
    25%: -0.6785107421744653
    50%: -0.008672960012168916
    75%: 0.6720718926935414
    95%: 1.6476822301131866
    max: 3.8800560034877427
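
As a rough illustration of how such statistics can be derived from a compressed
summary, the sketch below computes a mean and a standard deviation from a list
of ``(centroid, count)`` bins. The bin representation and the ``summary_mean``
and ``summary_stddev`` helpers are hypothetical and are not part of the
DistoGram API. The library may compute these values differently.

.. code:: python

    import math


    def summary_mean(bins):
        """Weighted mean of the bin centroids."""
        total = sum(count for _, count in bins)
        return sum(centroid * count for centroid, count in bins) / total


    def summary_stddev(bins):
        """Standard deviation estimated from centroids only.

        The spread inside each bin is ignored, so this tends to
        underestimate the true value when bins are wide.
        """
        total = sum(count for _, count in bins)
        mean = summary_mean(bins)
        variance = sum(count * (centroid - mean) ** 2
                       for centroid, count in bins) / total
        return math.sqrt(variance)


    # Three bins summarizing five observations.
    example_bins = [(1.5, 2), (3.0, 1), (10.5, 2)]
    print("mean: {}".format(summary_mean(example_bins)))
    print("stddev: {}".format(summary_stddev(example_bins)))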

Compute and display the histogram of the distribution:

.. code:: python

    import pandas as pd
    import plotly.express as px

    hist = distogram.histogram(h)
    df_hist = pd.DataFrame(np.array(hist), columns=["bin", "count"])
    fig = px.bar(df_hist, x="bin", y="count", title="distogram")
    fig.update_layout(height=300)
    fig.show()

.. image:: docs/normal_histogram.png
    :scale: 60%
    :align: center

Install
========

DistoGram is available on PyPI and can be installed with pip:

.. code:: console

    pip install distogram


Play With Me
============

You can test this library directly on this
`live notebook <https://mybinder.org/v2/gh/maki-nage/distogram/master?urlpath=notebooks%2Fexamples%2Fdistogram.ipynb>`__.

Credits
========

Although this code has been written by following the aforementioned research
paper, some parts are also inspired by the implementation from
`Carson Farmer <https://github.com/carsonfarmer/streamhist>`__.
