A library to compute histograms on distributed environments, on streaming data
==========
DistoGram
==========

.. image:: https://badge.fury.io/py/distogram.svg
    :target: https://badge.fury.io/py/distogram

.. image:: https://github.com/maki-nage/distogram/workflows/Python%20package/badge.svg
    :target: https://github.com/maki-nage/distogram/actions?query=workflow%3A%22Python+package%22
    :alt: Github WorkFlows

.. image:: https://img.shields.io/codecov/c/github/maki-nage/distogram?style=plastic&color=brightgreen&logo=codecov&style=for-the-badge
    :target: https://codecov.io/gh/maki-nage/distogram
    :alt: Coverage

.. image:: https://readthedocs.org/projects/distogram/badge/?version=latest
    :target: https://distogram.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://mybinder.org/badge_logo.svg
    :target: https://mybinder.org/v2/gh/maki-nage/distogram/master?urlpath=notebooks%2Fexamples%2Fdistogram.ipynb

DistoGram is a library that computes histograms on streaming data, in
distributed environments. The implementation follows the algorithms described in
Ben-Haim's `Streaming Parallel Decision Trees
<http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf>`__.
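
To give an intuition of the approach, here is a minimal pure-Python sketch of
the update and merge procedures described in that paper. It is not DistoGram's
actual code: the names ``update``, ``merge``, ``_shrink`` and ``max_bins`` are
hypothetical and chosen for illustration only. The histogram is kept as a
sorted list of ``[centroid, count]`` bins, and whenever it grows past
``max_bins`` the two closest bins are replaced by their weighted average.

```python
from bisect import insort


def _shrink(bins, max_bins):
    """Merge the two closest adjacent bins until at most max_bins remain."""
    while len(bins) > max_bins:
        # Index of the adjacent pair whose centroids are closest together
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (p1, c1), (p2, c2) = bins[i], bins[i + 1]
        # Replace the pair with one bin at the count-weighted mean centroid
        bins[i:i + 2] = [[(p1 * c1 + p2 * c2) / (c1 + c2), c1 + c2]]


def update(bins, value, max_bins):
    """Add a single value to the histogram (in place) and return it."""
    insort(bins, [value, 1])
    _shrink(bins, max_bins)
    return bins


def merge(bins_a, bins_b, max_bins):
    """Combine two histograms, e.g. computed on two different workers."""
    merged = sorted([list(b) for b in bins_a] + [list(b) for b in bins_b])
    _shrink(merged, max_bins)
    return merged


# Feed a few values, keeping at most 3 bins
bins = []
for x in [1.0, 2.0, 10.0, 11.0, 2.5]:
    update(bins, x, max_bins=3)

# Combine with a histogram built elsewhere
combined = merge(bins, [[10.5, 2]], max_bins=3)
```

Because each bin only stores a centroid and a count, the memory footprint is
bounded by ``max_bins`` regardless of how many values are seen, and two
histograms can be combined without access to the original data. This is what
makes the representation suitable for streaming and distributed use.
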
Get Started
============
First, create a compressed representation of a distribution:

.. code:: python

    import numpy as np
    import distogram

    distribution = np.random.normal(size=10000)

    # Create a distogram and feed it from the distribution.
    # In real usage, the data comes from an event stream.
    h = distogram.Distogram()
    for i in distribution:
        h = distogram.update(h, i)

Compute statistics on the distribution:

.. code:: python

    nmin, nmax = distogram.bounds(h)
    print("count: {}".format(distogram.count(h)))
    print("mean: {}".format(distogram.mean(h)))
    print("stddev: {}".format(distogram.stddev(h)))
    print("min: {}".format(nmin))
    print("5%: {}".format(distogram.quantile(h, 0.05)))
    print("25%: {}".format(distogram.quantile(h, 0.25)))
    print("50%: {}".format(distogram.quantile(h, 0.50)))
    print("75%: {}".format(distogram.quantile(h, 0.75)))
    print("95%: {}".format(distogram.quantile(h, 0.95)))
    print("max: {}".format(nmax))

.. code:: console

    count: 10000
    mean: -0.005082954640481095
    stddev: 1.0028524290149186
    min: -3.5691130319855047
    5%: -1.6597242392338374
    25%: -0.6785107421744653
    50%: -0.008672960012168916
    75%: 0.6720718926935414
    95%: 1.6476822301131866
    max: 3.8800560034877427

Compute and display the histogram of the distribution:

.. code:: python

    # pandas and plotly are only needed to display the result
    import pandas as pd
    import plotly.express as px

    hist = distogram.histogram(h)
    df_hist = pd.DataFrame(np.array(hist), columns=["bin", "count"])
    fig = px.bar(df_hist, x="bin", y="count", title="distogram")
    fig.update_layout(height=300)
    fig.show()

.. image:: docs/normal_histogram.png
    :scale: 60%
    :align: center

Install
========
DistoGram is available on PyPI and can be installed with pip:

.. code:: console

    pip install distogram

Play With Me
============
You can test this library directly on this
`live notebook <https://mybinder.org/v2/gh/maki-nage/distogram/master?urlpath=notebooks%2Fexamples%2Fdistogram.ipynb>`__.
Credits
========
Although this code has been written by following the aforementioned research
paper, some parts are also inspired by the implementation from
`Carson Farmer <https://github.com/carsonfarmer/streamhist>`__.