distogram·PyPI

A library to compute histograms on streaming data, in distributed environments

Project description

DistoGram is a library for computing histograms on streaming data, in distributed environments. The implementation follows the algorithms described in Ben-Haim’s Streaming Parallel Decision Trees paper.
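
To give a feel for the kind of streaming update this family of algorithms relies on, here is a minimal sketch. It is not distogram's actual code: the `update` function, the `[centroid, count]` list layout, and the `max_bins` parameter are assumptions made for illustration. Each new value becomes a bin of its own, and when the bin budget is exceeded the two closest neighbours are replaced by their weighted average.

```python
from bisect import insort


def update(bins, value, max_bins=100):
    """Add one value to a sorted list of [centroid, count] bins (illustrative sketch)."""
    # Every incoming value starts as its own bin with a count of 1.
    insort(bins, [float(value), 1])
    if len(bins) > max_bins:
        # Find the adjacent pair of bins whose centroids are closest.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (p1, c1), (p2, c2) = bins[i], bins[i + 1]
        total = c1 + c2
        # Replace the pair with a single bin at the weighted mean of the centroids.
        bins[i:i + 2] = [[(p1 * c1 + p2 * c2) / total, total]]
    return bins


bins = []
for x in [1.0, 2.0, 10.0, 11.0, 2.5]:
    update(bins, x, max_bins=3)
print(bins)
```

Memory stays bounded by `max_bins` however many values are streamed in, which is what makes the representation "compressed".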

Get Started

First create a compressed representation of a distribution:

import numpy as np
import distogram

distribution = np.random.normal(size=10000)

# Create a distogram and feed it with the distribution.
# In real usage, the data would come from an event stream.
h = distogram.Distogram()
for i in distribution:
    h = distogram.update(h, i)
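
In a distributed setting, each worker would build its own compressed representation as above, and the partial results would then be combined. The sketch below shows one way such a combination step can work on plain `[centroid, count]` bins. The `merge` function and its `max_bins` parameter are hypothetical names for this illustration, not distogram's API.

```python
def merge(bins_a, bins_b, max_bins=100):
    """Combine two lists of [centroid, count] bins into one (illustrative sketch)."""
    # Copy the bins so the inputs are left untouched, then sort by centroid.
    merged = sorted([list(b) for b in bins_a] + [list(b) for b in bins_b])
    while len(merged) > max_bins:
        # Repeatedly collapse the two closest adjacent bins.
        i = min(range(len(merged) - 1), key=lambda k: merged[k + 1][0] - merged[k][0])
        (p1, c1), (p2, c2) = merged[i], merged[i + 1]
        total = c1 + c2
        merged[i:i + 2] = [[(p1 * c1 + p2 * c2) / total, total]]
    return merged


worker_a = [[1.0, 1], [2.0, 1]]
worker_b = [[1.5, 2], [10.0, 1]]
combined = merge(worker_a, worker_b, max_bins=3)
print(combined)
```

The total count is preserved by the merge, so the combined result summarizes all values seen by both workers.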

Compute statistics on the distribution:

nmin, nmax = distogram.bounds(h)
print("count: {}".format(distogram.count(h)))
print("mean: {}".format(distogram.mean(h)))
print("stddev: {}".format(distogram.stddev(h)))
print("min: {}".format(nmin))
print("5%: {}".format(distogram.quantile(h, 0.05)))
print("25%: {}".format(distogram.quantile(h, 0.25)))
print("50%: {}".format(distogram.quantile(h, 0.50)))
print("75%: {}".format(distogram.quantile(h, 0.75)))
print("95%: {}".format(distogram.quantile(h, 0.95)))
print("max: {}".format(nmax))

count: 10000
mean: -0.005082954640481095
stddev: 1.0028524290149186
min: -3.5691130319855047
5%: -1.6597242392338374
25%: -0.6785107421744653
50%: -0.008672960012168916
75%: 0.6720718926935414
95%: 1.6476822301131866
max: 3.8800560034877427
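
These statistics are derived from the compressed representation rather than from the raw samples. As an illustration only, and not necessarily how distogram computes them, the sketch below derives the count and mean from a list of `(centroid, count)` bins. The bin layout and the `count_and_mean` helper are assumptions. If bins are only ever merged by weighted average, the weighted sum of centroids is preserved, so the mean obtained this way matches the mean of the original values up to floating-point error.

```python
def count_and_mean(bins):
    """Return (count, mean) for a list of (centroid, count) bins (illustrative sketch)."""
    total = sum(c for _, c in bins)
    if total == 0:
        raise ValueError("empty histogram")
    # Weighted average of the centroids, using bin counts as weights.
    mean = sum(p * c for p, c in bins) / total
    return total, mean


# The values 1.0, 2.0 and 10.0, with the first two collapsed into one bin at 1.5.
n, mean = count_and_mean([(1.5, 2), (10.0, 1)])
print(n, mean)
```

Quantiles and the standard deviation cannot be recovered exactly from bins like these, which is why the reported values are approximations.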

Compute and display the histogram of the distribution:

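import pandas as pd
import plotly.express as px

# Convert the (bin, count) pairs into a DataFrame and plot them as a bar chart
hist = distogram.histogram(h)
df_hist = pd.DataFrame(np.array(hist), columns=["bin", "count"])
fig = px.bar(df_hist, x="bin", y="count", title="distogram")
fig.update_layout(height=300)
fig.show()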

Install

DistoGram is available on PyPI and can be installed with pip:

pip install distogram

Play With Me

You can test this library directly on this live notebook.

Performances

Distogram is designed for fast updates when using native Python types. The following numbers are the results of the benchmark program located in the examples.

On an Intel i7-9800X CPU, the results are:

Interpreter	Operation	Numpy	Req/s
pypy 7.3	update	no	6563311
pypy 7.3	update	yes	111318
CPython 3.7	update	no	436709
CPython 3.7	update	yes	251603

On a modest 2014 13” MacBook Pro, the results are:

Interpreter	Operation	Numpy	Req/s
pypy 7.3	update	no	3572436
pypy 7.3	update	yes	37630
CPython 3.7	update	no	112749
CPython 3.7	update	yes	81005

As the numbers show, you are encouraged to use PyPy with native Python types. PyPy’s JIT is penalized by NumPy types, which causes a large performance hit. Moreover, the streaming philosophy of Distogram is better suited to native Python types, while NumPy is optimized for batch computations; this holds even with CPython.
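
To get comparable numbers on your own machine, a small timing harness along the following lines can be used. This is a sketch, not the benchmark program shipped in the examples: the `updates_per_second` helper is hypothetical, and the stand-in update function only keeps the block runnable without any third-party package. With distogram installed, `distogram.update` and `distogram.Distogram()` from the Get Started section would take the place of the stand-in.

```python
import time


def updates_per_second(update, state, values):
    """Apply update(state, value) to every value and return the achieved rate."""
    start = time.perf_counter()
    for v in values:
        state = update(state, v)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(values) / elapsed if elapsed > 0 else float("inf")


# Stand-in update: a running sum over native Python floats.
values = [float(i) for i in range(100_000)]
rate = updates_per_second(lambda s, v: s + v, 0.0, values)
print("req/s: {:.0f}".format(rate))
```

Feeding the same harness once with native Python floats and once with NumPy scalars is a simple way to observe the difference described above.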

Credits

Although this code has been written by following the aforementioned research paper, some parts are also inspired by the implementation from Carson Farmer.

Thanks to John Belmonte for his help with performance and accuracy improvements.

Release history

Version	Release date
3.0.3	Mar 10, 2025
3.0.0	Feb 5, 2022
2.0.0	Aug 25, 2021
1.6.0	Jun 21, 2020
1.5.1	Jun 7, 2020
1.5.0	Jun 5, 2020
1.4.0	Jun 4, 2020
1.3.0	Jun 3, 2020
1.2.0	May 13, 2020
1.1.0	May 10, 2020
1.0.0	May 9, 2020

Download files

Source Distribution

distogram-3.0.3.tar.gz (8.9 kB)

Uploaded Mar 10, 2025 Source

File details

Details for the file distogram-3.0.3.tar.gz.

File metadata

Download URL: distogram-3.0.3.tar.gz
Upload date: Mar 10, 2025
Size: 8.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for distogram-3.0.3.tar.gz
Algorithm	Hash digest
SHA256	`5ccedf0c4d01f6373448e4570f9d994f1de92c1384fe47ce78a95c94c3919fac`
MD5	`006e63f07266bca2df86d00e0d5e1785`
BLAKE2b-256	`2c380e2f615dce6ed4168782ad84ac3373d80c43b47465ecfd193318f05c0a68`
