A smarter histogram

Project description

shmistogram

Shmistograms are generalizations of histograms that emphasize

- singular modalities (i.e. point masses), when they exist, as well as null values
- accurate estimation, viewing the histogram as a 1-D density estimator
- removing the usual constraint that all bins be of the same width, allowing visualization of higher fidelity with fewer bins

The package also includes a variety of 1-D data summarization tools.

Example

Let's simulate draws from a triangular distribution (the 'crowd'), supplemented with a couple of mode points ('loners'), and some null values:

from matplotlib import pyplot as plt
import numpy as np
import shmistogram as sh

# Simulate a triangular distribution mixed with a few point masses and null values
rng = np.random.default_rng(seed=1)
crowd = rng.triangular(-10, -10, 70, size=500)
loners = np.array([0]*40 + [42]*20)
null = np.array([np.nan]*100)
data = np.concatenate((crowd, loners, null))

fig, axes = plt.subplots(1, 2)

# Build a standard histogram with matplotlib.pyplot.hist defaults
sh.plot.standard_histogram(data[~np.isnan(data)], ax=axes[0], name='mixed data')

# Build a shmistogram
shm = sh.Shmistogram(data)
shm.plot(ax=axes[1], name='mixed data')

fig.tight_layout()

The histogram obscures the point masses somewhat and says nothing about missing values. By contrast, the shmistogram uses red line segments to emphasize the point masses, and the legend bar highlights the relative portions of the data in the crowd versus the point masses versus the null values.

Installation

We're on PyPI, so pip install shmistogram.

Test your installation by running demo.py.

Consider using the simplest-possible virtual environment if working directly on this repo.

Details

Default behavior

Given a 1-D array data of numeric (or np.nan) values, shmistogram.Shmistogram(data)

- counts every unique value
- splits the data into as many as 3 subsets:
  - null values (np.nan)
  - "loners": points with a count above the threshold set by the argument loner_min_count. Shmistogram sets this dynamically by default as a somewhat log-linear function of len(data). With 100 points, the threshold is 8; with 100,000 it is 18.
  - the "crowd": all remaining points
- bins the "crowd" using a density estimation tree

Calling the plot method on the resulting object displays all components of the distribution on a single figure.
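The three-way split described above can be sketched with the standard library alone. This is an illustration, not the package's implementation: the function name split_values is hypothetical, loner_min_count is passed in explicitly rather than derived from len(data), and treating a count equal to the threshold as a loner is an assumption.

```python
import math
from collections import Counter


def split_values(data, loner_min_count):
    """Partition 1-D data into a null count, loner counts, and crowd values.

    Hypothetical helper for illustration; the package's dynamic default
    for the threshold is not reproduced here.
    """
    n_null = sum(1 for x in data if math.isnan(x))
    observed = [x for x in data if not math.isnan(x)]
    counts = Counter(observed)
    # Assumption: a value is a loner when its count is >= loner_min_count
    loners = {v: c for v, c in counts.items() if c >= loner_min_count}
    # Everything that is neither null nor a loner belongs to the crowd
    crowd = [x for x in observed if x not in loners]
    return n_null, loners, crowd


data = [0.0] * 40 + [42.0] * 20 + [1.5, 2.5, 2.5, 7.0] + [float("nan")] * 100
n_null, loners, crowd = split_values(data, loner_min_count=8)
```

Only the crowd would then be passed on to a binning method; the loners and nulls are carried along as counts.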

Why shmistogram?

Use case 1: Exploratory data analysis

A shmistogram can be more informative than a histogram by separating continuous and discrete variation:

- inconsistent rounding of any continuous variable can induce a mixture of point masses and relatively continuous observations
- "age of earning first driver's license" plausibly has structural modes at the legal minimum (which may vary by state) and otherwise varies continuously

Use case 2: Scalable, generative density estimation

The shmistogram scales approximately as O(n log(n)) with default settings (see speed_testing.ipynb). The resulting density model is easy to sample from, as a mixture of a piecewise uniform distribution and a multinomial distribution. Such a simple estimator works well as one of the required inputs of the CADE density estimation algorithm for high dimensional and mixed continuous/categorical data (see pydens).
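To illustrate why such a model is easy to sample from, here is a minimal sketch that draws from a mixture of piecewise-uniform bins and point masses. The function sample_mixture and the (left, right, count) bin representation are hypothetical and do not reflect the package's API; null values are left out of the mixture for simplicity.

```python
import random


def sample_mixture(bins, loners, n, seed=0):
    """Draw n samples from piecewise-uniform bins mixed with point masses.

    bins:   list of (left, right, count) tuples (hypothetical representation)
    loners: dict mapping a point-mass value to its count
    """
    rng = random.Random(seed)
    components = [("bin", b) for b in bins] + [("loner", v) for v in loners]
    # Each component is chosen with probability proportional to its count
    weights = [b[2] for b in bins] + list(loners.values())
    samples = []
    for kind, comp in rng.choices(components, weights=weights, k=n):
        if kind == "bin":
            left, right, _ = comp
            samples.append(rng.uniform(left, right))  # uniform within the bin
        else:
            samples.append(comp)  # a point mass returns its value exactly
    return samples


bins = [(-10.0, 10.0, 250), (10.0, 70.0, 250)]
loners = {0.0: 40, 42.0: 20}
samples = sample_mixture(bins, loners, n=1000)
```

The first step is the multinomial draw over components; the second is a uniform draw within the chosen bin, which is all the piecewise-uniform part requires.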

The shmistogram's adaptive bin width leads to a higher-fidelity representation of complicated distributions without substantially increasing the number of bins. This is not a new idea, and shmistogram wraps multiple binning methods that the user can choose from. See binning_methods.ipynb for details.

Binning

The default binning algorithm uses a binary density estimation tree to iteratively split the data into smaller bins. The split location (within a bin/leaf) maximizes a penalized improvement in the deviance (i.e. in-sample negative log likelihood). The penalty reflects

- a hard min_data_in_leaf constraint. This minimum currently defaults to 3.
- a soft penalty on bins with few observations

At each step we split the bin for which splitting produces the greatest penalized improvement. Splits proceed as long as the deviance improvement exceeds the number of leaves. This approach is inspired by the Akaike information criterion (AIC), although this may be an abuse of the criterion in the sense that we're using it as part of a greedy iterative procedure instead of using it to compare fully-formed models.
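The greedy procedure can be sketched as follows. This is a simplified reading of the description above, not the package's code: the names leaf_loglik, best_split, and greedy_bins are hypothetical, the soft penalty on small bins is omitted, candidate split points are assumed to be midpoints between adjacent distinct values, and no attempt is made at efficiency.

```python
import math


def leaf_loglik(n, width, total):
    # Log-likelihood of n points under a uniform density n / (total * width)
    return n * math.log(n / (total * width)) if n > 0 else 0.0


def best_split(values, lo, hi, total, min_data_in_leaf):
    """Return (gain, split_point, index) for the best split of one leaf, or None."""
    n = len(values)
    parent = leaf_loglik(n, hi - lo, total)
    best = None
    for i in range(min_data_in_leaf, n - min_data_in_leaf + 1):
        if values[i - 1] == values[i]:
            continue  # cannot split between identical values
        s = 0.5 * (values[i - 1] + values[i])
        gain = (leaf_loglik(i, s - lo, total)
                + leaf_loglik(n - i, hi - s, total) - parent)
        if best is None or gain > best[0]:
            best = (gain, s, i)
    return best


def greedy_bins(data, min_data_in_leaf=3):
    """Greedily split sorted data into variable-width bins (left, right, count)."""
    values = sorted(data)
    total = len(values)
    leaves = [(values[0], values[-1], values)]
    while True:
        candidates = []
        for k, (lo, hi, vals) in enumerate(leaves):
            split = best_split(vals, lo, hi, total, min_data_in_leaf)
            if split is not None:
                candidates.append((split[0], k, split[1], split[2]))
        if not candidates:
            break
        gain, k, s, i = max(candidates)
        # Stop once the best improvement no longer exceeds the number of leaves
        if gain <= len(leaves):
            break
        lo, hi, vals = leaves[k]
        leaves[k:k + 1] = [(lo, s, vals[:i]), (s, hi, vals[i:])]
    return [(lo, hi, len(vals)) for lo, hi, vals in leaves]


# 200 densely packed points followed by 50 sparse ones
data = [i / 200 for i in range(200)] + [1 + 9 * i / 50 for i in range(50)]
bins = greedy_bins(data)
```

On this toy input the first split lands between the dense and sparse regions, since that is where a single cut raises the in-sample log-likelihood the most.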

The variable-width binning algorithms of Bayesian block representations provide an alternative to our default binning algorithm. See demo for an example. See also Python Perambulations for a light conceptual introduction to Bayesian blocks.

Wishlist

Clarify the objective: There is a tension between optimizing a binner for (a) visualization purposes, such as avoiding tall narrow bins to minimize white space or adjusting the average bin width to tell a particular story, and (b) minimizing a formal measure of estimation accuracy, such as the expectation of deviance (taken over future observations from the true distribution). We should offer guidance on which binning method tends to be most effective for each of these goals.

Optimize speed for the default method. Scalability is a big part of the motivation for such a simple model, but the current implementation is far from optimal.

Compare/contrast/harmonize our binning methods with the literature:

- density estimation trees such as this
- distribution element trees such as detpack. See detpack_example.R for a simple variable-width binner.
- Efficient Density Estimation via Piecewise Polynomial Approximation

Project details

Release history

- 1.0.0 (Jan 12, 2025), this version
- 0.6.3 (Jan 3, 2025)
- 0.6.2 (Dec 23, 2024)
- 0.6.0 (Dec 22, 2024)
- 0.4.1 (Dec 15, 2024)
- 0.4.0 (Dec 15, 2024)
- 0.3.2 (Dec 15, 2024)
- 0.3.1 (Dec 14, 2024)
- 0.2.4 (Apr 13, 2019)

Download files

Source Distribution

shmistogram-1.0.0.tar.gz (15.2 kB view details)

Uploaded Jan 12, 2025 Source

Built Distribution

shmistogram-1.0.0-py3-none-any.whl (19.3 kB view details)

Uploaded Jan 12, 2025 Python 3

File details

Details for the file shmistogram-1.0.0.tar.gz.

File metadata

Download URL: shmistogram-1.0.0.tar.gz
Upload date: Jan 12, 2025
Size: 15.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.11

File hashes

Hashes for shmistogram-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`a4d1a1950f1cf54f09a39761bdb7cbf5c14a20e1fad9d88226e5551f9d0ee8cc`
MD5	`6faf03b27a98539575b483824beb499d`
BLAKE2b-256	`3eac8489c203f420856ed4bc82ff12f19b4483ba92404d334cb887e15af2e744`

File details

Details for the file shmistogram-1.0.0-py3-none-any.whl.

File metadata

Download URL: shmistogram-1.0.0-py3-none-any.whl
Upload date: Jan 12, 2025
Size: 19.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.11

File hashes

Hashes for shmistogram-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4de06122d0e241b2a94518c7cc8866c721aa2050d30645fa66c738e6542cf078`
MD5	`e25ffaf4fad602ffee9caf05460effb1`
BLAKE2b-256	`cb3db85ce3ab2c6c0161fdae346ffeb849f56dae4a706f1cc83f50333f1ab30d`

