
Histograms for Pandas/Spark/Numpy

Project description

histogrammar is a Python package for creating histograms. histogrammar has multiple histogram types, supports numeric and categorical features, and works with Numpy arrays and Pandas and Spark dataframes. Once a histogram is filled, it’s easy to plot it, store it in JSON format (and retrieve it), or convert it to Numpy arrays for further analysis.

At its core histogrammar is a suite of data aggregation primitives designed for use in parallel processing. In the simplest case, you can use this to compute histograms, but the generality of the primitives allows much more.
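To illustrate what "aggregation primitives designed for use in parallel processing" can mean in practice, here is a minimal pure-Python sketch. It is not histogrammar's implementation: the `MiniHistogram` class and the `fill_partition` helper are hypothetical names invented for this illustration. The idea it shows is that each partition of the data fills its own aggregator independently, and the partial results are then merged by addition into the same result a single pass would give.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class MiniHistogram:
    """Hypothetical fixed-width histogram (illustration only, not histogrammar's code)."""
    num: int
    low: float
    high: float
    counts: list = field(default_factory=list)
    underflow: int = 0
    overflow: int = 0

    def __post_init__(self):
        # Start with empty bins unless counts were passed in (as done when merging).
        if not self.counts:
            self.counts = [0] * self.num

    def fill(self, value):
        """Add one value to the matching bin, or to underflow/overflow."""
        if value < self.low:
            self.underflow += 1
        elif value >= self.high:
            self.overflow += 1
        else:
            width = (self.high - self.low) / self.num
            index = min(int((value - self.low) / width), self.num - 1)
            self.counts[index] += 1

    def __add__(self, other):
        """Merge two partial results; only valid when the binning is identical."""
        if (self.num, self.low, self.high) != (other.num, other.low, other.high):
            raise ValueError("binning must match to merge")
        return MiniHistogram(
            self.num, self.low, self.high,
            [a + b for a, b in zip(self.counts, other.counts)],
            self.underflow + other.underflow,
            self.overflow + other.overflow,
        )


def fill_partition(values):
    """Fill a fresh aggregator from one partition of the data."""
    hist = MiniHistogram(num=5, low=0.0, high=50.0)
    for value in values:
        hist.fill(value)
    return hist


# Three partitions, as if processed by three separate workers.
partitions = [[3, 12, 47], [25, 25, 60], [-1, 8]]
partials = [fill_partition(part) for part in partitions]
merged = reduce(lambda a, b: a + b, partials)

# Filling everything in a single pass gives the same aggregate.
single = fill_partition([v for part in partitions for v in part])
print(merged == single)
```

Because merging is just element-wise addition, the order in which partial results are combined does not matter, which is what makes this kind of aggregator convenient for distributed frameworks.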

Several common histogram types can be plotted in Matplotlib and Bokeh with a single method call. If Numpy or Pandas is available, histograms and other aggregators can be filled from arrays ten to a hundred times more quickly via Numpy commands, rather than Python for loops.

This Python implementation of histogrammar has been tested to guarantee compatibility with its Scala implementation.

Latest Python release: v1.1.2 (Sep 2025). Latest update: Sep 2025.

References

histogrammar is a core component of popmon, a package by ING bank that allows one to check the stability of a dataset. popmon works with both Pandas and Spark datasets, largely thanks to histogrammar.

Announcements

Changes

See the change log here.

Spark

With Spark, make sure to pick up the correct histogrammar jar files. Spark 4.X is based on Scala 2.13; Spark 3.X is based on Scala 2.12 or 2.13.

from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar_2.13:1.0.30,io.github.histogrammar:histogrammar-sparksql_2.13:1.0.30").getOrCreate()

For Scala 2.12, in the string above simply replace “2.13” with “2.12”.
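The substitution described above can be sketched as a small helper that builds the `spark.jars.packages` string for a given Scala version. The function name `histogrammar_spark_packages` and its parameters are hypothetical, introduced only for this illustration; the artifact names and the jar version 1.0.30 are taken from the string shown above.

```python
def histogrammar_spark_packages(scala_version="2.13", jar_version="1.0.30"):
    """Build the spark.jars.packages value for histogrammar (hypothetical helper)."""
    # Only the two Scala versions mentioned in this README are accepted here.
    if scala_version not in ("2.12", "2.13"):
        raise ValueError(f"unsupported Scala version: {scala_version}")
    artifacts = ("histogrammar", "histogrammar-sparksql")
    return ",".join(
        f"io.github.histogrammar:{name}_{scala_version}:{jar_version}"
        for name in artifacts
    )


# Example: the package string for a Spark build based on Scala 2.12.
packages = histogrammar_spark_packages("2.12")
print(packages)
```

The resulting string can be passed as the second argument to `SparkSession.builder.config("spark.jars.packages", ...)` in place of the literal shown above.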

September, 2025

Example notebooks

Tutorial (Colab link):

  • Basic tutorial: Open in Colab
  • Detailed example (featuring configuration, Apache Spark and more): Open in Colab
  • Exercises: Open in Colab

Documentation

See histogrammar-docs for a complete introduction to histogrammar. (A bit old but still good.) There you can also find documentation about the Scala implementation of histogrammar.

Check it out

The histogrammar library requires Python 3.8+ and is pip friendly. To get started, simply do:

$ pip install histogrammar

or check out the code from our GitHub repository:

$ git clone https://github.com/histogrammar/histogrammar-python
$ pip install -e histogrammar-python

Here the -e option installs the code in editable mode, so changes to the checked-out source take effect without reinstalling.

You can now use the package in Python with:

import histogrammar

Congratulations, you are now ready to use the histogrammar library!

Quick run

As a quick example, you can do:

import pandas as pd
import histogrammar as hg
from histogrammar import resources

# open synthetic data
df = pd.read_csv(resources.data('test.csv.gz'), parse_dates=['date'])
df.head()

# create a histogram, tell it to look for column 'age'
# fill the histogram with column 'age' and plot it
hist = hg.Histogram(num=100, low=0, high=100, quantity='age')
hist.fill.numpy(df)
hist.plot.matplotlib()

# generate histograms of all features in the dataframe using automatic binning
# (importing histogrammar automatically adds this functionality to a pandas or spark dataframe)
hists = df.hg_make_histograms()
print(hists.keys())

# multi-dimensional histograms are also supported. e.g. features longitude vs latitude
hists = df.hg_make_histograms(features=['longitude:latitude'])
ll = hists['longitude:latitude']
ll.plot.matplotlib()

# store histogram and retrieve it again
ll.toJsonFile('longitude_latitude.json')
ll2 = hg.Factory().fromJsonFile('longitude_latitude.json')

These examples also work with Spark dataframes (sdf):

from pyspark.sql.functions import col
hist = hg.Histogram(num=100, low=0, high=100, quantity=col('age'))
hist.fill.sparksql(sdf)

For more examples please see the example notebooks and tutorials.

Project contributors

This package was originally authored by DIANA-HEP and is now maintained by volunteers.

Contact and support

Please note that histogrammar is supported only on a best-effort basis.

License

histogrammar is completely free, open-source and licensed under the Apache-2.0 license.

