A library of information-theoretic methods

Project description

pyitlib is an MIT-licensed library of information-theoretic methods for data analysis and machine learning, implemented in Python and NumPy.

API documentation is available online at https://pafoster.github.io/pyitlib/.

pyitlib implements the following 19 measures on discrete random variables:

  • Entropy

  • Joint entropy

  • Conditional entropy

  • Cross entropy

  • Kullback-Leibler divergence

  • Symmetrised Kullback-Leibler divergence

  • Jensen-Shannon divergence

  • Mutual information

  • Normalised mutual information (7 variants)

  • Variation of information

  • Lautum information

  • Conditional mutual information

  • Co-information

  • Interaction information

  • Multi-information

  • Binding information

  • Residual entropy

  • Exogenous local information

  • Enigmatic information

The following estimators are available for each of the measures:

  • Maximum likelihood

  • Maximum a posteriori

  • James-Stein

  • Good-Turing

Missing data are supported, either using placeholder values or NumPy masked arrays.

Installation and codebase

pyitlib is listed on the Python Package Index at https://pypi.python.org/pypi/pyitlib/ and may be installed using pip as follows:

pip install pyitlib

The codebase for pyitlib is available at https://github.com/pafoster/pyitlib.

Notes for getting started

Import the module discrete_random_variable, as well as NumPy:

import numpy as np
from pyitlib import discrete_random_variable as drv

The methods implemented in discrete_random_variable accept NumPy arrays as input. Let’s compute the entropy of an array of discrete random variable realisations, using maximum likelihood estimation and measuring entropy in bits (the behaviour when no further arguments are given):

>>> X = np.array((1,2,1,2))
>>> drv.entropy(X)
array(1.0)
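
To make the quantity concrete, the maximum likelihood (plug-in) estimate can be sketched from first principles: relative frequencies stand in for the outcome probabilities. The helper name ml_entropy_bits is hypothetical and is not part of pyitlib; this is an illustration using only the standard library, not the library's implementation.

```python
from collections import Counter
from math import log2

def ml_entropy_bits(realisations):
    """Plug-in (maximum likelihood) entropy estimate in bits. Hypothetical helper."""
    counts = Counter(realisations)
    n = sum(counts.values())
    # Relative frequencies serve as the maximum likelihood probability estimates.
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(ml_entropy_bits([1, 2, 1, 2]))  # 1.0
```

Two equally frequent outcomes give one bit, in agreement with the drv.entropy result above.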

Any input that isn’t already a NumPy array is converted automatically by passing it to np.array(). Let’s compute entropy again using maximum likelihood estimation, but this time with list input and measuring entropy in nats:

>>> drv.entropy(['a', 'b', 'a', 'b'], base=np.exp(1))
array(0.6931471805599453)

Methods with the suffix _pmf operate on arrays of probability masses rather than on realisations. For example, the preceding realisations yield two equiprobable outcomes under maximum likelihood estimation, so the analogous method call is:

>>> drv.entropy_pmf([0.5, 0.5], base=np.exp(1))
0.69314718055994529
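
The role of the base parameter can be illustrated with a short standalone sketch: changing the logarithm base only rescales the result, so one bit corresponds to ln 2 nats. The function entropy_from_pmf is a hypothetical helper written for this note, not pyitlib's entropy_pmf.

```python
from math import e, log

def entropy_from_pmf(pmf, base=2):
    """Entropy of a probability mass function, in units set by `base`. Hypothetical helper."""
    if abs(sum(pmf) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * log(p, base) for p in pmf if p > 0)

bits = entropy_from_pmf([0.5, 0.5])          # entropy in bits
nats = entropy_from_pmf([0.5, 0.5], base=e)  # entropy in nats

# Converting between units is a multiplication by ln(2).
print(abs(nats - bits * log(2)) < 1e-12)
```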

It’s possible to specify missing data using placeholder values (the default placeholder value is -1). Elements equal to the placeholder value are subsequently ignored:

>>> drv.entropy([1, 2, 1, 2, -1])
array(1.0)

For measures expressible in terms of joint entropy (such as conditional entropy and mutual information), the random variables must have equally many realisations, which are paired by index. Missing data for random variable X causes the corresponding realisation of random variable Y to be ignored, and vice versa. Thus, the following method calls yield equivalent results (note the use of the alternative placeholder value None):

>>> drv.entropy_conditional([1,2,2,2], [1,1,2,2])
array(0.5)
>>> drv.entropy_conditional([1,2,2,2,1], [1,1,2,2,None], fill_value=None)
array(0.5)
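
The pairing rule described above can be sketched without pyitlib. The example assumes the identity H(X|Y) = H(X,Y) - H(Y) and drops every index position at which either variable holds the placeholder. Both helper names are hypothetical, and the code illustrates the behaviour rather than reproducing the library's implementation.

```python
from collections import Counter
from math import log2

def entropy_bits(symbols):
    """Plug-in entropy in bits of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def conditional_entropy_bits(x, y, fill_value=-1):
    """Estimate H(X|Y) = H(X,Y) - H(Y), ignoring incomplete pairs."""
    if len(x) != len(y):
        raise ValueError("x and y need equally many realisations")
    # Keep only index positions where neither variable is missing.
    pairs = [(a, b) for a, b in zip(x, y)
             if a != fill_value and b != fill_value]
    return entropy_bits(pairs) - entropy_bits([b for _, b in pairs])

complete = conditional_entropy_bits([1, 2, 2, 2], [1, 1, 2, 2])
padded = conditional_entropy_bits([1, 2, 2, 2, 1], [1, 1, 2, 2, None],
                                  fill_value=None)
print(complete, padded)  # 0.5 0.5
```

The fifth pair is discarded because its Y value is missing, so both calls see the same four pairs.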

It’s alternatively possible to specify missing data using NumPy masked arrays:

>>> Z = np.ma.array((1,2,1), mask=(0,0,1))
>>> drv.entropy(Z)
array(1.0)

In combination with any estimator other than maximum likelihood, it may be useful to specify alphabets containing unobserved outcomes. For example, we might seek to estimate the entropy in bits for the sequence of realisations [1,1,1,1]. Using maximum a posteriori estimation combined with the Perks prior (i.e. pseudo-counts of 1/L for each of L possible outcomes) and based on an alphabet specifying L=100 possible outcomes, we may use:

>>> drv.entropy([1,1,1,1], estimator='PERKS', Alphabet_X = np.arange(100))
array(2.030522626645241)
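
The effect of the pseudo-counts can be checked by hand. The sketch below adds 1/L to the count of each of the L alphabet symbols, normalises, and computes the entropy of the resulting distribution. The helper perks_entropy_bits is hypothetical, and whether pyitlib proceeds in exactly this way internally is an assumption; the sketch does, however, agree with the figure quoted above to within rounding.

```python
from collections import Counter
from math import log2

def perks_entropy_bits(realisations, alphabet):
    """Entropy estimate in bits using pseudo-counts of 1/L per outcome. Hypothetical helper."""
    alphabet = list(alphabet)  # assumed to contain no duplicates
    L = len(alphabet)
    counts = Counter(realisations)
    if not set(counts) <= set(alphabet):
        raise ValueError("alphabet must contain every observed outcome")
    n = len(realisations)
    # Adding 1/L to each of the L outcomes adds exactly 1 to the total count.
    probs = [(counts[a] + 1.0 / L) / (n + 1.0) for a in alphabet]
    return -sum(p * log2(p) for p in probs)

estimate = perks_entropy_bits([1, 1, 1, 1], range(100))
print(round(estimate, 6))  # 2.030523
```

Although only one outcome was observed, the 99 unobserved outcomes each receive a small probability mass, which is why the estimate is well above zero.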

Multi-dimensional array input is supported based on the convention that leading dimensions index random variables, with the trailing dimension indexing random variable realisations. Thus, the following array specifies realisations for 3 random variables:

>>> X = np.array(((1,1,1,1), (1,1,2,2), (1,1,2,2)))
>>> X.shape
(3, 4)

When using multi-dimensional arrays, any alphabets must be specified separately for each random variable represented in the multi-dimensional array, using placeholder values (or NumPy masked arrays) to pad out any unequally sized alphabets:

>>> drv.entropy(X, estimator='PERKS', Alphabet_X = np.tile(np.arange(100),(3,1))) # 3 alphabets required
array([ 2.03052263,  2.81433872,  2.81433872])

>>> A = np.array(((1,2,-1), (1,2,-1), (1,2,3))) # padding required
>>> drv.entropy(X, estimator='PERKS', Alphabet_X = A)
array([ 0.46899559,  1.        ,  1.28669267])

For ease of use, those methods operating on two random variable array arguments (such as entropy_conditional, information_mutual etc.) may be invoked with a single multi-dimensional array. In this way, we may compute mutual information for all pairs of random variables represented in the array as follows:

>>> drv.information_mutual(X)
array([[ 0.,  0.,  0.],
       [ 0.,  1.,  1.],
       [ 0.,  1.,  1.]])

The above is equivalent to setting the cartesian_product parameter to True and specifying two random variable array arguments explicitly:

>>> drv.information_mutual(X, X, cartesian_product=True)
array([[ 0.,  0.,  0.],
       [ 0.,  1.,  1.],
       [ 0.,  1.,  1.]])

By default, methods operating on several random variable array arguments don’t evaluate all combinations of random variables exhaustively. Instead, a one-to-one mapping is performed:

>>> drv.information_mutual(X, X) # Mutual information between 3 pairs of random variables
array([ 0.,  1.,  1.])

>>> drv.entropy(X) # Mutual information equivalent to entropy in above case
array([ 0.,  1.,  1.])
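
The difference between the two pairing modes can be sketched in plain Python, assuming the identity I(X;Y) = H(X) + H(Y) - H(X,Y) and plug-in estimates. The helper names are hypothetical and the code is an illustration, not pyitlib's implementation.

```python
from collections import Counter
from math import log2

def entropy_bits(symbols):
    """Plug-in entropy in bits of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mutual_information_bits(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from index-paired realisations."""
    return entropy_bits(x) + entropy_bits(y) - entropy_bits(list(zip(x, y)))

# Rows index random variables; entries within a row are realisations.
X = [(1, 1, 1, 1), (1, 1, 2, 2), (1, 1, 2, 2)]

# Exhaustive mode: every row compared with every row (a 3 x 3 result).
all_pairs = [[mutual_information_bits(a, b) for b in X] for a in X]

# One-to-one mode: row i compared with row i only (a length-3 result).
# Each row is paired with itself here, so I(X;X) reduces to H(X).
one_to_one = [mutual_information_bits(a, b) for a, b in zip(X, X)]
```

The one-to-one result is the diagonal of the exhaustive result, which matches the relationship between the outputs shown above.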

pyitlib provides basic support for pandas DataFrames and Series. Both types are converted internally to NumPy masked arrays, with data recorded as missing (according to .isnull()) being masked. Because the trailing dimension of a multi-dimensional array indexes random variable realisations, DataFrames typically need to be transposed so that their columns are treated as random variables:

>>> import pandas
>>> df = pandas.read_csv('https://raw.githubusercontent.com/veekun/pokedex/master/pokedex/data/csv/pokemon.csv')
>>> df = df[['height', 'weight', 'base_experience']].apply(lambda s: pandas.qcut(s, 10, labels=False)) # Bin the data
>>> drv.information_mutual_normalised(df.T) # Transposition required for comparing columns
array([[ 1.        ,  0.32472696,  0.17745753],
       [ 0.32729034,  1.        ,  0.13343504],
       [ 0.17848175,  0.13315407,  1.        ]])
