Gammy – Generalized additive models in Python with a Bayesian twist

A generalized additive model is a predictive model defined as a sum of terms that are calibrated (fitted) against observation data. This package provides a hopefully pleasant interface for configuring and fitting such models. A Bayesian interpretation of the model parameters is promoted, and the feature set is kept minimal.
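
Schematically, a fitted model is of the form

    y = f₁(x) + f₂(x) + ⋯ + f_N(x) + ε,

where each term fᵢ(x) = Σⱼ wᵢⱼ φᵢⱼ(x) is a weighted sum of fixed basis functions φᵢⱼ, the weights wᵢⱼ carry Gaussian prior (and posterior) distributions, and ε is observation noise.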

Summary

Generalized additive models form a surprisingly general framework for building predictive models for both production software and scientific research. This Python package offers tools for building model terms as decompositions of various basis functions. Terms can be modeled, for example, as Gaussian processes (with reduced dimensionality) with various kernels, as piecewise linear functions, or as B-splines. Of course, very simple terms such as lines and constants are also supported (these are just very simple basis functions).

The uncertainty in the weight parameter distributions is modeled using Bayesian statistics with the help of the superb package BayesPy, which implements variational Bayesian inference.

The work is at an early stage, so many features are still missing.

Examples

Polynomial regression on 'roids

Start with a very simple dataset:

from functools import reduce  # used in a later example

import matplotlib.pyplot as plt
import numpy as np

import gammy
# NOTE: Used repetitively in defining model terms!
from gammy.arraymapper import x


np.random.seed(42)


# Define dummy data
n = 30
input_data = 10 * np.random.rand(n)
y = 5 * input_data + 2.0 * input_data ** 2 + 7 + 10 * np.random.randn(n)

The object x is just a convenience tool for defining input data maps as if they were plain NumPy arrays.
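
For instance, expressions written in terms of x build callables that are applied to the actual input array at fit and predict time; roughly (an illustration of the idea, not gammy API documentation):

x ** 2      # behaves approximately like lambda data: data ** 2
x[:, 0]     # behaves approximately like lambda data: data[:, 0]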

# Define model; Scalar priors are (μ, Λ) pairs, i.e. the mean and
# precision of a Gaussian prior, so 1e-6 means a very wide prior
a = gammy.Scalar(prior=(0, 1e-6))
b = gammy.Scalar(prior=(0, 1e-6))
bias = gammy.Scalar(prior=(0, 1e-6))
formula = a * x + b * x ** 2 + bias
model = gammy.BayesianGAM(formula).fit(input_data, y)

The model attribute model.theta characterizes the Gaussian posterior distribution of the model parameter vector.
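
Assuming model.theta exposes the underlying BayesPy Gaussian node, its posterior mean could be read off along these lines (a sketch only):

model.theta.get_moments()[0]  # posterior mean of the parameter vector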

Predicting with model

model.predict(input_data[:3])
# array([  99.25493083,   23.31063443,  226.70702106])

Predictions with uncertainty can be calculated as follows (scale=2.0 roughly corresponds to the 95% confidence interval):

model.predict_total_uncertainty(input_data[:3], scale=2.0)
# (array([ 97.3527439 ,  77.79515549,  59.88285762]),
#  array([ 2.18915289,  2.19725385,  2.18571614]))
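
Reading the returned pair as the predictive mean and the scaled uncertainty, approximate 95% intervals can be formed as follows (a sketch, assuming the second array is the scaled standard deviation):

mean, band = model.predict_total_uncertainty(input_data[:3], scale=2.0)
lower, upper = mean - band, mean + band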

Plotting results

# Plot results
fig = gammy.plot.validation_plot(
    model,
    input_data,
    y,
    grid_limits=[0, 10],
    input_maps=[x, x, x],
    titles=["a", "b", "bias"]
)

The grey band in the top figure is two times the prediction standard deviation and, in the partial residual plots, two times the respective marginal posterior standard deviation.

[Figure: validation plot]

It is also possible to plot the estimated Γ-distribution of the noise precision (inverse variance) as well as the 1-D Normal distributions of each individual model parameter.

# Plot parameter probability density functions
fig = gammy.plot.gaussian1d_density_plot(model, grid_limits=[-1, 3])

[Figure: 1-D parameter posterior densities]

Saving the model to disk for later use (HDF5)

Saving

model.save("/home/foobar/test.hdf5")

Loading

model = gammy.BayesianGAM(formula).load("/home/foobar/test.hdf5")

Gaussian process regression ("kriging")

# Create some data
n = 50
input_data = np.vstack((2 * np.pi * np.random.rand(n), np.random.rand(n))).T
y = (
    np.abs(np.cos(input_data[:, 0])) * input_data[:, 1] 
    + 1 + 0.1 * np.random.randn(n)
)


# Define model
a = gammy.ExpSineSquared1d(
    np.arange(0, 2 * np.pi, 0.1),
    corrlen=1.0,
    sigma=1.0,
    period=2 * np.pi,
    energy=0.99
)
bias = gammy.Scalar(prior=(0, 1e-6))
formula = a(x[:, 0]) * x[:, 1] + bias
model = gammy.BayesianGAM(formula).fit(input_data, y)
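
The energy argument (here 0.99) controls the dimensionality reduction: presumably the covariance kernel is eigendecomposed on the given grid, and only as many leading eigenvectors are kept as basis functions as are needed to cover the stated share of the total eigenvalue mass.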


# Plot results
fig = gammy.plot.validation_plot(
    model,
    input_data,
    y,
    grid_limits=[[0, 2 * np.pi], [0, 1]],
    input_maps=[x[:, 0:2], x[:, 1]],
    titles=["a", "intercept"]
)


# Plot parameter probability density functions
fig = gammy.plot.gaussian1d_density_plot(model, grid_limits=[-1, 3])

[Figures: validation plot and parameter posterior densities]

More kernel functions for GPs

input_data = np.arange(0, 1, 0.01)

# Staircase function with 5 steps from 0...1
y = reduce(lambda u, v: u + v, [
    1.0 * (input_data > c) for c in [0, 0.2, 0.4, 0.6, 0.8]
])

grid = np.arange(0, 1, 0.001)
corrlen = 0.01
sigma = 2
a = gammy.ExpSquared1d(
    grid=grid,
    corrlen=corrlen,
    sigma=sigma,
    energy=0.999
)(x)
b = gammy.RationalQuadratic1d(
    grid=grid,
    corrlen=corrlen,
    alpha=1,
    sigma=sigma,
    energy=0.99
)(x)
c = gammy.OrnsteinUhlenbeck1d(
    grid=grid,
    corrlen=corrlen,
    sigma=sigma,
    energy=0.99
)(x)

exponential_squared = gammy.BayesianGAM(a).fit(input_data, y)
rational_quadratic = gammy.BayesianGAM(b).fit(input_data, y)
ornstein_uhlenbeck = gammy.BayesianGAM(c).fit(input_data, y)
# Plot boilerplate ...
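
Of the three kernels, the exponential squared gives the smoothest (infinitely differentiable) fits, the Ornstein-Uhlenbeck the roughest (its sample paths are continuous but nowhere differentiable), and the rational quadratic, a scale mixture of exponential squared kernels, sits in between.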

[Figure: staircase fits with the three kernels]

Define custom kernel

It is straightforward to define custom formulas from "positive semidefinite" covariance kernel functions. For example, min(x, x') is the covariance kernel of standard Brownian motion.

def kernel(x1, x2):
    """Kernel for min(x, x')

    With column-vector inputs of shape (n, 1), each argument is tiled
    into an (n, n) matrix so that entry (i, j) becomes min(x1[i], x2[j]).

    """
    r = lambda t: t.repeat(*t.shape)
    return np.minimum(r(x1), r(x2).T)
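
A quick standalone sanity check of the kernel (with hypothetical test points):

pts = np.array([[0.2], [0.5], [0.9]])
kernel(pts, pts)
# array([[0.2, 0.2, 0.2],
#        [0.2, 0.5, 0.5],
#        [0.2, 0.5, 0.9]])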


grid = np.arange(0, 1, 0.001)

Minimum = gammy.create_from_kernel1d(kernel)
a = Minimum(grid=grid, energy=0.999)(x)

# Let's compare to exp squared
b = gammy.ExpSquared1d(grid=grid, corrlen=0.05, sigma=1, energy=0.999)(x)


def sample(X):
    # Draw a random function from the basis by sampling standard normal
    # weights for the columns of the design matrix X
    return np.dot(X, np.random.randn(X.shape[1]))


ax = plt.figure().gca()
ax.plot(grid, sample(a.build_X(grid)), label="Custom")
ax.plot(grid, sample(b.build_X(grid)), label="Exp. squared")
ax.legend()

[Figure: random samples from the custom and exponential squared bases]

Multivariate Gaussian process regression

In this example we construct a basis corresponding to a multivariate Gaussian process with a Kronecker structure (see e.g. PyMC3).

Let us first create some artificial data using MATLAB's well-known peaks function:

# Create some data
n = 100
input_data = np.vstack((6 * np.random.rand(n) - 3, 6 * np.random.rand(n) - 3)).T


def peaks(x, y):
    """The MATLAB 'peaks' function

    """
    return (
        3 * (1 - x) ** 2 * np.exp(-(x ** 2) - (y + 1) ** 2) -
        10 * (x / 5 - x ** 3 - y ** 5) * np.exp(-x ** 2 - y ** 2) -
        1 / 3 * np.exp(-(x + 1) ** 2 - y ** 2)
    )


y = peaks(input_data[:, 0], input_data[:, 1]) + 4 + 0.3 * np.random.randn(n)

There is support for forming two-dimensional basis functions from two one-dimensional formulas. The new combined basis is essentially the outer product of the given bases, and the underlying weight prior means and covariances are constructed using the Kronecker product.
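
The construction can be sketched with plain NumPy (an illustration of the idea only; the array names here are hypothetical and not part of the gammy API):

# Row-wise outer products of the 1-D basis evaluations X1 (n, p) and
# X2 (n, q) give the combined (n, p * q) design matrix
X1 = np.random.randn(100, 5)
X2 = np.random.randn(100, 7)
X = np.einsum("ij,ik->ijk", X1, X2).reshape(100, -1)  # shape (100, 35)

# The prior covariance of the combined weights is the Kronecker product
# of the 1-D prior covariances
Sigma = np.kron(np.eye(5), np.eye(7))  # shape (35, 35)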

# Define model
a = gammy.ExpSquared1d(
    np.arange(-3, 3, 0.1),
    corrlen=0.5,
    sigma=4.0,
    energy=0.99
)(x[:, 0])  # note that we need to define the input map at this point!
b = gammy.ExpSquared1d(
    np.arange(-3, 3, 0.1),
    corrlen=0.5,
    sigma=4.0,
    energy=0.99
)(x[:, 1])  # note that we need to define the input map at this point!
A = gammy.Kron(a, b)
bias = gammy.Scalar(prior=(0, 1e-6))
formula = A + bias
model = gammy.BayesianGAM(formula).fit(input_data, y)

Note that the same logic could be used to construct higher-dimensional bases; that is, one could define

# 3-D formula (c being a third 1-D formula)
formula = gammy.Kron(gammy.Kron(a, b), c)

Finally, plot the results:

# Plot results
fig = gammy.plot.validation_plot(
    model,
    input_data,
    y,
    grid_limits=[[-3, 3], [-3, 3]],
    input_maps=[x, x[:, 0]],
    titles=["A", "intercept"]
)


# Plot parameter probability density functions
fig = gammy.plot.gaussian1d_density_plot(model, grid_limits=[-1, 3])

[Figures: validation plot and parameter posterior densities]

The original function can be plotted like so:

from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection


X, Y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = peaks(X, Y) + 4

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # fig.gca(projection=...) is deprecated
ax.plot_surface(X, Y, Z, color="r", antialiased=False)

[Figure: the peaks surface]

B-Spline basis

Constructing B-spline based 1-D basis functions is also supported. In the example below the plotting grid extends beyond the data range, which exercises the extrapolate=True option.

# Define dummy data
n = 30
input_data = 10 * np.random.rand(n)
y = 2.0 * input_data ** 2 + 7 + 10 * np.random.randn(n)


# Define model
grid = np.arange(0, 11, 2.0)
order = 2
N = len(grid) + order - 2  # number of B-spline basis functions
sigma = 10 ** 2
a = gammy.BSpline1d(
    grid,
    order=order,
    prior=(np.zeros(N), np.identity(N) / sigma),
    extrapolate=True
)
formula = a(x)
model = gammy.BayesianGAM(formula).fit(input_data, y)

# Plot results
fig = gammy.plot.validation_plot(
    model,
    input_data,
    y,
    grid_limits=[-2, 12],
    input_maps=[x],
    titles=["a"]
)


# Plot parameter probability density functions
fig = gammy.plot.gaussian1d_density_plot(model, grid_limits=[-1, 3])

[Figures: validation plot and parameter posterior densities]

To-be-added features

  • TODO Quick model template functions (e.g. splines, GPs)
  • TODO Shorter overview and examples in README. Other docs inside docs.
  • TODO Support indicator models in plotting
  • TODO Fixed ordering for GP-related basis functions.
  • TODO Hyperpriors for model parameters – Start from diagonal precisions. Instead of (μ, Λ) pairs, the arguments could be just BayesPy node.
  • TODO Add simple Numpy-based linear solver for MAP estimates.
  • TODO Support non-linear GAM models.
  • TODO Multi-dimensional observations.
  • TODO Dynamically changing models.
