
Package for performing Bayesian inference with insufficient and robust statistics.


Insufficient Gibbs Sampling



The InsufficientGibbs package provides functions for sampling from the posterior distribution of model parameters when only robust and insufficient statistics of the data are available. A comprehensive explanation of the underlying theory can be found in the associated paper on arXiv. Additionally, the code used to generate all figures presented in the paper is located in the figures folder.

Table of contents

  • Abstract
  • Install
  • Tutorial
  • Examples
  • Add new distributions/models

Abstract

In some applied scenarios, the availability of complete data is restricted, often due to privacy concerns; only aggregated, robust and inefficient statistics derived from the data are made accessible. These robust statistics are not sufficient, but they demonstrate reduced sensitivity to outliers and offer enhanced data protection due to their higher breakdown point. We consider a parametric framework and propose a method to sample from the posterior distribution of parameters conditioned on various robust and inefficient statistics: specifically, the pairs (median, MAD) or (median, IQR), or a collection of quantiles. Our approach leverages a Gibbs sampler and simulates latent augmented data, which facilitates simulation from the posterior distribution of parameters belonging to specific families of distributions. A by-product of these samples from the joint posterior distribution of parameters and data given the observed statistics is that we can estimate Bayes factors based on observed statistics via bridge sampling. We validate and outline the limitations of the proposed methods through toy examples and an application to real-world income data.

InsufficientGibbs package

Install

Install the latest release via pip:

pip install InsufficientGibbs

For the latest development version, clone the repository and install it via pip:

git clone https://github.com/antoineluciano/Insufficient-Gibbs-Sampling
cd Insufficient-Gibbs-Sampling/src/InsufficientGibbs
pip install .

Available distributions

  • Normal (Normal)
  • Cauchy (Cauchy)
  • Laplace (Laplace)
  • Gamma (Gamma) and its translated version (TranslatedGamma)
  • LogNormal (LogNormal) and its translated version (TranslatedLogNormal)
  • Weibull (Weibull) and its translated version (TranslatedWeibull)
  • Generalized Pareto (GeneralizedPareto)

Append Model to a distribution name to use it as a model (e.g., Cauchy becomes CauchyModel).

Tutorial

Create prior distributions

For each parameter of your model, create a variable of type Distribution representing its associated prior distribution. You must specify its hyperparameters and optionally its name.
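
As a minimal sketch (taken from the Cauchy example below, so the hyperparameter values are purely illustrative):

from InsufficientGibbs.Distribution import Normal, Gamma

# Prior on the Cauchy location parameter x_0
mu = Normal(0, 10, name="x_0")
# Prior on the Cauchy scale parameter gamma
sigma = Gamma(2, 2, name="gamma")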

Create the model

Create a variable of type Model with your selected distribution and its predefined parameters as arguments.
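
Continuing the sketch above (the full example appears below):

from InsufficientGibbs.Models import CauchyModel

# Cauchy model built from the two prior distributions defined above
model = CauchyModel(mu, sigma)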

Sample from the posterior given insufficient statistics

To sample from the posterior of the model parameters given the observed statistics, use one of the three methods of the Model class:

  • Gibbs_Quantile: the observed data consists of a sequence of quantiles.
  • Gibbs_med_IQR: the observed data consists of the median and the interquartile range (IQR).
  • Gibbs_med_MAD: the observed data consists of the median and the median absolute deviation (MAD).

Examples

Cauchy distribution with Normal and Gamma priors

from InsufficientGibbs.Distribution import Normal, Gamma
from InsufficientGibbs.Models import CauchyModel

# Creating the prior distributions variables
mu = Normal(0, 10, name="x_0")
sigma = Gamma(2, 2, name="gamma")

# Creating the model variable
model = CauchyModel(mu,sigma)

T, N = 10000, 100  # T: number of Gibbs iterations, N: sample size

# Quantile case 
q1, med, q3 = -1, 0, 1
probs = [.25, .5, .75]
Cauchy_Q = model.Gibbs_Quantile(T, N, [q1, med, q3], probs)

# Median, IQR case
med, IQR = 0, 2
Cauchy_med_IQR = model.Gibbs_med_IQR(T, N, med, IQR)

# Median, MAD case
med, MAD = 0, 1
Cauchy_med_MAD = model.Gibbs_med_MAD(T, N, med, MAD)

You can display the resulting chains using the display_chains function.

How to add new distributions and models

As outlined in our paper, the method is applicable to any continuous distribution, as it only requires the ability to simulate from a truncated version of that distribution. We therefore provide guidance on how users can seamlessly integrate their own models by adding new subclasses of the Distribution and Model classes.

For the Distribution class, a subclass should include an __init__ method defining a _distribution attribute that holds all relevant distribution functions (e.g., pdf, cdf, ppf), as well as a domain method describing its domain of definition.

Similarly, for the Model class, a subclass should include an __init__ method, along with methods returning initial parameter values for the three studied scenarios (Init_theta_Quantile, Init_theta_med_MAD, and Init_theta_med_IQR).

To illustrate, we present examples of distributions. First, we showcase a distribution already implemented in SciPy (the Pareto distribution), followed by a custom implementation (the Pareto Type II distribution).

Example of the Pareto Distribution

Add the Pareto subclass of Distribution to the file Distribution.py:

from scipy.stats import pareto

class Pareto(Distribution):
    """
    Container for Pareto Distribution

    Parameters
    ----------
    scale : float
        scale of the pareto
    shape : float
        shape of the pareto
    name : str
    """
    def __init__(self, 
            scale: float=1,
            shape: float=1,
            name: str="") -> None:
        self.scale = PositiveContinuousVariable(scale, name='scale')
        self.shape = PositiveContinuousVariable(shape, name='shape')
        self.name = name
        self._distribution = pareto(b=self.shape.value, scale=self.scale.value)
        parameters_dict = {'scale':self.scale, 'shape':self.shape}
        super().__init__(parameters_dict, self.name)

    def domain(self) -> Tuple[float, float]:
        return (self.parameters_value["scale"], float('inf'))
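
A quick, hypothetical sanity check of the new class (parameter values are illustrative):

# Instantiate the new distribution and inspect its support and quantiles
p = Pareto(scale=2, shape=3)
print(p.domain())                              # support starts at the scale value: (2, inf)
print(p._distribution.ppf([0.25, 0.5, 0.75]))  # quantiles from the underlying scipy.stats.pareto object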

Add the ParetoModel subclass of Model to the file Models.py:

class ParetoModel(Model):
    def __init__(self, scale:Distribution, shape:Distribution) -> None:
        if scale.name == "": scale.name = "scale"
        if shape.name == "": shape.name = "shape"
        
        if scale.name == shape.name:
            raise ValueError("parameters must have different names.")
        
        self.scale = scale
        self.shape = shape
        self.type_distribution = Pareto
        self.parameters_dict = {scale.name: scale, shape.name: shape}
        super().__init__(self.parameters_dict)
        self.distrib_name = "pareto"
        self.init_method = "naive"
        
    def domain(self) -> Tuple[float, float]:
        return (0, float('inf'))
    
    def Init_theta_Quantile(self, Q, P):
        scale = Q[0]-1
        shape = (Q[-1] - Q[0]) / (Pareto(scale=scale)._distribution.ppf(P[-1]) - Pareto(scale=scale)._distribution.ppf(P[0]))
        return {self.scale.name: scale, self.shape.name: shape}
    
    def Init_theta_med_MAD(self, med, MAD):
        scale = 1
        shape = 1.5
        return {self.scale.name: scale, self.shape.name: shape}
    
    def Init_theta_med_IQR(self, med, IQR):
        scale = 1
        shape = 1.5
        return {self.scale.name: scale, self.shape.name: shape}
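
With both pieces in place, the new model can be used like the built-in ones. A hypothetical usage sketch (the Gamma priors and observed values are illustrative choices, not part of the package's examples):

from InsufficientGibbs.Distribution import Gamma

# Illustrative priors on the two Pareto parameters
scale_prior = Gamma(2, 2, name="scale")
shape_prior = Gamma(2, 2, name="shape")

pareto_model = ParetoModel(scale_prior, shape_prior)

# Posterior sampling given an observed median and MAD (illustrative values)
T, N = 1000, 100
Pareto_med_MAD = pareto_model.Gibbs_med_MAD(T, N, 2, 1)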

Example of the Pareto Type II distribution

First, define a class pareto2 that plays the same role as a frozen SciPy distribution (exposing pdf, cdf, ppf, logpdf, and rvs).

import numpy as np

class pareto2:
    """Pareto Type II (Lomax) distribution with location loc, mimicking a frozen SciPy distribution."""
    def __init__(self, loc=0, scale=1, shape=1):
        self.loc = loc
        self.scale = scale
        self.shape = shape

    def pdf(self, x):
        x = np.array(x)
        return np.where(x >= self.loc, (self.shape / self.scale) * (1 + (x - self.loc) / self.scale) ** (-self.shape - 1), 0)

    def cdf(self, x):
        x = np.array(x)
        return np.where(x >= self.loc, 1 - (1 + (x - self.loc) / self.scale) ** (-self.shape), 0)

    def ppf(self, x):
        x = np.array(x)
        return self.loc + self.scale * ((1 - x) ** (-1 / self.shape) - 1)

    def logpdf(self, x):
        x = np.array(x)
        return np.where(x >= self.loc, np.log(self.shape) - np.log(self.scale) - (self.shape + 1) * np.log(1 + (x - self.loc) / self.scale), -np.inf)

    def rvs(self, size):
        # Inverse-transform sampling: apply ppf to uniform draws
        return self.ppf(np.random.random(size))
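
A quick round-trip check of this helper class (values are illustrative), verifying that ppf inverts cdf on the support:

import numpy as np

d = pareto2(loc=1, scale=2, shape=3)
x = d.rvs(5)                              # inverse-transform samples, all >= loc
print(np.allclose(d.ppf(d.cdf(x)), x))    # should print True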

Then, add the ParetoType2 subclass of Distribution, which uses the pareto2 class above, to the file Distribution.py:

class ParetoType2(Distribution):

    def __init__(self,
            loc: float = 0,
            scale: float = 1,
            shape: float = 1,
            name: str = "",
            theta: list = []) -> None:
        if theta != []:
            loc, scale, shape = theta
        self.loc = ContinuousVariable(loc, name='loc')
        self.scale = PositiveContinuousVariable(scale, name='scale')
        self.shape = PositiveContinuousVariable(shape, name='shape')
        self.name = name
        self._distribution = pareto2(loc=self.loc.value, scale=self.scale.value, shape=self.shape.value)
        parameters_dict = {'loc': self.loc, 'scale': self.scale, 'shape': self.shape}
        super().__init__(parameters_dict, self.name)

    def domain(self) -> Tuple[float, float]:
        return (self.parameters_value["loc"], float('inf'))

Finally, add the ParetoType2Model subclass of Model to the file Models.py:

class ParetoType2Model(Model):
    def __init__(self, loc:Distribution, scale:Distribution, shape:Distribution) -> None:
        if loc.name == "": loc.name = "loc"
        if scale.name == "": scale.name = "scale"
        if shape.name == "": shape.name = "shape"
        
        if loc.name == scale.name or loc.name == shape.name or scale.name == shape.name:
            raise ValueError("parameters must have different names.")
        
        self.loc = loc
        self.scale = scale
        self.shape = shape
        self.type_distribution = ParetoType2
        self.parameters_dict = {loc.name: loc, scale.name: scale, shape.name: shape}
        super().__init__(self.parameters_dict)
        self.distrib_name = "pareto_type2"
        self.init_method = "naive"
        
    def domain(self) -> Tuple[float, float]:
        return (self.parameters_value["loc"], float('inf'))
    
    def Init_theta_Quantile(self, Q, P):
        loc = 2*Q[0]-Q[1]
        shape = 2.5
        scale = (Q[-1] - Q[0]) / (ParetoType2(loc=loc, shape=shape)._distribution.ppf(P[-1]) - ParetoType2(loc=loc, shape=shape)._distribution.ppf(P[0]))
        return {self.loc.name: loc, self.scale.name: scale, self.shape.name: shape}
    
    def Init_theta_med_MAD(self, med, MAD):
        loc = med-2*MAD
        scale = 1
        shape = 1.5
        return {self.loc.name: loc, self.scale.name: scale, self.shape.name: shape}
    
    def Init_theta_med_IQR(self, med, IQR):
        loc = med-IQR
        scale = 1
        shape = 1.5
        return {self.loc.name: loc, self.scale.name: scale, self.shape.name: shape}
    

MIT License

Copyright (c) 2023 Antoine LUCIANO

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
