Fast paralleliterative cosmological parameter estimation with Bayesian nonparametrics
Project description
Gaussbock
Fast paralleliterative cosmological parameter estimation with Bayesian nonparametrics
Gaussbock is a generalpurpose tool for parameter estimation with computationally expensive likelihood calculations, developed for highdimensional cosmological parameter estimation for contemporary missions like the Dark Energy Survey (DES), and with upcoming surveys like the Large Synoptic Sky Telescope (LSST) and ESA's Euclid mission in mind. Current efforts in cosmological parameter estimation often suffer from both the computational costs of approximating distributions in highdimensional parameter spaces and the widespread need for model tuning. Specifically, calculating the likelihoods of parameter proposals through anisotropic simulations of the universe imposes high computational costs, which leads to excessive time requirements per experiment in an era of cheap parallel computing resources.
Making use of parallelization and an iterative approach to Bayesian nonparametrics, the provided method starts with a set of data points from a rough distribution guess for the true posterior. In order to obtain such a sample, the code either uses an affineinvariant MCMC ensemble as intoduced by Goodman and Weare (2004) via emcee or an initial set of data points provided by the user. After that, a variational Bayesian nonparametric Gaussian mixture model (GMM), which makes use of an infinite Dirichlet process mixtures approximated with a stickbreaking representation (see Blei and Jordan, 2006), is fitted to the data points and a consistent number of data points is sampled from the model.
These data points are then resampled via probabilities derived from truncated importance sampling, an extension of importance sampling less sensitive to proposal distributions to avoid dominating highposterior samples and introduced by Ionides (2008), for which the computation is spread over the provided cores as an embarrassingly parallel problem. The model fitting and importance sampling steps are then iteratively repeated, shifting towards an evernarrower fit around the true posterior distribution, and leading to a significant speedup in comparison to other established cosmological parameter estimation approaches. At the end, Gaussbock returns a userspecified number of samples and, if the user wishes, the importance weights and the final model itself, allowing for the sampling of an unlimited number of data points and to investigate the importance probabilities to ensure a reasonable distribution of the latter.
If you use or adapt Gaussbock, please cite the corresponding paper (arXiv:1905.09800).
Installation
Gaussbock can be installed via PyPI, with a single command in the terminal:
pip install gaussbock
Alternatively, the file gaussbock.py
can be downloaded from the folder gaussbock
in this repository and used locally by placing the file into the working directory for a given project. An installation via the terminal is, however, highly recommended, as the installation process will check for the package requirements and automatically update or install any missing dependencies, thus sparing the user the effort of troubleshooting and installing them themselves.
Quickstart guide
Only three inputs are required by Gaussbock: The list of parameter ranges, each as a tuple with the lower and upper limit for each parameter (parameter_ranges
), the handle for a posterior function that is to be used to evaluate samples (posterior_evaluation
), and the required number of posterior samples in the output (output_samples
).
Gaussbock offers a variety of optional inputs. An affineinvariate MCMC ensemble to obtain an initial approximation of the posterior can be used, or users provide their own initial sample (initial_samples
). This takes the form ['automatic', int, int]
for the MCMC ensemble, with the integer referring to the number of walker and steps per walker, respectively, or ['custom', arraylike]
, with the arraylike object providing the initial samples in the parameter space.
The maximum number of Gaussbock iterations (gaussbock_iterations
) can be set, as can the threshold for the convergence criterion using interiteration variance differences (convergence_threshold
). Other optional inputs include the number of samples that are drawn from the current posterior approximation before each truncated importance sampling step (mixture_samples
) and the maximum number of expectationmaximization (EM) iterations to fit the variational Bayesian GMM (em_iterations
).
In addition, the start and end for a shrinking convergence threshold for the posterior fitting in the form [float, float]
can be provided (tolerance_range
), as can the maximum number of Gaussians to be fitted to samples in each iteration (model_components
) and the tpe of covariance parameters from the set {'full', 'tied', 'diag', spherical'}
used for the fitting process (model_covariance
). Another optional input is the method used to initialize the model's weights, means and covariances as either 'kmeans'
or 'random'
as possible values (parameter_init
). The armount of information the model fitting should provide during runtime can be set as 0
, 1
or 2
(model_verbosity
).
In order to make use of parallelization, the user can either choose to use MPI pools and set the corresponding input to True
(mpi_parallelization
), for example for running on supercomputers or local clusters, or specify a number of processes for multicore parallelization (processes
). Another boolean input is the choice whether both importance weights and the final model should be returned (weights_and_model
). Since truncated importance sampling is used, the truncation value for importance probability reweighting can be chosen as a float from [1.0, 3.0]
. Lower values lead to a more general fitting with strong truncation, whereas smaller values result in a higher level of retained dominant data points. This input should only be customized if the approximation is problematic and can't be resolved via other inputs.
The model used for the fitting process can be selected, with 'kde
' for kernel density estimation (KDE) being the default for problems in less than three dimensions, and 'gmm'
being the default otherwise to use the variational Bayesian GMM suitable for higherdimensional problems. Lastly, if KDE is used, the kernel bandwidth that should be used can be specified (kde_bandwidth
). The required and optional inputs, together with their default values and with optional inputs marked with an asterisk, are listed below, with D denoting the dimensionality of the parameter estimation problem, or the number of parameters.
Variables  Explanations  Default 

parameter_ranges  The lower and upper limit for each parameter  
posterior_evaluation  Evaluation function handle for the posterior  
output_samples  Number of posterior samples that are required  
* initial_samples  Choice of 'emcee' or a provided start sample  ['automatic', 50, 1000] 
* gaussbock_iterations  Maximum number of Gaussbock iterations  10 
*convergence_threshold  Threshold for interiteration convegence checks  1e3 
* mixture_samples  Number of samples drawn for importance sampling  1e5 
* em_iterations  Maximum number of EM iterations for the mixture model  1000 
* tolerance_range  The range for the shrinking convergence threshold  [1e2, 1e7] 
* model_components  Maximum number of Gaussians fitted to samples  ceiling((2 / 3) * D) 
* model_covariance  Type of covariance for the GMM fitting process  'full' 
* parameter_init  How to intialize model weights, means and covariances  'random' 
* model_verbosity  The amount of information printed during runtime  1 
* mpi_parallelization  Whether to parallelize Gaussbock using an MPI pool  False 
* processes  Number of processes Gaussbock should parallelize over  1 
* weights_and_model  Whether to return importance weights and the model  False 
* truncation_alpha  Truncation value for importance probability reweighting  2.0 
* model_selection  Type of model used for the fitting process  None 
* kde_bandwidth  Kernel bandwidth used when fitting via KDE  0.5 
After the installation via PyPI, or using the gaussbock.py
file locally, the usage looks like this:
from gaussbock import gaussbock
output = gaussbock(parameter_ranges = your_posterior_ranges,
posterior_evaluation = your_posterior_function,
output_samples = your_required_samples,
mpi_parallelization = True,
weights_and_model = True)
samples, weights, model = output
Note that, in the above example, we use two of the optional parameters to tell the tool to parallelize using MPI, for example for the use on a supercomputer, and to return the importance weights and the model, for example for weighting the returned samples and saving the model to draw further samples later. If we wouldn't set the weights and model return indicator to be true, there would be no need to split the output up, as the output would be just the list of samples.
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for gaussbock1.0.9py3noneany.whl
Algorithm  Hash digest  

SHA256  944d7f393818cc4de87c4c408cb5b005924fb8f6b8f9ff2f3893ef3a5c75507a 

MD5  e53fbc6312897925230cad1872a8d002 

BLAKE2b256  280fa80f270ba30bea717f747a722c040fa910a47b6197ac8baaa9c738f8d7c8 