
Quickly identify clusters of variables by using a scalable, computationally efficient annealed variational Bayesian algorithm.


VBVarSel

The goal of this package is to quickly and efficiently identify clusters of variables by using a scalable, computationally efficient annealed variational Bayes algorithm for fitting high-dimensional mixture models with variable selection.

The preprint for the associated research paper can be found here.

Installation

The VBVarSel package can be installed from GitHub using pip:

pip install git+https://github.com/MRCBSU/PROJECT/#egg=vbvarsel

or directly from PyPI:

pip install vbvarsel

Using the package

Parameters for simulation

Parameters can be left at their default values or customised by the user.

Simulation Parameters

Simulation parameters control how synthetic data is generated for a simulated experiment. The data is created according to Crook et al.; read the paper here.

import vbvarsel.vbvarsel as vbvs
# from vbvarsel import vbvarsel #alternate import method

sim_params = vbvs.global_parameters.SimulationParameters()

# default values for the simulation parameters.

n_observations: list[int] = [100,1000]
n_variables: int = 200
n_relevants: list[int] = [10, 20, 50, 100]
mixture_proportions: list[float] = [0.2, 0.3, 0.5]
means: list[int] = [-2, 0, 2]
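To make the role of these parameters concrete, here is a minimal sketch of how such a dataset could be generated. It is not the package's own generator: the function name simulate_dataset is hypothetical, and the scheme (relevant variables drawn from unit-variance Gaussians centred on the component mean, irrelevant variables drawn from a standard normal) is an assumption for illustration. The paper describes the actual procedure, which may differ.

```python
import random


def simulate_dataset(n_observations, n_variables, n_relevant,
                     mixture_proportions, means, seed=0):
    """Hypothetical sketch: draw a labelled dataset from a Gaussian mixture."""
    rng = random.Random(seed)
    components = list(range(len(means)))
    # Assign each observation to a component according to the proportions.
    labels = rng.choices(components, weights=mixture_proportions,
                         k=n_observations)
    data = []
    for label in labels:
        # Assumption: relevant variables depend on the component mean.
        relevant = [rng.gauss(means[label], 1.0) for _ in range(n_relevant)]
        # Assumption: irrelevant variables carry no cluster information.
        irrelevant = [rng.gauss(0.0, 1.0)
                      for _ in range(n_variables - n_relevant)]
        data.append(relevant + irrelevant)
    return data, labels


# One combination taken from the default values listed above.
data, labels = simulate_dataset(
    n_observations=100,
    n_variables=200,
    n_relevant=10,
    mixture_proportions=[0.2, 0.3, 0.5],
    means=[-2, 0, 2],
)
```

Each row of data holds the relevant variables first, followed by the irrelevant ones, and labels holds the component index of each observation.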

Some things to note when customising parameters:

  • No value in n_relevants may exceed n_variables.
  • The values in mixture_proportions must sum to exactly 1.0.
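The two constraints above can be checked before an experiment is started. The helper below is a hypothetical sketch and not part of the package. It compares the sum against 1.0 with a small floating-point tolerance, which is an assumption; the package itself may be stricter. It also assumes one mean per mixture component, which the default values suggest but the text does not state.

```python
import math


def check_simulation_parameters(n_variables, n_relevants,
                                mixture_proportions, means):
    """Hypothetical pre-check; returns a list of problems (empty if none)."""
    problems = []
    for n_relevant in n_relevants:
        if n_relevant > n_variables:
            problems.append(
                f"n_relevants value {n_relevant} exceeds "
                f"n_variables ({n_variables})"
            )
    total = sum(mixture_proportions)
    # Assumption: allow for floating-point rounding in the sum.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"mixture_proportions sum to {total}, not 1.0")
    # Assumption: each mixture component has exactly one mean.
    if len(mixture_proportions) != len(means):
        problems.append("mixture_proportions and means differ in length")
    return problems


# The default values pass all checks.
default_problems = check_simulation_parameters(
    n_variables=200,
    n_relevants=[10, 20, 50, 100],
    mixture_proportions=[0.2, 0.3, 0.5],
    means=[-2, 0, 2],
)
```

Returning a list rather than raising on the first failure lets a user see every problem with a custom configuration at once.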

Hyperparameters

Hyperparameters affect the model itself, such as how many iterations it runs, the annealing temperature, the convergence threshold and so on. More information on the hyperparameters can be found in the docstrings. These also have default values, which the user can alter if desired. The default hyperparameters are described below.

#Threshold for the ELBO convergence
threshold = 1e-1

#Maximum number of mixture components
k1 = 5 

#Prior coefficient count for Dirichlet prior
alpha0 = 1/k1 #cabassi

#Shrinkage parameter of the Gaussian conditional prior
beta0 = (1e-3)*1.

#Degrees of freedom for the Gamma prior
a0 = 3.
    
#Shape parameter of the Beta distribution
d = 1

#Maximum starting annealing temperature. The default value of 1 applies no annealing.
t_max = 1.
#NOTE: t_max CANNOT equal zero. There are several functions that divide or multiply by t_max. One cannot divide by zero.
#If you need to get very close to zero, just use a very small decimal.

#Maximum number of iterations for the simulation
max_itr = 25

#Maximum number of models
max_models = 10
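The defaults above, and the rule that t_max cannot be zero, can be captured in a small container. The following is a hypothetical sketch and is not the package's Hyperparameters class; the name HyperparameterSketch is invented for illustration. It derives alpha0 from k1 as in the listing, and it assumes that negative temperatures are also invalid, which the text does not state.

```python
from dataclasses import dataclass, field


@dataclass
class HyperparameterSketch:
    """Hypothetical container mirroring the default values listed above."""
    threshold: float = 1e-1   # threshold for the ELBO convergence
    k1: int = 5               # maximum number of mixture components
    beta0: float = 1e-3       # shrinkage parameter of the Gaussian prior
    a0: float = 3.0           # degrees of freedom for the Gamma prior
    d: int = 1                # shape parameter of the Beta distribution
    t_max: float = 1.0        # starting annealing temperature (1 = none)
    max_itr: int = 25         # maximum number of iterations
    max_models: int = 10      # maximum number of models
    alpha0: float = field(init=False)  # derived from k1 after init

    def __post_init__(self):
        if self.k1 < 1:
            raise ValueError("k1 must be at least 1")
        # t_max is used as a divisor, so zero is invalid.
        # Assumption: negative temperatures are rejected as well.
        if self.t_max <= 0:
            raise ValueError("t_max must be greater than zero")
        self.alpha0 = 1.0 / self.k1


defaults = HyperparameterSketch()
no_annealing = defaults.t_max == 1.0
```

Validating at construction time surfaces a bad t_max immediately, rather than as a division error partway through a run.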

User-supplied data

Users may supply their own data, subject to a few caveats. Data must be passed as a path to a file, which is then loaded into a pandas DataFrame. The algorithm accepts numerical data only. A set of labels (so-called "true labels") is preferred, so that accuracy can be verified via the ARI (adjusted Rand index), but is not required. If a dataset contains non-numerical columns, these must be passed as the cols_to_skip parameter in vbvarsel.main(), and they will be dropped from the DataFrame before the algorithm starts. When users supply their own data, the SimulationParameters are not used; if they are initialised, they are ignored.
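For readers unfamiliar with the ARI, the sketch below computes it from two label lists using only the standard library. It illustrates the standard formula and is not the package's implementation; the function name adjusted_rand_index is hypothetical. A score of 1.0 means the two clusterings agree perfectly (up to a renaming of the labels), and values near 0 indicate agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Hypothetical sketch of the adjusted Rand index for two labelings."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two observations: trivially identical

    # Contingency counts and their marginals.
    pair_counts = Counter(zip(true_labels, predicted_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)

    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (index - expected) / (maximum - expected)


# Same partition with the label names swapped.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

Because the score depends only on which observations are grouped together, the cluster numbers produced by the algorithm do not need to match the numbering used in the true labels.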

Entry point

The package's entry point is vbvarsel.main(), and this is where all the aforementioned experiment parameters are passed. Any that are not passed are generated from default values or, in the case of user-supplied data, ignored.

Data is processed through the simulation to identify clusters of relevant data. An optional save_output parameter can be passed to save the output to the current working directory. The simulation also returns a results object for users who wish to use the output further.

import vbvarsel as vbvs

sim_params = vbvs.global_parameters.SimulationParameters()
hyp_params = vbvs.global_parameters.Hyperparameters()

results = vbvs.vbvarsel.main(simulation_parameters=sim_params, hyperparameters=hyp_params)
# main() runs the whole experiment; no further calls are needed.

Contributing

If you are interested in contributing to this package, please submit a pull request.

Future implementations

A command-line interface (CLI).

Issues

If you come across a problem when using this package, please create an issue on the issues page and we will respond as soon as we can.

License

This project is developed by the MRC-Biostatistics Unit at Cambridge University under the GNU Public license.
