Selects variable clusters for experimental design

Project description

VBVarSel

The goal of this package is to identify clusters of variables quickly, using a scalable, computationally efficient annealed variational Bayes algorithm for fitting high-dimensional mixture models with variable selection.

The preprint for the associated research paper can be found here.

Installation

The VBVarSel package can be installed from GitHub using pip:

pip install git+https://github.com/MRCBSU/PROJECT/#egg=vbvarsel

or directly from PyPI:

pip install vbvarsel

Using the package

Parameters for simulation

Every parameter has a default value, which can be kept or customised by the developer.

Simulation Parameters

Simulation parameters control how an experiment is simulated with synthetic data. The data is generated according to Crook et al.; read the paper here.

import vbvarsel as vbvs

sim_params = vbvs.global_parameters.SimulationParameters()

# default values for the simulation parameters.

n_observations: list[int] = [100,1000]
n_variables: int = 200
n_relevants: list[int] = [10, 20, 50, 100]
mixture_proportions: list[float] = [0.2, 0.3, 0.5]
means: list[int] = [-2, 0, 2]

Some things to note when customising parameters:

  • No value in n_relevants may exceed n_variables.
  • The values in mixture_proportions must sum to exactly 1.0.
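
The two rules above can be checked before a run. The following is a minimal sketch, not part of the package: check_simulation_parameters is a hypothetical helper, and it compares the sum against 1.0 with a small tolerance to absorb floating-point rounding, which is an assumption (the package itself may be stricter).

```python
import math


def check_simulation_parameters(n_variables, n_relevants, mixture_proportions):
    """Hypothetical helper: return a list of rule violations (empty if none)."""
    problems = []

    # Rule 1: no entry of n_relevants may exceed n_variables.
    too_large = [r for r in n_relevants if r > n_variables]
    if too_large:
        problems.append(
            f"n_relevants values {too_large} exceed n_variables={n_variables}"
        )

    # Rule 2: mixture_proportions must sum to 1.0.
    # Assumption: a tiny tolerance is allowed for floating-point rounding.
    total = sum(mixture_proportions)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"mixture_proportions sum to {total}, not 1.0")

    return problems


# The default values listed above satisfy both rules.
print(check_simulation_parameters(200, [10, 20, 50, 100], [0.2, 0.3, 0.5]))  # []

# Both rules are broken here, so two problems are reported.
print(check_simulation_parameters(200, [10, 250], [0.2, 0.3, 0.4]))
```

Returning a list of problems rather than raising on the first one lets a user fix every invalid value in a single pass.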

Hyperparameters

Hyperparameters affect the model itself: the number of iterations, the annealing temperature, the convergence threshold and so on. More information on the hyperparameters can be found in the docstrings. These also have default values, which the user can alter if desired. The default hyperparameters are listed below:

#Threshold for the ELBO convergence
threshold = 1e-1

#Maximum number of mixture components
k1 = 5 

#Prior coefficient count for Dirichlet prior
alpha0 = 1/(k1) #cabassi

#Shrinkage parameter of the Gaussian conditional prior
beta0 = (1e-3)*1.

#Degrees of freedom for the Gamma prior
a0 = 3.
    
#Shape parameter of the Beta distribution
d = 1

#Maximum starting annealing temperature. The default value of 1 applies no annealing.
t_max = 1.
#NOTE: t_max CANNOT equal zero. There are several functions that divide or multiply by t_max. One cannot divide by zero.
#If you need to get very close to zero, just use a very small decimal.

#Maximum number of iterations for the simulation
max_itr = 25

#Maximum number of models
max_models = 10
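
Two details of the listing above are easy to miss: alpha0 is derived from k1, and t_max must never be zero. The sketch below restates the defaults as a small dataclass that enforces both. HyperparameterSketch is a hypothetical stand-in written for illustration only; it is not the package's Hyperparameters class, whose constructor and validation may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HyperparameterSketch:
    """Hypothetical stand-in, NOT the package's Hyperparameters class."""

    threshold: float = 1e-1          # threshold for ELBO convergence
    k1: int = 5                      # maximum number of mixture components
    alpha0: Optional[float] = None   # filled in as 1 / k1 when left unset
    beta0: float = 1e-3              # shrinkage of the Gaussian conditional prior
    a0: float = 3.0                  # degrees of freedom for the Gamma prior
    d: float = 1.0                   # shape parameter of the Beta distribution
    t_max: float = 1.0               # starting temperature; 1 means no annealing
    max_itr: int = 25                # maximum number of iterations
    max_models: int = 10             # maximum number of models

    def __post_init__(self):
        # t_max is used as a divisor, so zero is rejected up front.
        if self.t_max == 0:
            raise ValueError("t_max must be non-zero; use a very small decimal instead")
        # Keep alpha0 tied to k1 unless the caller sets it explicitly.
        if self.alpha0 is None:
            self.alpha0 = 1 / self.k1


defaults = HyperparameterSketch()
print(defaults.alpha0)  # 0.2

custom = HyperparameterSketch(k1=10, t_max=0.5)
print(custom.alpha0)  # 0.1
```

Deriving alpha0 inside __post_init__ means that changing k1 alone keeps the two values consistent, which matches the relationship alpha0 = 1/k1 shown in the defaults.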

User-supplied data

Users may supply their own data, subject to a few caveats. Data must be passed in as a path to a file, which is then loaded into a pandas DataFrame. The algorithm accepts numerical data only. A set of labels (so-called "true labels") is preferred so that accuracy can be verified via the ARI (adjusted Rand index), but it is not required. If a dataset contains non-numerical columns, these must be passed as the cols_to_skip parameter in vbvarsel.main(), and they will be dropped from the DataFrame before the algorithm commences. When users supply their own data, none of the SimulationParameters are used; they are ignored even if they have been initialised.
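
For readers unfamiliar with the ARI mentioned above, the sketch below computes it from scratch using the standard pair-counting formula. It is for illustration only and makes no claim about how the package computes the score; scikit-learn's sklearn.metrics.adjusted_rand_score provides the same measure. The function name adjusted_rand_index is chosen here and is not part of the package.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand index between two labelings of the same observations."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two observations: nothing to disagree on

    # Count pairs of observations grouped together in both labelings,
    # in the true labeling, and in the predicted labeling.
    both = sum(comb(c, 2) for c in Counter(zip(true_labels, predicted_labels)).values())
    in_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    in_pred = sum(comb(c, 2) for c in Counter(predicted_labels).values())

    expected = in_true * in_pred / total_pairs   # agreement expected by chance
    maximum = (in_true + in_pred) / 2            # agreement of a perfect match
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivially identical
    return (both - expected) / (maximum - expected)


# Same partition under different label names scores 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0

# A partition that splits every true cluster scores below zero (about -0.5).
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index depends only on which observations are grouped together, the cluster numbers assigned by the algorithm do not need to match the numbering used in the true labels.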

Entry point

The package's entry point is vbvarsel.main(), and this is where all the aforementioned experiment parameters are passed. If they are not passed, they are generated using default values or, in the case of user-supplied data, ignored.

Data is processed through the simulation to identify clustering of relevant data. An optional save_output parameter can be passed to save the data to the current working directory. The simulation also returns a results object, should a user wish to work with the output data further.

import vbvarsel as vbvs

sim_params = vbvs.global_parameters.SimulationParameters()
hyp_params = vbvs.global_parameters.Hyperparameters()

results = vbvs.vbvarsel.main(simulation_parameters=sim_params, hyperparameters=hyp_params)
# With the parameters passed, the run needs no further input.

Contributing

If you are interested in contributing to this package, please submit a pull request.

Future implementations

A command-line interface (CLI).

Issues

If you come across an issue when using this package, please create an issue on the issues page and we will respond as soon as we can.

License

This project is developed by the MRC Biostatistics Unit at the University of Cambridge under the GNU Public License.

Download files

Source Distribution

vbvarsel-0.2.1.tar.gz (8.2 MB)

Built Distribution

vbvarsel-0.2.1-py3-none-any.whl (48.0 kB, Python 3)

