Selects variable clusters for experimental design (need to fix this later)
VBVarSel
The goal of this package is to identify clusters of variables quickly, using a scalable, computationally efficient annealed variational Bayes algorithm for fitting high-dimensional mixture models with variable selection.
The preprint for the associated research paper can be found here.
Installation
The VBVarSel package can be installed from GitHub using pip:
pip install git+https://github.com/MRCBSU/PROJECT/#egg=vbvarsel
or directly from PyPI:
pip install vbvarsel
Using the package
Parameters for simulation
Parameters can be left at their default values or customised by the user.
Simulation Parameters
Simulation parameters control an experiment run on synthetic data. The data are generated according to Crook et al.; read the paper here.
import vbvarsel as vbvs
sim_params = vbvs.global_parameters.SimulationParameters()
# default values for the simulation parameters.
n_observations: list[int] = [100,1000]
n_variables: int = 200
n_relevants: list[int] = [10, 20, 50, 100]
mixture_proportions: list[float] = [0.2, 0.3, 0.5]
means: list[int] = [-2, 0, 2]
Some things to note when customising parameters:
- No number in n_relevants should exceed the n_variables parameter.
- The mixture_proportions values must sum to exactly 1.0.
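The two rules above can be checked before starting a run. The following is a minimal sketch, assuming the values are plain Python lists and numbers; check_simulation_values is a hypothetical helper, not part of vbvarsel, and it compares the sum with a small tolerance, which may differ from how the package itself validates its input.

```python
import math


def check_simulation_values(n_variables, n_relevants, mixture_proportions):
    """Raise ValueError if the values break either rule listed above.

    Hypothetical helper for illustration; not part of the vbvarsel API.
    """
    # Rule 1: no entry of n_relevants may exceed n_variables.
    too_large = [n for n in n_relevants if n > n_variables]
    if too_large:
        raise ValueError(
            f"n_relevants entries {too_large} exceed n_variables={n_variables}"
        )

    # Rule 2: the mixture proportions must sum to 1.0.
    # A tiny absolute tolerance guards against floating-point rounding.
    total = sum(mixture_proportions)
    if not math.isclose(total, 1.0, rel_tol=0.0, abs_tol=1e-9):
        raise ValueError(f"mixture_proportions sum to {total}, expected 1.0")


# The default values shown above pass both checks without raising.
check_simulation_values(200, [10, 20, 50, 100], [0.2, 0.3, 0.5])
```

Running a check like this first reports a bad configuration immediately rather than partway through a simulation.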
Hyperparameters
Hyperparameters affect the model itself: the number of iterations, the annealing temperature, the threshold for ELBO convergence, and so on. More information on the hyperparameters can be found in the docstrings. These also have default values, which the user can alter if desired. The default hyperparameters are listed below:
#Threshold for the ELBO convergence
threshold = 1e-1
#Maximum number of mixture components
k1 = 5
#Prior coefficient count for Dirichlet prior
alpha0 = 1 / k1  # cabassi
#Shrinkage parameter of the Gaussian conditional prior
beta0 = 1e-3
#Degrees of freedom for the Gamma prior
a0 = 3.
#Shape parameter of the Beta distribution
d = 1
#Maximum starting annealing temperature. The default value of 1 applies no annealing.
t_max = 1.
#NOTE: t_max CANNOT equal zero. There are several functions that divide or multiply by t_max. One cannot divide by zero.
#If you need to get very close to zero, just use a very small decimal.
#Maximum number of iterations for the simulation
max_itr = 25
#Maximum number of models
max_models = 10
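The note on t_max can be illustrated with a generic annealing schedule. This is a minimal sketch under stated assumptions, not the package's implementation: the names geometric_schedule and tempered_weights and the geometric decay are hypothetical. It only shows why a starting temperature of 1 amounts to no annealing and why zero cannot be used.

```python
import math


def geometric_schedule(t_max: float, n_steps: int) -> list[float]:
    """Temperatures decaying geometrically from t_max down to 1.

    Hypothetical schedule for illustration only.
    """
    if t_max <= 0:
        # Tempering divides by the temperature, so zero is not allowed.
        raise ValueError("t_max must be strictly positive")
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    return [t_max ** (1 - i / (n_steps - 1)) for i in range(n_steps)]


def tempered_weights(log_weights: list[float], temperature: float) -> list[float]:
    """Normalised weights proportional to exp(log_weight / temperature)."""
    scaled = [lw / temperature for lw in log_weights]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# With t_max = 1 every temperature is 1, so nothing is tempered.
flat = geometric_schedule(1.0, 5)
# With t_max > 1 early steps use flatter weights, later steps sharper ones.
hot_to_cold = geometric_schedule(4.0, 3)
```

A higher temperature spreads weight more evenly across components, which is the usual motivation for annealing: early iterations explore more before the temperature settles at 1.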
User-supplied data
Users may supply their own data, subject to a few caveats. Data must be passed in as a path to a file, which is then loaded into a pandas DataFrame. The algorithm accepts numerical data only. A set of labels (so-called "true labels") is preferred, so that accuracy can be verified via the ARI (adjusted Rand index), but is not required. If a dataset contains non-numerical columns, these must be passed as the cols_to_skip parameter in vbvarsel.main(), and they will be dropped from the DataFrame before the algorithm commences. Users who supply their own data do not use any of the SimulationParameters; even if these are initialised, they are ignored.
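Before calling the package, it can help to find out which columns would need to go into cols_to_skip. The following is a minimal sketch using only the standard library; non_numeric_columns is a hypothetical helper, not part of vbvarsel, and the sample data is made up for illustration. The package itself loads the file with pandas.

```python
import csv
import io


def non_numeric_columns(csv_text: str) -> list[str]:
    """Return the names of columns holding at least one non-numeric value.

    Hypothetical helper for illustration; not part of the vbvarsel API.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged: list[str] = []
    for row in reader:
        for name, value in row.items():
            if name in flagged:
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                # Empty, missing, or textual cells all count as non-numeric.
                flagged.append(name)
    return flagged


# Made-up example data with two non-numerical columns.
sample = "sample_id,gene_a,gene_b,label\ns1,0.5,1.2,A\ns2,-0.3,2.0,B\n"
columns_to_skip = non_numeric_columns(sample)
```

For a file on disk, the text can be read first with pathlib.Path(path).read_text() and passed to the same function.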
Entry point
The package's entry point is vbvarsel.main(), and this is where all the aforementioned experiment parameters are passed. If they are not passed, they are generated using default values, or ignored in the case of user-supplied data.
Data is processed through the simulation to identify clustering in the relevant data. An optional save_output parameter can be passed to save the output to the current working directory. The simulation also returns a results object, in case a user wishes to use the output for further analysis.
import vbvarsel as vbvs
sim_params = vbvs.global_parameters.SimulationParameters()
hyp_params = vbvs.global_parameters.Hyperparameters()
results = vbvs.vbvarsel.main(simulation_parameters=sim_params, hyperparameters=hyp_params)
# No further setup is needed; main() runs the whole simulation from here.
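When true labels are available, the ARI mentioned under user-supplied data compares them with a predicted clustering. The following is a minimal standard-library sketch of the adjusted Rand index for any two labelings of the same items. It is an illustration of the metric, not the package's own code, and it makes no assumption about how the results object stores cluster assignments.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred) -> float:
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two items are needed")

    # Count pairs of items grouped together in both, and in each, labeling.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Identical partitions score 1.0 even when the label names differ.
score = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

The index is 1.0 for identical partitions and close to 0 for random agreement; it can be negative when agreement is worse than chance.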
Contributing
If you are interested in contributing to this package, please submit a pull request.
Future implementations
A command-line interface (CLI).
Issues
If you come across a problem when using this package, please create an issue on the issues page and we will respond as soon as we can.
License
This project is developed by the MRC Biostatistics Unit at the University of Cambridge under the GNU Public License.
File details
Details for the file vbvarsel-0.2.1.tar.gz.
File metadata
- Download URL: vbvarsel-0.2.1.tar.gz
- Upload date:
- Size: 8.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3bcb713b31fbbcbb4fe4aa7cb9eb201c20af3055d7e9637373d113696422b82c |
| MD5 | 21073ffe969a0b59d6a70e3a51195355 |
| BLAKE2b-256 | 213a70a4c5c105ac6aeb6f2f383701c64f8dc80fdc30ba19e379a92f04c31070 |
File details
Details for the file vbvarsel-0.2.1-py3-none-any.whl.
File metadata
- Download URL: vbvarsel-0.2.1-py3-none-any.whl
- Upload date:
- Size: 48.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 96abb4eb173a8774beefbfea7e4332971832e716b64e241b4c19452b3b0ed8af |
| MD5 | d12ad632f18313cc134f990c80b96574 |
| BLAKE2b-256 | 06d7ecb2f41129a0a4efecbf4c297282c9d3a84b656e86d00dbe0e76853fc4bb |