Implementation of Markov Chain Bayesian Clustering techniques, including DPM and MFM, with an abstract Mixture Model and Component Model API.

## Project description

# Bayesian Markov Chain Monte Carlo Clustering

Implementation of Markov Chain Monte Carlo Bayesian Clustering techniques, including DPM (Dirichlet Process Mixture Models [1]) and MFM (Mixture of Finite Mixtures [2]) mixture models, with an abstract Mixture Model and Component Model API.

Hyperparameter updates for DPM are implemented using an Empirical Bayes update procedure [3].

Final configuration selection is implemented using Least Squares clustering [4].

## Usage

### Installation and Setup

Python:
First, install with `pip install bmcc`

(or `pip3`

, depending on your version). Then, simply `import bmcc`

.
**NOTE**: Only Python 3 is officially supported.

R:
First, make sure Python 3 is installed, and install bmcc with `pip install bmcc`

. Then, install the R package with

```
library(devtools)
install_github("https://github.com/thetianshuhuang/bmcc_r")
```

To use, load the package `bmcc`

. You will also need to load `reticulate`

in order to deal with type conversions.

```
library(bmcc)
library(reticulate)
```

### Expected Types

The dataset should be an array with two dimensions, where each row is a data point. Arrays should be numpy arrays with data type float64, in contiguous C order.

- Python:
`data = np.array(<source>, dtype=np.float64)`

- R/reticulate:
`data = np_array(<source>, dtype="float64", order="C")`

Assignment vectors should be arrays. The value at each index indicates the cluster assignment. Since clusters are unordered, the value itself has no meaning, and should be ignored other than to determine uniqueness. Assignment vectors have type uint16.

- Python:
`assignments = np.array(<source>, dtype=np.uint16)`

- R/reticulate:
`assignments = np_array(<source>, dtype="uint16")`

Note that when using reticulate, R types default to 'numeric' (double). When calling functions that require integer arguments (i.e. indices, number of dimensions), integers must be explicitly specified:

```
x <- 25
x <- as.integer(x)
# or
x <- 25L
```

### Creating the Model

The model has two parts: the mixture model, and the component model. Currently, the component model has one option, and the mixture model has two options.

#### Normal Wishart

The normal wishart model assumes each component is drawn from a wishart distribution with degrees of freedom specified in the initializer, and scale matrix proportional to C^-1/df, where C is the covariance matrix of the observed points.

- Python:

```
component_model = bmcc.NormalWishart(df=3)
```

- R/reticulate:

```
component_model = NormalWishart(df = 3)
```

#### MFM (Mixture of Finite Mixtures)

See [2]. Takes two arguments: the dirichlet mixing parameter gamma, and a log prior function on the number of clusters. Gamma defaults to 1, and the prior defaults to poisson(mu=4).
**NOTE**: Make sure that the prior is given in log form!

- Python:

```
mixture_model = bmcc.MFM(gamma=1, prior=lambda k: poisson.logpmf(k, 4))
```

- R/reticulate:

```
prior <- function(k) { dpois(k, 4, log = TRUE) }
mixture_model = MFM(gamma=1, prior=py_func(prior))
```

### Running the Gibbs Sampler

Currently, only collapsed gibbs samplers are implemented. Is is possible to extend this to a general gibbs sampler using the API (documentation todo), but for now, the core library only implements collapsed gibbs.

The GibbsMixtureModel gibbs sampler takes 5 arguments: the dataset, the models created previously, an initial assignment vector (usually assigning all points to the same cluster), and a thinning factor. If the thinning factor is 1, all samples are kept; otherwise, only one sample out of every `thinning`

samples are kept, with the rest being immediately discarded.

- Python:

```
sampler = bmcc.GibbsMixtureModel(
data=data,
component_model=component_model,
mixture_model=mixture_model,
assignments=np.zeros(1000).astype(np.uint16),
thinning=1)
```

- R:

```
sampler = GibbsMixtureModel(
data=data,
component_model=component_model,
mixture_model=mixture_model,
assignments=np_array(rep(0, 1000), dtype="uint16"),
thinning=1L)
```

Finally, simply call the `iter`

method once for every iteration:

- Python:

```
for i in range(1000):
sampler.iter()
```

- R:

```
for(i in 1:1000):
sampler$iter()
```

You can also call `iter`

with an argument (i.e. `sampler.iter(10)`

) to run multiple iterations at once. I suggest running the loop with a progress bar of some sort:

- Python:

```
import tqdm
# ...
for i in tqdm(range(1000)):
sampler.iter()
```

- R:

```
pb <- txtProgressBar(min = 0, max = 5000, style = 3)
for(i in 1:5000) {
model$iter()
setTxtProgressBar(pb, i)
}
close(pb)
```

### Selecting a Result

Currently, only least squares configuration selection is implemented. Run by calling the `select_lstsq`

method of the `GibbsMixtureModel`

object with the burn in duration.

- Python:

```
res = sampler.select_lstsq(burn_in=100)
```

- R:

```
res <- sampler$select_lstsq(burn_in=100L)
```

If ground truths are available, call the `evaluate`

method of the resulting `LstsqResult`

object to run evaluation statistics. If oracle information is available, the `oracle`

(oracle assignments) and `oracle_matrix`

(oracle pairwise probability) arguments can optionally be passed to allow comparison.

- Python:

```
res.evaluate(
dataset.assignments,
oracle=dataset.oracle,
oracle_matrix=dataset.oracle_matrix)
# Plot traces
res.trace(plot=True)
# Plot pairwise membership and probability matrices
res.matrices(plot=True)
# Plot clustering
res.clustering(plot=True)
```

- R:

```
res$evaluate(
dataset$assignments,
oracle=dataset$oracle,
oracle_matrix=dataset$oracle_matrix)
# Plot traces
res$trace(plot=TRUE)
# Plot pairwise membership and probability matrices
res$matrices(plot=TRUE)
# Plot clustering
res$clustering(plot=TRUE)
```

## References

[1] Radford M. Neal (2000), "Markov Chain Sampling Methods for Dirichlet Process Mixture Models". Journal of Computational and Graphical Statistics, Vol. 9, No. 2.

[2] Jeffrey W. Miller, Matthew T. Harrison (2018), "Mixture Models with a Prior on the Number of Components". Journal of the American Statistical Association, Vol. 113, Issue 521.

[3] Jon D. McAuliffe, David M. Blei, Michael I. Jordan (2006), "Nonparametric empirical Bayes for the Dirichlet process mixture model". Statistics and Computing, Vol. 16, Issue 1.

[4] David B. Dahl (2006), "Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model". Bayesian Inference for Gene Expression and Proteomics.

## Project details

## Release history Release notifications | RSS feed

## Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.