Vector-Space Markov Random Fields (VSMRFs)
============================================
![Example of a learned VSMRF](https://raw.githubusercontent.com/tansey/vsmrfs/master/data/mfp_top.png)
This package provides support for learning MRFs where each node-conditional can be any generic exponential family distribution. Currently supported node-conditionals are Bernoulli, Gamma, Gaussian, Dirichlet, and Point-Inflated models. See `exponential_families.py` for guidance on how to add a custom node-conditional distribution.
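As a rough illustration of what a node-conditional in exponential-family form looks like, here is a minimal sketch of a Bernoulli node. This is illustrative only and does not reflect the package's actual interface; see `exponential_families.py` for that. The class and method names here are hypothetical.

```python
import numpy as np

# Hypothetical sketch of a node-conditional in exponential-family form:
#   p(x | eta) = h(x) * exp(eta . T(x) - A(eta))
class BernoulliNode:
    def sufficient_statistics(self, x):
        # For a Bernoulli variable, T(x) = x
        return np.array([x], dtype=float)

    def log_partition(self, eta):
        # A(eta) = log(1 + exp(eta))
        return np.log1p(np.exp(eta[0]))

    def log_density(self, x, eta):
        return float(eta @ self.sufficient_statistics(x) - self.log_partition(eta))

node = BernoulliNode()
# With natural parameter eta = 0, the Bernoulli is uniform: p(x=1) = 0.5
print(np.exp(node.log_density(1, np.array([0.0]))))  # → 0.5
```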
Installation
------------
`pip install vsmrfs`
Requires `numpy`, `scipy`, and `matplotlib`.
Running
-------
### 0) Experiment directory structure
The `vsmrfs` package is very opinionated. It assumes you have an experiment directory with a very specific structure. If your experiment directory is `exp`, then you need the following structure:
```
exp/
    args/
    data/
        nodes.csv
        sufficient_statistics.csv
    edges/
    metrics/
    plots/
    weights/
```
Note that all you really need at first is the `exp/data/` structure and the two files. If you generate your data from the synthetic dataset creator, the structure will be created for you.
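If you are setting the layout up by hand, a few lines of Python (or an equivalent `mkdir -p`) will create the directories; the two CSV files under `exp/data/` still have to be supplied by you:

```python
import os

# Create the expected experiment layout for an experiment directory "exp"
exp = "exp"
for sub in ["args", "data", "edges", "metrics", "plots", "weights"]:
    os.makedirs(os.path.join(exp, sub), exist_ok=True)
```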
### 1a) Generating synthetic data (optional)
If you are just trying out the package or benchmarking the algorithm, you can create a synthetic dataset that will set up all of the experiment structure for you. For example, say you want to run an experiment with two Bernoulli nodes and a three-dimensional Dirichlet node, using `foo/` as your experiment directory:
`vsmrf-gen foo --nodes b b d3`
This will generate a `foo` directory and all of the structure from step 0, using some default parameters for sparsity and sample size. You can see the full list of options with `vsmrf-gen --help`.
### 1b) Preparing your data (alternative to 1a)
If you actually have some data that you're trying to model, you need to get it into the right format. Assuming you know your data types, and your experiment directory is `foo`, you need to generate two files:
`foo/data/nodes.csv`: This is a single-line CSV file containing the data types of all your node-conditionals. Currently supported options:
- `b`: Bernoulli node
- `n`: Normal (Gaussian) node
- `g`: Gamma node
- `d#`: Dirichlet node, where `#` is replaced with the dimensionality of the Dirichlet, e.g. `d3` for a 3-parameter Dirichlet
- `ziX`: Zero-inflated or generic point-inflated node, where `X` is replaced by the inflated distribution. This is a recursive definition, so you can have multiple inflated points; e.g., `zizig` would be a two-point-inflated Gamma distribution.

`foo/data/sufficient_statistics.csv`: A CSV matrix of sufficient statistics for all of the node-conditionals. The first line is the column-to-node-ID mapping. For example, if you have a dataset of two Bernoulli nodes and a 3-dimensional Dirichlet, your header would look like `0,1,2,2,2`, since `node0` and `node1` are both Bernoulli (i.e., univariate sufficient statistics) and `node2` is your Dirichlet, with 3 sufficient statistics. Every subsequent row in the file then corresponds to one data sample.
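A sketch of writing both files for the running example (two Bernoulli nodes plus a 3-dimensional Dirichlet). The sample values are made up; note that for a Bernoulli node the sufficient statistic is the observed value itself, while for a Dirichlet it is typically the log of each observed component — check `exponential_families.py` for the exact statistics the package expects.

```python
import csv
import math

# nodes.csv: one line listing the node types
with open("nodes.csv", "w") as f:
    f.write("b,b,d3\n")

# sufficient_statistics.csv: header maps columns to node IDs,
# then one row of sufficient statistics per sample (values illustrative)
header = ["0", "1", "2", "2", "2"]
rows = [
    [1, 0, math.log(0.2), math.log(0.3), math.log(0.5)],
    [0, 1, math.log(0.1), math.log(0.1), math.log(0.8)],
]
with open("sufficient_statistics.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(header)
    w.writerows(rows)
```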
### 2) MLE via Pseudolikelihood
To learn a VSMRF, we make a pseudolikelihood approximation that effectively decouples all the nodes. This makes the problem convex and separable, enabling us to learn each node independently. If you have access to a cluster or distributed compute environment, this makes the process very fast since you can learn each node on a different machine, then stitch the whole graph back together in step 3.
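In symbols, the pseudolikelihood objective replaces the joint log-likelihood with a sum of node-conditional log-likelihoods over the `p` nodes and `n` samples, each node conditioned on the rest of the sample:

```latex
% Pseudolikelihood: each node s is conditioned on all other nodes x_{-s},
% so the objective decomposes into p independent per-node problems
\ell_{\mathrm{PL}}(\theta) = \sum_{i=1}^{n} \sum_{s=1}^{p}
  \log p\bigl(x_s^{(i)} \mid x_{-s}^{(i)};\, \theta_s\bigr)
```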
Say you want to learn the Dirichlet node from step 1, using a solution-path approach so you can avoid tuning the regularization hyperparameter:
`vsmrf-learn foo --target 2 --solution_path`
This will load the data from the `foo` experiment directory and learn the pseudo-edges for `node2`, which is our Dirichlet node. You can see the full list of options with `vsmrf-learn --help`.
### 3) Stitching the MRF together
Once all the nodes have been learned, the pseudo-edges need to be combined back together to form a single graph. Since we are learning approximate models, there will be times when the pseudo-edge from `nodeA` to `nodeB` exists but the one from `nodeB` to `nodeA` does not. In that case, we have to decide whether to include the edge in the final graph; that is, should we `OR` the edges together or `AND` them? The package creates both, but empirically performance seems slightly better with the `AND` graph.
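The two stitching rules can be sketched as follows. This is a hypothetical illustration, not the package's actual code; the `pseudo_edges` mapping of node IDs to learned neighbor sets is made up.

```python
# Each node's learned pseudo-edges, as a set of neighbor IDs (illustrative):
# node0 found pseudo-edges to node1 and node2, node1 found none, node2 found node0.
pseudo_edges = {0: {1, 2}, 1: set(), 2: {0}}

def stitch(pseudo_edges, rule="and"):
    """Combine directed pseudo-edges into undirected edges via AND or OR."""
    edges = set()
    for a, neighbors in pseudo_edges.items():
        for b in neighbors:
            forward = b in pseudo_edges.get(a, set())
            backward = a in pseudo_edges.get(b, set())
            keep = (forward and backward) if rule == "and" else (forward or backward)
            if keep:
                edges.add(frozenset((a, b)))  # undirected edge
    return edges

# AND keeps only mutually supported edges; OR keeps any one-sided pseudo-edge.
print(stitch(pseudo_edges, "and"))  # only the 0-2 edge is mutual
print(stitch(pseudo_edges, "or"))   # also includes the one-sided 0-1 edge
```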
Continuing our example, to stitch together our three-node MRF:
`vsmrf-stitch foo --nodes b b d3`
If you generated your data synthetically, so that you know the ground truth of the model, you can evaluate the resulting graph:
`vsmrf-stitch foo --nodes b b d3 --evaluate`
See `vsmrf-stitch --help` for a full list of options.
Reference
---------
```
@inproceedings{tansey:etal:2015,
  title={Vector-Space Markov Random Fields via Exponential Families},
  author={Tansey, Wesley and Madrid-Padilla, Oscar H. and Suggala, Arun and Ravikumar, Pradeep},
  booktitle={Proceedings of the 32nd International Conference on Machine Learning (ICML-15)},
  year={2015}
}
```