Parallel Implementation of Boosting Algorithms with MPI.

## Project description

*******************************************************************************
PBOOST : Hybrid Parallel Implementation of Boosted Trees

Mehmet Basbug
PhD Candidate
Princeton University

February 2013
*******************************************************************************

A. QUICK INSTALL
-------------------------------------------------------------------------------
First make sure that you have OpenMPI and HDF5 installed. Then run the following command to install pboost with all the dependencies:

pip install pboost

OR

easy_install pboost

This will also create a directory named 'pboost' in your home folder.

This should work fine for many users; however, you may be missing some required libraries. On Debian/Ubuntu, run the following command first (add sudo if necessary):

apt-get install build-essential python-dev python-setuptools python-pip python-numpy python-scipy python-matplotlib openmpi-bin openmpi-doc libopenmpi-dev libhdf5-serial-dev

B. RUNNING DEMO
-------------------------------------------------------------------------------
To run the demo, execute the following command:

mpirun -np 2 python ~/pboost/run.py 0

You should see something similar to the following output:

Info : Confusion matrix for combined validation error with zero threshold :
{'FP': 108, 'TN': 392, 'FN': 115, 'TP': 153}
Info : Training Error of the final classifier : 0.0
Info : Validation Error of the final classifier : 0.29296875
2015-10-22 12:25:46.468247 Rank 0 Info : Reporter finished running.
2015-10-22 12:25:46.468293 Rank 0 Info : Total runtime : 25.71 seconds

A quick note: if you installed pboost with sudo privileges, you may need to change the ownership of the pboost folder and the files in it. On Linux, this is typically done with chown.

-------------------------------------------------------------------------------
The recommended way is to append your new configurations to the configurations.cfg file in the pboost folder. Alternatively, you can create an empty text file with the .cfg extension. A shorter explanation of the options can be found in the configurations.cfg file. For detailed explanations, see below:

# train_file = [HDF5 File]
# Raw file of examples and their attributes for training
# (Must be specified)
#
# test_file = [HDF5 File]
# Raw file of examples and their attributes for testing
# (Optional)

Data files (train_file and test_file) must be in HDF5 format. The raw data must be stored in a dataset named 'data' and the labels in a dataset named 'label'.

If a separate 'label' dataset is not given, the program will treat the last column of 'data' as the labels. At this point only binary labeling is supported; therefore, all the labels should be zero or one.

If there is no separate test dataset, please leave the test_file option blank.
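
As a rough illustration of the layout described above, the following sketch writes a training file with a 'data' and a 'label' dataset. It assumes that h5py and NumPy are installed. The helper names (validate_binary_labels, write_pboost_file), the float conversion and the one-dimensional label shape are choices made for this example only; they are not part of pboost, and the shapes pboost expects should be checked against the demo data.

```python
def validate_binary_labels(labels):
    """Return the labels as a list of ints, or raise if any label is not 0 or 1."""
    labels = list(labels)
    bad = [v for v in labels if v not in (0, 1)]
    if bad:
        raise ValueError("only binary labels (0 or 1) are supported, got: %r" % bad)
    return [int(v) for v in labels]


def write_pboost_file(path, data, labels):
    """Write 'data' and 'label' datasets to an HDF5 file (illustrative helper)."""
    # Third-party packages are imported here so that the validation helper
    # above can be used without them.
    import h5py
    import numpy as np

    data = np.asarray(data, dtype=float)
    labels = np.asarray(validate_binary_labels(labels))
    if data.ndim != 2 or data.shape[0] != labels.shape[0]:
        raise ValueError("data must be 2-D with one row per label")
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=data)     # one row per example
        f.create_dataset("label", data=labels)  # one 0/1 label per example
```

A call such as write_pboost_file("train.h5", [[5.1, 3.5], [4.9, 3.0]], [0, 1]) would then produce a file whose name can be given as train_file ("train.h5" is a made-up name).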

# factory_files = [.py File(s), separated by comma]
# File(s) containing the concrete implementations of
# the BaseFeatureFactory class
# (Leaving blank will include default implementation)

The factory_files option refers to the files containing user-defined feature factories. See the examples of Feature Factory classes in the pima_feature_factory.py file in the demo folder.

In order to use each column of your raw data as a feature, leave this option blank or use the keyword default.

More than one file can be specified; for instance, the following is perfectly okay:
factory_files = pima_feature_factory.py,my_feature_factory.py

Each user-defined Feature Factory class should inherit from BaseFeatureFactory and override the blueprint and produce methods. The blueprint method defines the functionality of the features. The produce method supplies the arguments that actually create the features; within produce, one should call the make method once for every single feature to be created. All Feature Factory files should be in the working directory.
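
The division of work between blueprint, produce and make can be sketched as follows. This is only an illustration of the pattern: StandInFeatureFactory is a hypothetical stand-in written for this example, not pboost's BaseFeatureFactory, and the real method signatures may differ. Consult pima_feature_factory.py in the demo folder for the actual interface.

```python
class StandInFeatureFactory(object):
    """Hypothetical stand-in for BaseFeatureFactory (NOT the real pboost class)."""

    def __init__(self):
        self.features = []  # one tuple of blueprint arguments per feature

    def make(self, *args):
        # Register one feature by remembering the arguments for blueprint.
        self.features.append(args)

    def blueprint(self, record, *args):
        raise NotImplementedError

    def produce(self):
        raise NotImplementedError


class ThresholdFactory(StandInFeatureFactory):
    """Example factory: one feature per (column, threshold) pair."""

    def blueprint(self, record, column, threshold):
        # Functionality of a feature: is this attribute above the threshold?
        return 1.0 if record[column] > threshold else 0.0

    def produce(self):
        # Call make once for every feature that should be created.
        for column in (0, 1):
            for threshold in (0.5, 1.5):
                self.make(column, threshold)


factory = ThresholdFactory()
factory.produce()
row = [1.0, 2.0]  # one raw example with two attributes
values = [factory.blueprint(row, *args) for args in factory.features]
print(values)
```

Here produce registers four features, and blueprint is evaluated once per registered feature for the given row.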

# working_dir = [string]
# Path to a shared disk location containing data files, factory files
# (Leaving blank will set it to the same directory as this file)

Working directory should be on a shared disk space. Data files and feature factory files should be put in that directory.

# algorithm = [string]
# Boosting algorithm to use.
# (Leaving blank will set this field to "conf-rated")

Currently only "adaboost", "adaboost-fast" and "conf-rated" are supported. The second one is recommended as it is likely to converge faster.

# rounds = [integer]
# Number of rounds that the boosting algorithm should iterate for
# (Leaving blank will set this number to 100)

The runtime of the program is directly proportional to the number of rounds. It is recommended not to give a very large number; the training error of boosting usually converges within a moderate number of rounds, and a large number might cause overfitting.

# xval_no = [integer]
# Number of cross-validation folds
# (Leaving blank will set it to 1, that is disabling x validation)

In order to have k-fold cross validation, set this number to k. Cross validation is useful when you do not have a separate testing dataset. In this case, validation error might be a good estimate of testing error. A large number should be avoided, as it will increase the runtime. For k-fold cross validation pboost will train k+1 algorithms, meaning that it will take k times longer compared to not using cross validation.
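
To make the cost of k-fold cross validation concrete, here is a minimal, generic sketch of how k folds can be formed and how the number of training runs adds up. How pboost itself assigns examples to folds is not described here, so the interleaved split and the function name kfold_splits are assumptions made for illustration only.

```python
def kfold_splits(n_examples, k):
    """Return k (train_indices, validation_indices) pairs over range(n_examples)."""
    if k < 2 or k > n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    splits = []
    for fold in range(k):
        validation = indices[fold::k]  # every k-th example, offset by fold
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        splits.append((train, validation))
    return splits


k = 5
splits = kfold_splits(10, k)
training_runs = k + 1  # as stated above, k+1 classifiers are trained
for train, validation in splits:
    print(len(train), "training examples,", len(validation), "validation examples")
print("training runs:", training_runs)
```

Every example is held out exactly once, so the validation errors of the k folds together cover the whole training set.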

# max_memory = [integer/float]
# Amount of available RAM to each core, in GigaBytes
# (Leaving blank will set this number to 2, that is 2GB per core)

Since MPI does not treat cores on the same node differently, this option should be specified on a per-core basis.

# show_plots = [y/n]
# Enable to show the plots before the program terminates
# (Leaving blank will set this field to y)

In both cases, the plots will be saved in the output directory.

# deduplication = [y/n]
# Enable deduplication for extracted features
# (Leaving blank will set this field to n)

# Number of OMP threads per MPI process. Only available for adaboost-fast
# (Leaving blank will set this field to 1)

The recommended parallelization is to set omp_thread to the number of cores in each node and to start one MPI process per node with the -bynode flag of mpirun.

# tree_depth = [integer]
# The height of the decision tree
# (Leaving blank will set this field to 1)

A fixed-height decision tree is used as the weak learner. A height of 1 corresponds to decision stumps. The type of boosting algorithm determines which objective function is minimized when a node is created.
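
For readers unfamiliar with decision stumps, the sketch below selects a single threshold test by exhaustive search. It minimizes the weighted 0/1 error, as in plain AdaBoost; this is a generic textbook illustration under that assumption, not pboost's implementation, and the objective used by the other algorithms may differ. The function name best_stump and the toy data are made up for the example.

```python
def best_stump(rows, labels, weights):
    """Find the stump with the lowest weighted error.

    Returns (error, column, threshold, polarity). The stump predicts 1 when
    polarity * (row[column] - threshold) > 0 and 0 otherwise.
    """
    best = None
    for column in range(len(rows[0])):
        for threshold in sorted(set(row[column] for row in rows)):
            for polarity in (1, -1):
                error = 0.0
                for row, label, weight in zip(rows, labels, weights):
                    predicted = 1 if polarity * (row[column] - threshold) > 0 else 0
                    if predicted != label:
                        error += weight  # misclassified examples cost their weight
                if best is None or error < best[0]:
                    best = (error, column, threshold, polarity)
    return best


rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]]
labels = [0, 0, 1, 1]
weights = [0.25] * 4  # uniform weights, as in the first boosting round
error, column, threshold, polarity = best_stump(rows, labels, weights)
print(error, column, threshold, polarity)
```

On this toy data the first column separates the two classes at a threshold of 2.0, so the selected stump has zero weighted error.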

All the necessary output data is always stored in an output folder and plots can be generated later on.

-------------------------------------------------------------------------------
run.py takes two arguments cp and cn
cn : Configuration numbers to process
cp (optional) : Path to the configuration file

A typical command to run pboost looks like

mpirun -np NUMBER_OF_MPI_THREADS python run.py -cp CONF_PATH CONF_NUM_1 CONF_NUM_2

The default setting for cp is '~/pboost/configurations.cfg'.

The output of the program will be in the folder out_{CONF_NUM} in the working directory. The output folder contains the following files:

feature factory files : All Feature Factory files needed to create the features
final_classifier.py : A script to classify any given raw data
hypotheses.npy : Data file containing hypotheses found by pboost
dump.npz : Data file containing predictions and other information about the job

One can generate plots by running plot.py with dump.npz as its input.
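
Before plotting, it can be handy to look at what dump.npz contains. The following sketch, assuming NumPy is installed, lists the arrays stored in an .npz archive together with their shapes. The array names inside dump.npz are not documented here, so the helper (summarize_npz, a name made up for this example) simply reports whatever it finds.

```python
import numpy as np


def summarize_npz(path):
    """Map each array name stored in an .npz archive to its shape."""
    # Note: arrays that were saved as pickled Python objects cannot be read
    # with NumPy's default settings (allow_pickle=False).
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}
```

For configuration number 1, for example, the call would be summarize_npz("out_1/dump.npz"), with the path taken relative to the working directory.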

E. SUBMITTING A PBOOST JOB VIA PBS
-------------------------------------------------------------------------------
The following script is an example of how to submit a pboost job using the PBS scheduler. The script allocates 4 nodes with 16 cores per node for 24 hours. Pboost will run experiments 1 to 100 (inclusive) as described in the configuration file experiments.cfg.

#!/bin/bash
#PBS -l nodes=4:ppn=16,walltime=24:00:00

cd /home/mbasbug/pboost
mpirun -np 4 -bynode python2.7 run.py -cp ./experiments.cfg 1-100

Notice that the number of OMP threads is set in the configuration file.
