
*******************************************************************************
PBOOST : Parallel Implementation of Boosting Algorithms with MPI

Mehmet Basbug
PhD Candidate
Princeton University

February 2013
*******************************************************************************

A. QUICK INSTALL
-------------------------------------------------------------------------------
First, make sure that you have OpenMPI and HDF5 installed. Then run one of the following commands to install pboost together with all of its dependencies:

pip install pboost

OR

easy_install pboost

This will also create a directory named 'pboost' in your home folder.

This should work fine for many users; however, you may be missing some required libraries. On Debian/Ubuntu, run the following command first (add sudo if necessary):

apt-get install build-essential python-dev python-setuptools python-pip python-numpy python-scipy python-matplotlib openmpi-bin openmpi-doc libopenmpi-dev libhdf5-serial-dev


B. RUNNING DEMO
-------------------------------------------------------------------------------
To run the demo, execute the following command

mpirun -np 2 python ~/pboost/run.py 0

You should see something similar to the following output

Info : Confusion matrix for combined validation error with zero threshold :
{'FP': 11, 'TN': 89, 'FN': 10, 'TP': 90}
Info : Confusion matrix for testing error with zero threshold :
{'FP': 11, 'TN': 89, 'FN': 12, 'TP': 88}
Info : Training Error of the final classifier : 0.0
Info : Validation Error of the final classifier : 0.105
Info : Testing Error of the final classifier : 0.115
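
The reported error rates are consistent with the confusion matrices: each equals the fraction of misclassified examples. For instance, with the testing confusion matrix above:

# Testing error = (FP + FN) / total = (11 + 12) / 200 = 0.115
test_cm = {'FP': 11, 'TN': 89, 'FN': 12, 'TP': 88}
test_error = (test_cm['FP'] + test_cm['FN']) / float(sum(test_cm.values()))
print(test_error)  # prints 0.115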

A quick note: if you installed pboost with sudo privileges, you may need to change the ownership of the pboost folder and the files in it. On Linux, run the following command:

sudo chown -R {YOUR_USERNAME} ~/pboost


C. CREATING YOUR CONFIGURATION FILE
-------------------------------------------------------------------------------
The recommended way is to append your new configurations to the configurations.cfg file in the pboost folder. Alternatively, you can create an empty text file with the .cfg extension. A short explanation of each option can be found in the configurations.cfg file. For detailed explanations, see below:

# train_file = [HDF5 File]
# Raw file of examples and their attributes for training
# (Must be specified)
#
# test_file = [HDF5 File]
# Raw file of examples and their attributes for testing
# (Optional)

Data files (train_file and test_file) should be in HDF5 format; the raw data dataset must be named 'data' and the label dataset must be named 'label'. Alternatively, the last column of the 'data' dataset can be used as the label.

If a separate 'label' dataset is not given, the program will treat the last column of 'data' as the labels. At this point, only binary labeling is supported; therefore, all labels should be zero or one.

If there is no separate test dataset, leave the test_file option blank.
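
As a minimal sketch of preparing a training file with h5py (the shapes and values here are made up; only the dataset names 'data' and 'label' are required by pboost):

import h5py
import numpy as np

# Illustrative data: 200 examples with 8 attributes each and binary labels.
data = np.random.rand(200, 8)
label = np.random.randint(0, 2, size=200)

# The dataset names must be 'data' and 'label'; alternatively, append the
# labels as the last column of 'data' and omit the 'label' dataset.
with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=data)
    f.create_dataset('label', data=label)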

# factory_files = [.py File(s), separated by comma]
# File(s) containing the concrete implementations of
# the BaseFeatureFactory class
# (Leaving blank will include default implementation)

The factory_files option refers to the files containing user-defined feature factories. See the example Feature Factory classes in the diabetes_feature_factory.py file in the demo folder.

In order to use each column of your raw data as a feature, leave this option blank or use the keyword default.

More than one option can be specified; for instance, the following is perfectly okay:

factory_files = diabetes_feature_factory.py,default

Each user-defined Feature Factory class should inherit from BaseFeatureFactory and override the blueprint and produce methods. The blueprint method defines the functionality of a feature. The produce method supplies the arguments used to actually create features; within produce, the make method should be called once for every feature to be created. All Feature Factory files should be in the working directory.
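
For illustration only, a user-defined factory might look like the sketch below; the import path and the exact signatures of blueprint, produce, and make are assumptions, so check diabetes_feature_factory.py in the demo folder for the real API:

# Hypothetical sketch of a Feature Factory; signatures may differ from
# the actual BaseFeatureFactory API.
from pboost.feature.factory import BaseFeatureFactory  # import path is a guess


class PairwiseDiffFactory(BaseFeatureFactory):

    def blueprint(self, data, i, j):
        # Functionality of the feature: the difference of two attributes.
        return data[:, i] - data[:, j]

    def produce(self):
        # Call make once for every feature, passing the arguments that
        # blueprint expects.
        dim = 8  # assumed number of attributes in the raw data
        for i in range(dim):
            for j in range(i + 1, dim):
                self.make(i, j)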

# working_dir = [string]
# Path to a shared disk location containing data files, factory files
# (Leaving blank will set it to the same directory as this file)

The working directory should be on shared disk space. Data files and feature factory files should be placed in that directory.

# algorithm = [string]
# Boosting algorithm to use.
# Current options are "conf-rated" and "adaboost"
# (Leaving blank will set this field to "conf-rated")

Currently, only AdaBoost ("adaboost") and confidence-rated boosting ("conf-rated") are supported. The latter is recommended, as it is likely to converge faster.

# rounds = [integer]
# Number of rounds that the boosting algorithm should iterate for
# (Leaving blank will set this number to 100)

The runtime of the program is directly proportional to the number of rounds. It is recommended not to use a very large number; the training error of boosting usually converges within a moderate number of rounds, and a very large number might cause overfitting.

# xval_no = [integer]
# Number of cross-validation folds
# (Leaving blank will set it to 1, that is disabling x validation)

To have k-fold cross-validation, set this number to k. Cross-validation is useful when you do not have a separate testing dataset; in this case, the validation error may be a good estimate of the testing error. A large number should be avoided, as it will increase the runtime: for k-fold cross-validation, pboost trains k+1 classifiers, so the job will take roughly k times longer than without cross-validation.

# max_memory = [integer/float]
# Amount of available RAM to each core, in GigaBytes
# (Leaving blank will set this number to 2, that is 2GB per core)

Since MPI does not treat cores on the same node differently, this option should be specified on a per-core basis.

# show_plots = [y/n]
# Enable to show the plots before the program terminates
# (Leaving blank will set this field to y)


All the necessary data is always stored in the output folder, and plots can be generated later on.
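
Putting the options together, a configuration entry might look like the sketch below. The numbered section header is an assumption based on the configuration numbers passed to run.py, and the file names are placeholders; compare with the entries already present in configurations.cfg:

[1]
train_file = train.h5
test_file = test.h5
factory_files = diabetes_feature_factory.py,default
working_dir = /path/to/shared/workdir
algorithm = conf-rated
rounds = 100
xval_no = 5
max_memory = 2
show_plots = y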

D. RUNNING YOUR PROGRAM
-------------------------------------------------------------------------------
run.py takes two arguments, cp and cn:
cn : Configuration numbers to process
cp (optional) : Path to the configuration file

A typical command to run pboost looks like

mpirun -np NUMBER_OF_PROCESSORS python run.py -cp CONF_PATH CONF_NUM_1 CONF_NUM_2

The default setting for cp is '~/pboost/configurations.cfg'
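
For example, to process configurations 1 and 2 from the default configuration file on four processors (the numbers here are placeholders):

mpirun -np 4 python ~/pboost/run.py -cp ~/pboost/configurations.cfg 1 2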

The output of the program will be in the folder out_{CONF_NUM} in the working directory. The output folder contains the following files:

feature factory files : All Feature Factory files needed to recreate the features
final_classifier.py : A script to classify any given raw data
hypotheses.npy : Data file containing hypotheses found by pboost
dump.npz : Data file containing predictions and other information about the job

Plots can be generated by running plot.py and giving dump.npz as input to that script.
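
As a quick sketch for inspecting the results afterwards (the path below assumes the demo's configuration number 0; the keys stored in the archive are not listed here, so print them first):

import numpy as np

dump = np.load('out_0/dump.npz')  # adjust the path to your working directory
print(dump.files)  # names of the arrays stored by pboost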
