Experimental tools for parallel machine learning
# Pyrallel - Parallel Data Analytics in Python
Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.
- focus on small to medium dataset that fits in memory on a small (10+ nodes) to medium cluster (100+ nodes).
- focus on small to medium data (with data locality when possible).
- focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.
- do not focus on HA / Fault Tolerance (yet).
- do not try to invent new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transfered and help identify what are the practical underlying constraints in distributed machine learning setting.
Disclaimer: the public API of this library will probably not be stable soon as the current goal of this project is to experiment.
The usual suspects: Python 2.7, NumPy, SciPy.
Fetch the development version (master branch) from:
StarCluster develop branch and its IPCluster plugin is also required to easily startup a bunch of nodes with IPython.parallel setup.
## Patterns currently under investigation
- Asynchronous & randomized hyper-parameters search (a.k.a. Randomized Grid Search) for machine learning models
- Share numerical arrays efficiently over the nodes and make them available to concurrently running Python processes without making copies in memory using memory-mapped files.
- Distributed Random Forests fitting.
- Ensembling heterogeneous library models.
- Parallel implementation of online averaged models using a MPI AllReduce, for instance using MiniBatchKMeans on partitioned data.
See the content of the examples/ folder for more details.