# Pyrallel - Parallel Data Analytics in Python
Overview: experimental project to investigate distributed computation
patterns for machine learning and other semi-interactive data analytics.

- focus on small to medium datasets that fit in memory on a small
(10+ nodes) to medium cluster (100+ nodes).
- focus on small to medium data (with data locality when possible).
- focus on CPU bound tasks (e.g. training Random Forests) while trying to
limit disk / network access to a minimum.
- do not focus on HA / Fault Tolerance (yet).
- do not try to invent a new set of high level programming abstractions
(yet): use a low level programming model (IPython.parallel) to finely
control the cluster elements and the messages transferred, and to help
identify the practical underlying constraints in distributed machine
learning.

Disclaimer: the public API of this library is unlikely to be stable
any time soon, as the current goal of this project is to experiment.

The usual suspects: Python 2.7, NumPy, SciPy.

Fetch the development version (master branch) from:

The StarCluster develop branch and its IPCluster plugin are also required
to easily start up a set of nodes with IPython.parallel set up.

## Patterns currently under investigation
- Asynchronous and randomized hyper-parameter search (a.k.a. Randomized Grid
Search) for machine learning models.
- Share numerical arrays efficiently over the nodes and make them
available to concurrently running Python processes without making
copies in memory using memory-mapped files.
- Distributed Random Forests fitting.
- Ensembling heterogeneous library models.
- Parallel implementation of online averaged models using an MPI AllReduce, for
instance using MiniBatchKMeans on partitioned data.
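
The model averaging pattern from the last item can be illustrated with a small, single-process sketch. It is not this project's implementation and it does not use MPI: `allreduce_mean` is a hypothetical stand-in that averages the per-partition models in one process, and `local_update` is a simplified 1-D k-means step rather than scikit-learn's MiniBatchKMeans. All function names and the toy data are assumptions made for illustration, and the code targets a current Python 3 interpreter.

```python
def assign(point, centers):
    """Return the index of the center closest to a 1-D point."""
    return min(range(len(centers)), key=lambda k: abs(point - centers[k]))


def local_update(partition, centers):
    """One k-means step on a single partition, starting from shared centers."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in partition:
        k = assign(x, centers)
        sums[k] += x
        counts[k] += 1
    # Keep the previous center when no local point was assigned to it.
    return [sums[k] / counts[k] if counts[k] else centers[k]
            for k in range(len(centers))]


def allreduce_mean(local_models):
    """In-process stand-in for an AllReduce: element-wise mean of all models."""
    n = len(local_models)
    return [sum(values) / n for values in zip(*local_models)]


def fit_averaged(partitions, centers, n_rounds=5):
    """Alternate local updates and averaging for a fixed number of rounds."""
    for _ in range(n_rounds):
        # On a real cluster each partition would be processed by its own worker.
        local_models = [local_update(p, centers) for p in partitions]
        # Every worker would receive the same averaged model after this step.
        centers = allreduce_mean(local_models)
    return centers


# Toy data: two partitions, two well separated clusters.
partitions = [[0.0, 1.0, 10.0, 11.0], [0.5, 1.5, 9.5, 10.5]]
centers = fit_averaged(partitions, centers=[0.0, 10.0])
```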

See the contents of the examples/ folder for more details.

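
As a rough illustration of the asynchronous randomized search pattern listed above, the sketch below samples candidate hyper-parameters at random, evaluates them concurrently, and updates the best result as soon as each task completes, in whatever order tasks finish. It is an assumption-laden toy, not this project's code: a standard-library thread pool stands in for IPython.parallel engines, `evaluate` is a made-up score instead of a cross-validated model, and the search space, parameter names and function names are hypothetical. It targets a current Python 3 interpreter.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


def sample_params(rng):
    """Draw one random candidate from the (hypothetical) search space."""
    return {
        "max_depth": rng.randint(1, 10),
        "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [1e-3, 1]
    }


def evaluate(params):
    """Toy score standing in for a cross-validated model evaluation."""
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)


def randomized_search(n_candidates=20, n_workers=4, seed=0):
    """Evaluate random candidates concurrently and keep the best one."""
    rng = random.Random(seed)
    candidates = [sample_params(rng) for _ in range(n_candidates)]
    best_score, best_params = float("-inf"), None
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Map each pending task back to the parameters it is evaluating.
        futures = {pool.submit(evaluate, p): p for p in candidates}
        # Consume results as soon as each task finishes, in any order.
        for future in as_completed(futures):
            score = future.result()
            if score > best_score:
                best_score, best_params = score, futures[future]
    return best_params, best_score


best_params, best_score = randomized_search()
```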
This project started at the [PyCon 2012 PyData
as a set of proof of concept [IPython.parallel