Map/Reduce for Fabric
Because mapping and reducing isn’t supposed to be hard.
What is reductio?
Reductio is a minimalistic map-reduce framework for Python. It runs on top of Fabric and setuptools, which you might already use to get your code onto other machines.
It has no database. It has no distributed filesystem. It uses no server other than sshd. Because of this, it has essentially no memory requirement!
Reductio is designed for disk-bound big data tasks, which is what many big data tasks are. If the pieces you need to map-reduce fit entirely in the RAM of your worker computers, you are paying a huge premium for that RAM. And if they don’t, a system that tries to buffer things in RAM is wasting its effort. At some point, you won’t see your data again unless you write it to the damn disk, and that’s what is going to take most of the time.
I created reductio out of absolute necessity, so I could start crunching my data. You might notice its documentation is practically nonexistent at the moment.
What Reductio does
Reductio extends Fabric (http://fabfile.org). It is meant to support the following approximate process:
- Set up the appropriate Python environment on all your worker machines, including required packages, a place to keep the data, and an up-to-date version of the code you want to run.
- Tell each worker machine how to contact all the other machines, so they can send their data onward (“scatter”) when they’re done with it.
- Give each worker machine some Python functions to run over all the data. These will often be “maps” or “reduces”.
- Collect results and group the things that belong together using Unix’s extremely optimized sort command.
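The "scatter" step in the list above can be pictured with a small sketch. This is not Reductio's actual code: the function names `worker_for` and `scatter`, the host names, and the CRC32-based partitioning are all assumptions made for illustration. The only idea it tries to show is that every record with the same key should land on the same worker, so that a later sort can put those records next to each other.

```python
import zlib
from collections import defaultdict


def worker_for(key, workers):
    """Pick a worker for a key using a hash that is stable across machines."""
    # zlib.crc32 returns the same value on every host, unlike the built-in
    # hash(), which is randomized per process for strings.
    return workers[zlib.crc32(key.encode("utf-8")) % len(workers)]


def scatter(records, workers):
    """Group (key, value) records by the worker that should receive them."""
    outgoing = defaultdict(list)
    for key, value in records:
        outgoing[worker_for(key, workers)].append((key, value))
    return dict(outgoing)


workers = ["worker1", "worker2", "worker3"]  # hypothetical host names
records = [("th", 1), ("he", 1), ("th", 1), ("an", 1)]
batches = scatter(records, workers)
```

In this sketch, both `("th", 1)` records end up in the same batch no matter which machine computes the assignment, which is the property the grouping step relies on.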
reductio/example/fabfile.py is an example of how to count all the letter bigrams (pairs of adjacent letters) in the ubiquitous “words” file, and aggregate them into a table of frequencies.
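To make the bigram task concrete, here is a minimal single-machine sketch of the map, sort, and reduce steps. It does not reproduce reductio/example/fabfile.py and uses no Reductio API; the names `map_bigrams` and `reduce_counts` and the two sample words are assumptions for illustration, and Python's `sorted()` stands in for the Unix sort command.

```python
from itertools import groupby


def map_bigrams(word):
    """Yield (bigram, 1) for each pair of adjacent letters in a word."""
    word = word.strip().lower()
    for i in range(len(word) - 1):
        pair = word[i:i + 2]
        if pair.isalpha():  # skip pairs containing apostrophes, hyphens, etc.
            yield pair, 1


def reduce_counts(sorted_pairs):
    """Sum the counts of adjacent records that share a key."""
    # groupby only merges neighbouring items, so the input must be sorted.
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


words = ["banana", "bandana"]  # stand-in for lines of the "words" file
mapped = [pair for word in words for pair in map_bigrams(word)]
# sorted() stands in for the Unix sort step that groups equal keys together.
table = dict(reduce_counts(sorted(mapped)))
```

The reduce step never needs the whole data set in memory: because the input is sorted, it only has to hold one key's running total at a time.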
If you have a task defined as the function do_stuff in mymodule.py, but first you want to run setup to make sure code and other things are up to date, you would run them both with this command:
fab -f mymodule setup do_stuff
Why not Hadoop?
Face it, if you knew how to configure Hadoop and get your code to run with it, you’d be doing so right now.
Also, Hadoop is written for Java programmers, and Python is distinctly a second-class citizen in its world. Hadoop seems to think all Python code takes the form of standalone scripts with no dependencies, which perhaps says something about what Java programmers think of Python.
Reductio recognizes that none of this is going to work unless you have the right Python setup, so it builds on tools that Python programmers already use to deploy their code.
Why not Disco?
I approve of the Disco project (http://discoproject.org) and its goal of creating a map-reduce ecosystem designed around Python, but I find it too complex and too “magical” at the moment.
It makes it difficult to understand what is going on in its internals, and yet you have to understand its internals when something goes wrong or when you want to do something the designers didn’t expect.