Map/Reduce for Fabric
Because mapping and reducing isn’t supposed to be hard.
Reductio is a minimalistic map-reduce framework for Python. It runs on top of Fabric and setuptools, which you might already use to get your code onto other machines.
It has no database. It has no distributed filesystem. It uses no server other than sshd. Because of this, it has essentially no memory requirement!
Reductio is designed for disk-bound big data tasks, which is what most big data tasks turn out to be. If the pieces you need to map-reduce fit entirely in the RAM of your worker computers, you are paying a huge premium for that much RAM. And if they don’t fit, a system that tries to buffer things in RAM is going to be wasting all its effort. At some point, you won’t see your data again unless you write it to the damn disk, and that’s what is going to take most of the time.
I created Reductio out of absolute necessity, so I could start crunching my data. You might notice its documentation is practically nonexistent at the moment.

What Reductio does
------------------

Reductio extends Fabric (http://fabfile.org). It is meant to support the following approximate process:
reductio/example/fabfile.py is an example of how to count all the letter bigrams (pairs of adjacent letters) in the ubiquitous “words” file, and aggregate them into a table of frequencies.
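To give a feel for the shape of that task, here is a rough single-machine sketch of bigram counting in plain Python. It is not the code in reductio/example/fabfile.py and it does not use Reductio's API or Fabric at all; the function names (map_bigrams, write_mapped, reduce_counts) and the file names are made up for illustration. The map step writes its output to disk instead of holding it in RAM, and the reduce step streams those files back in.

```python
from collections import Counter
import os
import tempfile


def map_bigrams(words):
    """Map step: yield every pair of adjacent letters in each word."""
    for word in words:
        word = word.strip().lower()
        for i in range(len(word) - 1):
            pair = word[i:i + 2]
            # Skip pairs containing apostrophes, digits, and so on.
            if pair.isalpha():
                yield pair


def write_mapped(words, path):
    """Write one bigram per line to disk, so the mapped data never sits in RAM."""
    with open(path, "w", encoding="utf-8") as out:
        for pair in map_bigrams(words):
            out.write(pair + "\n")


def reduce_counts(paths):
    """Reduce step: stream the mapped files back and total each bigram.

    Only the table of totals is kept in memory, and it has one entry
    per distinct bigram, so it stays small however long the input is.
    """
    totals = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as mapped:
            for line in mapped:
                totals[line.rstrip("\n")] += 1
    return totals


if __name__ == "__main__":
    # Hypothetical chunks of a word list; in a real run each chunk
    # would be handled by a different worker machine.
    chunks = [["hello", "help"], ["shell", "it's"]]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for n, chunk in enumerate(chunks):
            path = os.path.join(tmp, "mapped-%d.txt" % n)
            write_mapped(chunk, path)
            paths.append(path)
        for pair, count in reduce_counts(paths).most_common(3):
            print(pair, count)
```

In a distributed run, the chunks would live on different machines and the mapped files would have to be moved to wherever the reducing happens. This sketch leaves all of that out.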
If you have a task defined as the function do_stuff in mymodule.py, but first you want to run setup to make sure code and other things are up to date, you would run them both with this command:
fab -f mymodule setup do_stuff
Face it, if you knew how to configure Hadoop and get your code to run with it, you’d be doing so right now.
Also, Hadoop is written for Java programmers, and Python is distinctly a second-class citizen in its world. Hadoop seems to think all Python code takes the form of standalone scripts with no dependencies, which perhaps says something about what Java programmers think of Python.
Reductio recognizes that none of this is going to work unless you have the right Python setup, so it builds on tools that Python programmers already use to deploy their code.
I approve of the Disco project (http://discoproject.org) and its goal of creating a map-reduce ecosystem designed around Python, but I find it too complex and too “magical” at the moment.
It makes it difficult to understand what is going on in its internals, and yet you have to understand its internals when something goes wrong or when you want to do something the designers didn’t expect.