Skip to main content

A pure-python highly-distributed MapReduce cluster.

Project description

Having just finished reading the original Google MapReduce paper, I obviously felt the need to try to implement such a system in Python.

My goals are to implement enough of the functionality described in the paper to be usable, though I strongly warn against ever using this code for anything real.

Since one of the goals (see Goals, below) is simplicity from an end-user standpoint, I am following some of Kenneth Reitz’s advice and starting with a readme and documentation.

Examples

The canonical word-count example:

# myjob.py
from pluribus import job


@job.map_
def emit_words(key, value):
    # key: document name
    # value: document contents
    for word in value.split():
        yield word, 1


@job.reduce_
def sum_occurences(key, values):
    # key: a word
    # values: a list of counts
    return sum(values)

Assuming you’re running everything on one host, you can ignore the network connection information.

Start a pluribus master:

$ pluribus master

Start a pluribus worker (or several hundred):

$ pluribus worker

On the master or on another machine that can talk to the master:

$ pluribus job myjob
# ... wait
<results>

Goals

Explicit goals are:

  • Simple to use, both as an administrator and end-user.

  • Well-documented.

  • Robust to worker failure.

  • Fast-enough.

  • Use only the Python (2.7+) standard library (at least to run).

Explicit non-goals are:

  • Be a filesystem.

  • Robust to master failure.

Project details


Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page