mister

Approachable map/reduce jobs

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 1 - Planning
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 2.7
- Python :: 3

Project description

For all your medium data needs!

Mister is for when you’ve got data that isn’t really big, so you’re not quite ready to distribute it across a gazillion machines, but you’d still like an answer in a reasonable amount of time.

Mister attempts to make running a map/reduce job approachable.
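For anyone new to the pattern, a map/reduce job comes down to three steps: split the input into segments, process each segment independently, then merge the partial results. Here is a rough sketch of that data flow. WordCountJob and run_in_order are hypothetical names made up for illustration; this is not mister's implementation, and it runs the segments one after another instead of in parallel. The hook names mirror the ones used in the example on this page.

```python
from collections import Counter


class WordCountJob(object):
    """Hypothetical stand-in with prepare/map/reduce hooks.

    It works on an in-memory list of lines so the sketch runs
    without mister installed or any test data on disk.
    """

    def prepare(self, count, lines):
        # Split the input into at most `count` roughly equal segments.
        step = max(1, -(-len(lines) // count))  # ceiling division
        for start in range(0, len(lines), step):
            yield (), {"lines": lines[start:start + step]}

    def map(self, lines):
        # Count the words in one segment.
        output = Counter()
        for line in lines:
            output.update(line.split())
        return output

    def reduce(self, output, count):
        # Fold one map() result into the running total.
        if not output:
            output = Counter()
        output.update(count)
        return output


def run_in_order(job, count, *args):
    """Drive prepare -> map -> reduce one segment at a time.

    Assumption: a parallel runner would hand each map() call to a
    separate worker; this sketch only shows the data flow.
    """
    output = None
    for map_args, map_kwargs in job.prepare(count, *args):
        result = job.map(*map_args, **map_kwargs)
        output = job.reduce(output, result)
    return output


lines = ["in the beginning", "the earth was without form", "and the light"]
totals = run_in_order(WordCountJob(), 2, lines)
print(totals.most_common(1))
```

Because each map() call only sees its own segment and reduce() only merges results, the map step is the part that can safely be spread across workers.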

Example

I think word counting is the traditional map/reduce example? So here it is:

import os
import re
import math
from collections import Counter

from mister import BaseMister


class MrWordCount(BaseMister):
    def prepare(self, count, path):
        """prepare segments the data for the map() method"""
        size = os.path.getsize(path)
        length = int(math.ceil(size / float(count)))  # float() avoids floor division on Python 2.7
        start = 0
        for x in range(count):
            kwargs = {}
            kwargs["path"] = path
            kwargs["start"] = start
            kwargs["length"] = length
            start += length
            yield (), kwargs

    def map(self, path, start, length):
        """all the magic happens right here"""
        output = Counter()
        with open(path) as fp:
            fp.seek(start, 0)
            words = fp.read(length)

        # this example doesn't compensate for words split across segment boundaries
        for word in re.split(r"\s+", words):
            output[word] += 1
        return output

    def reduce(self, output, count):
        """take all the return values from map() and aggregate them to the final value"""
        if not output:
            output = Counter()
        output.update(count)
        return output

# let's count the bible
path = "./testdata/bible-kjv.txt"
mr = MrWordCount(path)
wordcounts = mr.run()
print(wordcounts.most_common(10))
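
The map() method above reads a fixed byte range, so a word that straddles two segments ends up counted as two fragments. One way to compensate, sketched here under the assumption that words are separated by ASCII whitespace, is to let each segment own every word that starts inside it. read_segment is a hypothetical helper for illustration, not part of mister.

```python
import os
import tempfile


def read_segment(path, start, length):
    """Return the bytes of every word that starts inside the segment."""
    with open(path, "rb") as fp:
        # Look at the byte just before the segment to see whether the
        # segment begins in the middle of a word.
        if start > 0:
            fp.seek(start - 1)
            previous = fp.read(1)
        else:
            previous = b" "
        data = fp.read(length)

        if previous and not previous.isspace():
            # The leading partial word belongs to the previous segment.
            offset = 0
            while offset < len(data) and not data[offset:offset + 1].isspace():
                offset += 1
            if offset == len(data):
                return b""
            data = data[offset:]

        # Finish a word that runs past the end of the segment.
        while data and not data[-1:].isspace():
            extra = fp.read(1)
            if not extra:
                break
            data += extra
    return data


# Demo on a throwaway file: three 8-byte segments that cut through words.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"alpha beta gamma delta")
try:
    words = []
    for start in range(0, 24, 8):
        words.extend(read_segment(tmp.name, start, 8).split())
finally:
    os.remove(tmp.name)
print(words)
```

With a helper like this, map() could call read_segment(path, start, length) instead of seeking and reading directly, at the cost of a few extra bytes read per segment.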

On my computer, the asynchronous code above runs about 3x faster than its synchronous equivalent below:

import re
from collections import Counter

path = "./testdata/bible-kjv.txt"

output = Counter()
with open(path) as fp:
    words = fp.read()

for word in re.split(r"\s+", words):
    output[word] += 1

print(output.most_common(10))
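
The speedup is likely to vary with core count, file size, and disk speed, so it is worth measuring on your own machine. Here is a minimal timing sketch; best_time and count_words are hypothetical helpers, and the sample text is made up so the block runs without the test data.

```python
import re
import timeit
from collections import Counter


def best_time(func, repeat=3):
    """Return the fastest wall-clock time, in seconds, over `repeat` runs."""
    timings = []
    for _ in range(repeat):
        started = timeit.default_timer()
        func()
        timings.append(timeit.default_timer() - started)
    return min(timings)


def count_words(text):
    """Synchronous word count, same approach as the listing above."""
    output = Counter()
    for word in re.split(r"\s+", text):
        output[word] += 1
    return output


# Hypothetical sample input; swap in the contents of a real file.
text = "the quick brown fox jumps over the lazy dog " * 10000
elapsed = best_time(lambda: count_words(text))
print("synchronous: {:.4f}s".format(elapsed))
```

The parallel version can be timed the same way by passing a callable that builds MrWordCount(path) and calls run().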

Release history

0.0.2

Nov 29, 2018

This version

0.0.1

Nov 29, 2018

Download files

Source Distribution

mister-0.0.1.tar.gz (5.2 kB)

Uploaded Nov 29, 2018 Source

Hashes for mister-0.0.1.tar.gz

Algorithm	Hash digest
SHA256	`e97fdbac6bd8ea5e40ad1f4c752ad7898eed85c92d09a8756cf812cd9cfabf9f`
MD5	`94f97b2606fb50d1fb1e247255a09ee1`
BLAKE2b-256	`a526e7e4807b70581516d30046119940948792c1eb0dbe56fd129ac18ae4d36d`