Python MapReduce framework
mrjob is a Python 2.6+/3.3+ package that helps you write and run Hadoop Streaming jobs.
mrjob fully supports Amazon’s Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. mrjob also has basic support for Google Cloud Dataproc (Dataproc), which allows you to buy time on a Hadoop cluster on a minute-by-minute basis. It also works with your own Hadoop cluster.
Some important features:
To install from PyPI:
    pip install mrjob
To install from source:
    python setup.py install
Code for this example, and more, lives in mrjob/examples.
"""The classic MapReduce job: count the frequency of words. """ from mrjob.job import MRJob import re WORD_RE = re.compile(r"[\w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def combiner(self, word, counts): yield (word, sum(counts)) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run()
    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on Dataproc
    python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts
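Each of the commands above writes the job's output to a file named counts. As a rough sketch of how that file could be read back, the snippet below assumes the job uses mrjob's default output protocol, in which each line holds a JSON-encoded key, a tab, and a JSON-encoded value. The helper names `parse_counts_line` and `top_words` are hypothetical; a job that sets a different output protocol would need different parsing.

```python
import json


def parse_counts_line(line):
    """Hypothetical helper: split one 'key<TAB>value' line and decode both as JSON."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


def top_words(lines, n=3):
    """Return the n most frequent (word, count) pairs, ties broken alphabetically."""
    pairs = [parse_counts_line(line) for line in lines if line.strip()]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))[:n]


if __name__ == "__main__":
    # Sample lines in the assumed default output format.
    sample = ['"a"\t2\n', '"hadoop"\t5\n', '"mrjob"\t9\n']
    for word, count in top_words(sample):
        print(word, count)
```

To use it on real output, pass an open file instead of the sample list, for example `with open("counts") as f: top_words(f)`.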
To run in other AWS regions, upload your source tree, run make, and use other advanced mrjob features, you’ll need to set up mrjob.conf. mrjob looks for its conf file in:
See the mrjob.conf documentation for more information.
Download the file for your platform.
|File Name||Version||File Type||Upload Date|
|mrjob-0.5.10-py2.py3-none-any.whl (305.5 kB)||2.7||Wheel||May 12, 2017|
|mrjob-0.5.10.tar.gz (439.4 kB)||–||Source||May 12, 2017|