Python library for Hadoop Streaming with support of protobuf sequences
ProtoSeq
ProtoSeq is a Python library for working with sequences of protobuf messages of a specified type. A sequence of protobuf messages is stored as a series of pairs:
- size of message in bytes – 4 bytes (int);
- protobuf message bytes.
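The layout above can be sketched in plain Python without protobuf, treating every message as opaque bytes. This is only an illustration, not the library's implementation: the helper names write_frame and read_frames are hypothetical, and the byte order and signedness of the size prefix are assumptions (big-endian, unsigned), since the description only says "4 bytes (int)".

```python
import io
import struct

# Assumption: the size prefix is a big-endian unsigned 32-bit integer.
SIZE = struct.Struct(">I")


def write_frame(stream, payload: bytes) -> None:
    """Append one (size, message bytes) pair to a binary stream."""
    stream.write(SIZE.pack(len(payload)))
    stream.write(payload)


def read_frames(stream):
    """Yield message bytes from a binary stream until it is exhausted."""
    while True:
        header = stream.read(SIZE.size)
        if not header:
            return  # clean end of stream
        if len(header) < SIZE.size:
            raise EOFError("truncated size prefix")
        (size,) = SIZE.unpack(header)
        payload = stream.read(size)
        if len(payload) < size:
            raise EOFError("truncated message")
        yield payload


buf = io.BytesIO()
for message in (b"abc", b""):
    write_frame(buf, message)
buf.seek(0)
frames = list(read_frames(buf))
```

In a real sequence each payload would be the result of SerializeToString() on a message, and the reader would pass it to ParseFromString() on a fresh instance of the message class.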
This sequence format is a flexible storage format similar to Hadoop SequenceFile. If an extra index is provided, a file can be split and processed in parallel (e.g. with Hadoop).
This repository is an example of how to work with binary data using Hadoop Streaming.
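To illustrate why an extra index makes parallel processing possible, here is a minimal sketch of one possible index: a list of byte offsets at which records start. The index format used by ProtoSeq is not described here, so build_index and read_from are hypothetical helpers, and the big-endian unsigned size prefix is again an assumption.

```python
import io
import struct

# Assumption: the size prefix is a big-endian unsigned 32-bit integer.
SIZE = struct.Struct(">I")


def build_index(stream, every: int = 1):
    """Return the byte offsets of every `every`-th record in a seekable stream."""
    offsets = []
    record_no = 0
    while True:
        position = stream.tell()
        header = stream.read(SIZE.size)
        if len(header) < SIZE.size:
            break  # end of stream
        if record_no % every == 0:
            offsets.append(position)
        (size,) = SIZE.unpack(header)
        stream.seek(size, io.SEEK_CUR)  # skip the message body
        record_no += 1
    return offsets


def read_from(stream, offset: int, count: int):
    """Read up to `count` raw messages starting at a known record offset."""
    stream.seek(offset)
    messages = []
    for _ in range(count):
        header = stream.read(SIZE.size)
        if len(header) < SIZE.size:
            break
        (size,) = SIZE.unpack(header)
        messages.append(stream.read(size))
    return messages


data = io.BytesIO()
for payload in (b"a", b"bb", b"ccc", b"dddd"):
    data.write(SIZE.pack(len(payload)) + payload)
data.seek(0)
index = build_index(data, every=2)   # offsets of records 0 and 2
second_half = read_from(data, index[1], 2)
```

Because each offset points at a size prefix, independent workers can start reading at different offsets without scanning the file from the beginning.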
Quick Start
Install the package with pip:
pip install protoseq
This is an example program that reads a file in protoseq format, saves it to a temporary file and prints the protobuf messages in human-readable format.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
import address_pb2
from tempfile import TemporaryFile
from protoseq.reader import ProtobufSequenceReader
from protoseq.writer import ProtobufSequenceWriter

with TemporaryFile(mode='wb+') as f_out:
    reader = ProtobufSequenceReader(address_pb2.Address, sys.stdin.buffer)
    writer = ProtobufSequenceWriter(f_out)
    for record in reader:
        writer.write(record)
    f_out.seek(0)
    reader = ProtobufSequenceReader(address_pb2.Address, f_out)
    for record in reader:
        print(record)
Here record is an instance of address_pb2.Address.
This program needs an address_pb2.py file, which contains the generated protobuf sources for Python. address_pb2.py can be replaced with the module generated from your own protobuf definition.
Hadoop Streaming Example
Here is an example of a map-only MapReduce program that copies a file in HDFS.
The MR job needs the following dependencies:
$ tree mapreduce
mapreduce
├── hadoop-streaming-protoseq.jar
├── streaming
│   ├── address_pb2.py
│   └── mapper.py
└── streaming-env-py37.tar.gz
1 directory, 5 files
You can get all these files by running make all inside the example directory:
- hadoop-streaming-protoseq.jar – ProtoSeq library for Hadoop Streaming;
- streaming-env-py37.tar.gz – environment with python3 and the ProtoSeq package installed;
- streaming/mapper.py – mapper stage for the job;
- streaming/address_pb2.py – generated protobuf sources for Python.
You need conda and conda-pack installed to prepare streaming-env-py37.tar.gz for streaming.
To run the MR program, execute the following command:
${HADOOP} jar ${HADOOP_STREAMING} \
-D mapred.job.name="Example: Copy proto file" \
-D mapred.reduce.tasks=0 \
-D stream.map.input='rawbytes' \
-D stream.map.input.writer.class='org.apache.hadoop.streaming.io.RawBytesInputWriter' \
-D stream.map.output='rawbytes' \
-D stream.map.output.reader.class='org.apache.hadoop.streaming.io.RawBytesOutputReader' \
-files "streaming/mapper.py" \
-libjars "hadoop-streaming-protoseq.jar" \
-archives "streaming-env-py37.tar.gz#env" \
-inputformat "com.github.vbugaevskii.hadoop.streaming.protobuf.ProtobufSequenceInputFormat" \
-outputformat "com.github.vbugaevskii.hadoop.streaming.protobuf.ProtobufSequenceOutputFormat" \
-mapper "env/bin/python streaming/mapper.py" \
-input "/tmp/v.bugaevskii/addresses.protoseq" \
-output "/tmp/v.bugaevskii/protoseq_copy"
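The contents of streaming/mapper.py are not shown in this description. A map-only copy could be sketched as follows, assuming that the reader and writer classes from Quick Start can be pointed at the standard streams of the streaming task as they are. The copy_records helper and the main function are hypothetical names introduced for this sketch; only the ProtobufSequenceReader and ProtobufSequenceWriter calls are taken from the example above.

```python
#!/usr/bin/env python3
import sys


def copy_records(reader, writer):
    """Write every record produced by reader to writer and return the count."""
    count = 0
    for record in reader:
        writer.write(record)
        count += 1
    return count


def main():
    # Imported here so that copy_records can be tried out without
    # the job's dependencies being installed.
    import address_pb2
    from protoseq.reader import ProtobufSequenceReader
    from protoseq.writer import ProtobufSequenceWriter

    # Assumption: the mapper receives records on stdin and emits them on stdout.
    reader = ProtobufSequenceReader(address_pb2.Address, sys.stdin.buffer)
    writer = ProtobufSequenceWriter(sys.stdout.buffer)
    copy_records(reader, writer)
    sys.stdout.buffer.flush()
```

A real mapper.py would end by calling main(). A job that transforms the data instead of copying it would modify each record inside the loop before writing it.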