Python library for Hadoop Streaming with support for protobuf sequences

ProtoSeq

ProtoSeq is a Python library for working with sequences of protobuf messages of a specified type. Each message in a sequence is stored as a pair:

  • size of the message in bytes – 4 bytes (int);
  • protobuf message bytes.
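
The layout above can be sketched in plain Python without the library. This is a minimal illustration, not ProtoSeq's implementation: the function names are hypothetical, and the byte order of the size field is an assumption (big-endian, as written by Java's DataOutputStream), since the description only says "4 bytes (int)".

```python
import io
import struct

# Assumption: the size prefix is a big-endian signed 32-bit int.
SIZE_FORMAT = ">i"
SIZE_BYTES = struct.calcsize(SIZE_FORMAT)  # 4


def write_record(stream, payload: bytes) -> None:
    """Write one (size, message bytes) pair to a binary stream."""
    stream.write(struct.pack(SIZE_FORMAT, len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield message payloads until the stream is exhausted."""
    while True:
        header = stream.read(SIZE_BYTES)
        if not header:
            return  # clean end of stream
        if len(header) < SIZE_BYTES:
            raise EOFError("truncated size prefix")
        (size,) = struct.unpack(SIZE_FORMAT, header)
        if size < 0:
            raise ValueError("negative record size")
        payload = stream.read(size)
        if len(payload) < size:
            raise EOFError("truncated message body")
        yield payload


# Round trip through an in-memory buffer; the payloads stand in for
# serialized protobuf messages.
buffer = io.BytesIO()
for payload in (b"first", b"", b"third message"):
    write_record(buffer, payload)

buffer.seek(0)
decoded = list(read_records(buffer))
```

With real messages, the payload would come from message.SerializeToString() and be decoded with the message class's FromString() method.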

This sequence format is a flexible storage format, similar to Hadoop SequenceFile, that allows files to be processed in parallel (e.g. with Hadoop) if an additional index is provided.
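
A plain length-prefixed stream has no markers that let a reader start at an arbitrary byte, which is presumably why an index is needed for parallel processing. The sketch below shows one possible index, a list of record start offsets. The index layout, the function names, and the big-endian size prefix are assumptions made for illustration, not the format ProtoSeq uses.

```python
import io
import struct

SIZE_FORMAT = ">i"  # assumption: big-endian 4-byte size prefix
SIZE_BYTES = struct.calcsize(SIZE_FORMAT)


def build_offset_index(stream):
    """Return the byte offset at which each record starts."""
    offsets = []
    position = 0
    stream.seek(0)
    while True:
        header = stream.read(SIZE_BYTES)
        if len(header) < SIZE_BYTES:
            break  # end of stream (a truncated prefix is ignored here)
        (size,) = struct.unpack(SIZE_FORMAT, header)
        offsets.append(position)
        position += SIZE_BYTES + size
        stream.seek(position)  # skip the body without reading it
    return offsets


def read_split(stream, offsets, start, stop):
    """Read records offsets[start:stop], as a single worker would."""
    records = []
    for offset in offsets[start:stop]:
        stream.seek(offset)
        (size,) = struct.unpack(SIZE_FORMAT, stream.read(SIZE_BYTES))
        records.append(stream.read(size))
    return records


# Four small records standing in for serialized protobuf messages.
data = io.BytesIO()
for payload in (b"a", b"bb", b"ccc", b"dddd"):
    data.write(struct.pack(SIZE_FORMAT, len(payload)) + payload)

offsets = build_offset_index(data)
first_half = read_split(data, offsets, 0, 2)
second_half = read_split(data, offsets, 2, 4)
```

Each worker only needs the shared offset list and its own slice of it, so the two halves can be read independently.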

This repository is an example of how to work with binary data using Hadoop Streaming.

Quick Start

Install the package with pip: pip install hadoop-protoseq.

This example program reads a file in protoseq format from standard input, saves it to a temporary file, and prints the protobuf messages in human-readable format.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys
import address_pb2

from tempfile import TemporaryFile

from protoseq.reader import ProtobufSequenceReader
from protoseq.writer import ProtobufSequenceWriter

with TemporaryFile(mode='wb+') as f_out:
    # Read Address messages from stdin and copy them to the temporary file.
    reader = ProtobufSequenceReader(address_pb2.Address, sys.stdin.buffer)
    writer = ProtobufSequenceWriter(f_out)

    for record in reader:
        writer.write(record)

    # Rewind the temporary file and read the messages back.
    f_out.seek(0)
    reader = ProtobufSequenceReader(address_pb2.Address, f_out)

    for record in reader:
        print(record)  # protobuf text format

Here record is an instance of address_pb2.Address.

This program needs an address_pb2.py file, which contains the Python sources generated from the protobuf definition. You can replace address_pb2.py with the generated module for your own protobuf messages.

Hadoop Streaming Example

Here is an example of a map-only MapReduce program that copies a file in HDFS.

The MR job needs the following dependencies:

$ tree mapreduce
mapreduce
├── hadoop-streaming-protoseq.jar
├── streaming
│   ├── address_pb2.py
│   └── mapper.py
└── streaming-env-py37.tar.gz

1 directory, 4 files

You can get all of these files by running the make all command inside the example directory:

  • hadoop-streaming-protoseq.jar – ProtoSeq library for Hadoop Streaming;
  • streaming-env-py37.tar.gz – environment with Python 3 and the ProtoSeq package installed;
  • streaming/mapper.py – mapper stage for job;
  • streaming/address_pb2.py – generated protobuf sources for python.

You need conda and conda-pack installed to prepare streaming-env-py37.tar.gz for streaming.

To run the MR program, execute the following command:

${HADOOP} jar ${HADOOP_STREAMING} \
    -D mapred.job.name="Example: Copy proto file" \
    -D mapred.reduce.tasks=0 \
    -D stream.map.input='rawbytes' \
    -D stream.map.input.writer.class='org.apache.hadoop.streaming.io.RawBytesInputWriter' \
    -D stream.map.output='rawbytes' \
    -D stream.map.output.reader.class='org.apache.hadoop.streaming.io.RawBytesOutputReader' \
    -files "streaming/mapper.py" \
    -libjars "hadoop-streaming-protoseq.jar" \
    -archives "streaming-env-py37.tar.gz#env" \
    -inputformat  "com.github.vbugaevskii.hadoop.streaming.protobuf.ProtobufSequenceInputFormat" \
    -outputformat "com.github.vbugaevskii.hadoop.streaming.protobuf.ProtobufSequenceOutputFormat" \
    -mapper "env/bin/python streaming/mapper.py" \
    -input  "/tmp/v.bugaevskii/addresses.protoseq" \
    -output "/tmp/v.bugaevskii/protoseq_copy"
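
The contents of streaming/mapper.py are not shown here. As a rough sketch of what a map-only copy step does, the code below passes length-prefixed frames from an input stream to an output stream unchanged. It rests on assumptions: with rawbytes I/O, Hadoop Streaming frames each key and value as a 4-byte big-endian length followed by the bytes, and copy_frames is a hypothetical helper rather than part of ProtoSeq. How keys and values map to protobuf messages is decided by the input and output format classes and is not covered by this sketch; the real mapper presumably uses ProtobufSequenceReader and ProtobufSequenceWriter as in the Quick Start.

```python
import io
import struct
import sys

SIZE_FORMAT = ">i"  # assumption: 4-byte big-endian length prefix
SIZE_BYTES = struct.calcsize(SIZE_FORMAT)


def copy_frames(src, dst) -> int:
    """Copy length-prefixed frames from src to dst; return the frame count."""
    copied = 0
    while True:
        header = src.read(SIZE_BYTES)
        if not header:
            break  # clean end of input
        if len(header) < SIZE_BYTES:
            raise EOFError("truncated size prefix")
        (size,) = struct.unpack(SIZE_FORMAT, header)
        body = src.read(size)
        if len(body) < size:
            raise EOFError("truncated frame body")
        dst.write(header)
        dst.write(body)
        copied += 1
    dst.flush()
    return copied


def main() -> None:
    # Hadoop Streaming feeds the mapper on stdin and collects its stdout.
    # Call main() when this file is deployed as the mapper script.
    copy_frames(sys.stdin.buffer, sys.stdout.buffer)


# Local check with in-memory streams instead of a cluster:
# one 3-byte frame followed by one empty frame.
source = io.BytesIO(struct.pack(SIZE_FORMAT, 3) + b"abc" + struct.pack(SIZE_FORMAT, 0))
sink = io.BytesIO()
frame_count = copy_frames(source, sink)
```

Testing the loop against in-memory streams like this is a cheap way to check a mapper before submitting the job.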
