Skip to main content

a pipeline framework for streaming processing

Project description

https://badge.fury.io/py/tanbih-pipeline.svg Documentation Status

a flexible stream processing framework supporting RabbitMQ, Pulsar, Kafka and Redis.

Features

  • at-least-once guaranteed with acknowledgement on every message

  • horizontally scalable through consumer groups

  • flow is controlled in deployment, develop it once, use it everywhere

  • testability provided with FILE and MEMORY input/output

Installation

$ pip install tanbih-pipeline

You can install the required backend dependencies with:

$ pip install tanbih-pipeline[redis]
$ pip install tanbih-pipeline[kafka]
$ pip install tanbih-pipeline[pulsar]
$ pip install tanbih-pipeline[rabbitmq]
$ pip install tanbih-pipeline[azure]

If you want to support all backends, you can:

$ pip install tanbih-pipeline[full]

Generator

Generator is to be used when developing a data source in our pipeline. A source will produce output without input. A crawler can be seen as a generator.

>>> from pipeline import Generator, Message
>>>
>>> class MyGenerator(Generator):
...     def generate(self):
...         for i in range(10):
...             yield {'id': i}
>>>
>>> generator = MyGenerator('generator', '0.1.0', description='simple generator')
>>> generator.parse_args("--kind MEM --out-topic test".split())
>>> generator.start()
>>> [r.get('id') for r in generator.destination.results]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Processor

Processor is to be used to process input. Modification will be in-place. A processor can produce one output for each input, or no output.

>>> from pipeline import Processor, Message
>>>
>>> class MyProcessor(Processor):
...     def process(self, msg):
...         msg.update({'processed': True})
...         return None
>>>
>>> processor = MyProcessor('processor', '0.1.0', description='simple processor')
>>> config = {'data': [{'id': 1}]}
>>> processor.parse_args("--kind MEM --in-topic test --out-topic test".split(), config=config)
>>> processor.start()
>>> [r.get('id') for r in processor.destination.results]
[1]

Splitter

Splitter is to be used when writing to multiple outputs. It will take a function to generate output topic based on the processing message, and use it when writing output.

>>> from pipeline import Splitter, Message
>>>
>>> class MySplitter(Splitter):
...     def get_topic(self, msg):
...         return '{}-{}'.format(self.destination.topic, msg.get('id'))
...
...     def process(self, msg):
...         msg.update({
...             'processed': True,
...         })
...         return None
>>>
>>> splitter = MySplitter('splitter', '0.1.0', description='simple splitter')
>>> config = {'data': [{'id': 1}]}
>>> splitter.parse_args("--kind MEM --in-topic test --out-topic test".split(), config=config)
>>> splitter.start()
>>> [r.get('id') for r in splitter.destinations['test-1'].results]
[1]

Usage

Writing a Worker

Choose Generator, Processor or Splitter to subclass from.

Environment Variables

Application accepts following environment variables:

environment variable

command line argument

options

PIPELINE

–kind

KAFKA, PULSAR, FILE

PULSAR

–pulsar

pulsar url

TENANT

–tenant

pulsar tenant

NAMESPACE

–namespace

pulsar namespace

SUBSCRIPTION

–subscription

pulsar subscription

KAFKA

–kafka

kafka url

GROUPID

–group-id

kafka group id

INTOPIC

–in-topic

topic to read

OUTTOPIC

–out-topic

topic to write to

Custom Code

Define add_arguments to add new arguments to worker.

Define setup to run initialization code before worker starts processing messages. setup is called after command line arguments have been parsed. Logic based on options (parsed arguments) goes here.

Options

Errors

The value None above is error you should return if dct or dcts is empty. Error will be sent to topic errors with worker information.

Contribute

Use pre-commit to run black and flake8

Credits

Yifan Zhang (yzhang at hbku.edu.qa)

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tanbih-pipeline-0.8.7.tar.gz (222.3 kB view details)

Uploaded Source

Built Distribution

tanbih_pipeline-0.8.7-py3-none-any.whl (542.2 kB view details)

Uploaded Python 3

File details

Details for the file tanbih-pipeline-0.8.7.tar.gz.

File metadata

  • Download URL: tanbih-pipeline-0.8.7.tar.gz
  • Upload date:
  • Size: 222.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.8.2

File hashes

Hashes for tanbih-pipeline-0.8.7.tar.gz
Algorithm Hash digest
SHA256 37e82baa74b7955e4fe96d4788d6d2005e1664362b0c65e721fe854f8b88616d
MD5 006a1c0b3151e2447864a8b89b1daaa1
BLAKE2b-256 32a41a479c4257a39f8aee8fb208c9d9ecd08f22e348b75b4b1041e79d131311

See more details on using hashes here.

File details

Details for the file tanbih_pipeline-0.8.7-py3-none-any.whl.

File metadata

  • Download URL: tanbih_pipeline-0.8.7-py3-none-any.whl
  • Upload date:
  • Size: 542.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.8.2

File hashes

Hashes for tanbih_pipeline-0.8.7-py3-none-any.whl
Algorithm Hash digest
SHA256 24de97cd4a749623b2a9065da42d6a074612c6dbbf6a032f2163117823d58cfe
MD5 d17c6f3058aa44838e20f04443a6d1cb
BLAKE2b-256 aea12d4d606d1493ef0c368ed95d03a53faa1bbc009796dd2f59924f1a32987b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page