
A pipeline framework for stream processing

Project description


A flexible stream processing framework supporting RabbitMQ, Pulsar, Kafka, and Redis.

Features

  • at-least-once delivery guaranteed through acknowledgement of every message

  • horizontally scalable through consumer groups

  • flow is controlled at deployment time: develop once, use it everywhere

  • testability provided with FILE and MEMORY input/output (see the sketch after this list)
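
A minimal sketch of how the MEM queue keeps a unit test self-contained (NumberSource and the test function are illustrative, not part of the library; they follow the Generator example further down):

from pipeline import Generator

class NumberSource(Generator):
    def generate(self):
        for i in range(3):
            yield {'id': i}

def test_number_source_emits_all_ids():
    # MEM keeps the queue in memory, so the test needs no running broker
    source = NumberSource('numbers', '0.1.0', description='test generator')
    source.parse_args("--kind MEM --out-topic numbers".split())
    source.start()
    assert [r.get('id') for r in source.destination.results] == [0, 1, 2]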

Parameters

  • kind - specifies the underlying technology for the pipeline, for example, KAFKA or RabbitMQ

  • MEM - memory-based queue (good for unit tests)

  • FILE - file-based queue (good for development and integration tests)

Generator

Generator is to be used when developing a data source for the pipeline. A source produces output without consuming any input; a crawler, for example, can be implemented as a generator.

>>> from pipeline import Generator, Message
>>>
>>> class MyGenerator(Generator):
...     def generate(self):
...         for i in range(10):
...             yield {'id': i}
>>>
>>> generator = MyGenerator('generator', '0.1.0', description='simple generator')
>>> generator.parse_args("--kind MEM --out-topic test".split())
>>> generator.start()
>>> [r.get('id') for r in generator.destination.results]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Processor

Processor is to be used to process input. Modifications are made in place. A processor can produce one output for each input, or no output.

>>> from pipeline import Processor, Message
>>>
>>> class MyProcessor(Processor):
...     def process(self, msg):
...         msg.update({'processed': True})
...         return None
>>>
>>> processor = MyProcessor('processor', '0.1.0', description='simple processor')
>>> config = {'data': [{'id': 1}]}
>>> processor.parse_args("--kind MEM --in-topic test --out-topic test".split(), config=config)
>>> processor.start()
>>> [r.get('id') for r in processor.destination.results]
[1]

Splitter

Splitter is to be used when writing to multiple outputs. It takes a function that generates the output topic from the message being processed and uses that topic when writing output.

>>> from pipeline import Splitter, Message
>>>
>>> class MySplitter(Splitter):
...     def get_topic(self, msg):
...         return '{}-{}'.format(self.destination.topic, msg.get('id'))
...
...     def process(self, msg):
...         msg.update({
...             'processed': True,
...         })
...         return None
>>>
>>> splitter = MySplitter('splitter', '0.1.0', description='simple splitter')
>>> config = {'data': [{'id': 1}]}
>>> splitter.parse_args("--kind MEM --in-topic test --out-topic test".split(), config=config)
>>> splitter.start()
>>> [r.get('id') for r in splitter.destinations['test-1'].results]
[1]

Usage

Writing a Worker

Choose Generator, Processor or Splitter to subclass from, depending on whether your worker produces, transforms, or routes messages. A complete worker module is sketched below.
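
A hedged sketch of a complete worker module (the module layout and the __main__ entry point are illustrative; the Processor API follows the example above, and calling parse_args() with no arguments is an assumption, not documented behaviour):

from pipeline import Processor

class MyWorker(Processor):
    def process(self, msg):
        # modify the message in place, as in the Processor example above
        msg.update({'processed': True})
        return None

if __name__ == '__main__':
    worker = MyWorker('my-worker', '0.1.0', description='example worker')
    # assumption: with no explicit argument list, parse_args reads sys.argv
    # (and/or the environment variables documented below)
    worker.parse_args()
    worker.start()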

Environment Variables

The application accepts the following environment variables:

environment variable    command line argument    options
PIPELINE                --kind                   KAFKA, PULSAR, FILE
PULSAR                  --pulsar                 pulsar url
TENANT                  --tenant                 pulsar tenant
NAMESPACE               --namespace              pulsar namespace
SUBSCRIPTION            --subscription           pulsar subscription
KAFKA                   --kafka                  kafka url
GROUPID                 --group-id               kafka group id
INTOPIC                 --in-topic               topic to read
OUTTOPIC                --out-topic              topic to write to
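
The same settings can also be supplied as command line flags. A hedged sketch (flag names come from the table above; the broker address, group id, and topic names are placeholders, and MyWorker is the illustrative class from the worker sketch above):

>>> worker = MyWorker('my-worker', '0.1.0', description='kafka worker')
>>> worker.parse_args(
...     "--kind KAFKA --kafka kafka-broker:9092 "
...     "--group-id my-group --in-topic input --out-topic output".split())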

Custom Code

Define add_arguments to add new command line arguments to the worker.

Define setup to run initialization code before the worker starts processing messages. setup is called after command line arguments have been parsed; logic based on options (parsed arguments) goes here.
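
A hedged sketch of both hooks (assuming add_arguments receives an argparse-style parser and that parsed values are exposed on the worker as self.options; both details are assumptions rather than documented API):

>>> from pipeline import Processor
>>>
>>> class CustomProcessor(Processor):
...     def add_arguments(self, parser):
...         # assumption: parser behaves like an argparse.ArgumentParser
...         parser.add_argument('--model-path', help='path to a model file')
...
...     def setup(self):
...         # called after command line arguments have been parsed;
...         # assumption: parsed values are available as self.options
...         self.model_path = self.options.model_path
...
...     def process(self, msg):
...         msg.update({'model': self.model_path})
...         return None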

Options

Errors

The None returned above is the error value: return an error instead when dct or dcts is empty. The error will be sent to the topic errors along with worker information.
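
A minimal sketch, assuming a non-None return value from process is treated as the error and routed to the errors topic (the validation rule here is illustrative):

>>> from pipeline import Processor
>>>
>>> class ValidatingProcessor(Processor):
...     def process(self, msg):
...         if not msg:
...             # assumption: a non-None return value is reported as the error
...             # and sent to the errors topic with worker information
...             return 'empty message'
...         msg.update({'processed': True})
...         return None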

Contribute

Use pre-commit to run black and flake8.

Credits

Yifan Zhang (yzhang at hbku.edu.qa)


