a pipeline framework for streaming processing

Pipeline is a data streaming framework that supports Pulsar and Kafka.

Generator

Use Generator when developing a data source for the pipeline. A source produces output without consuming any input; a crawler, for example, can be seen as a generator.

>>> from pipeline import Generator, Message
>>>
>>> class MyGenerator(Generator):
...     def generate(self):
...         for i in range(10):
...             yield {'id': i}
>>>
>>> generator = MyGenerator('generator', '0.1.0', description='simple generator')
>>> generator.parse_args("--kind MEM --out-topic test".split())
>>> generator.start()
>>> [r.dct['id'] for r in generator.destination.results]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Processor

Use Processor to process incoming messages. Messages are modified in place. For each input, a processor produces either one output or no output.

>>> from pipeline import Processor, Message
>>>
>>> class MyProcessor(Processor):
...     def process(self, dct_or_dcts):
...         if isinstance(dct_or_dcts, list):
...             print('SHOULD NOT BE HERE')
...         else:
...             dct_or_dcts['processed'] = True
...         return None
>>>
>>> processor = MyProcessor('processor', '0.1.0', description='simple processor')
>>> config = {'data': [{'id': 1}]}
>>> processor.parse_args("--kind MEM --in-topic test --out-topic test".split(), config=config)
>>> processor.start()
>>> [r.dct['id'] for r in processor.destination.results]
[1]

Splitter

Use Splitter when writing to multiple outputs. It uses a function (get_topic in the example below) to derive the output topic from the message being processed, and writes the output to that topic.

>>> from pipeline import Splitter, Message
>>>
>>> class MySplitter(Splitter):
...     def get_topic(self, dct):
...         return '{}-{}'.format(self.destination.topic, dct['id'])
...
...     def process(self, dct_or_dcts):
...         if isinstance(dct_or_dcts, list):
...             print('SHOULD NOT BE HERE')
...         else:
...             dct_or_dcts['processed'] = True
...         return None
>>>
>>> splitter = MySplitter('splitter', '0.1.0', description='simple splitter')
>>> config = {'data': [{'id': 1}]}
>>> splitter.parse_args("--kind MEM --in-topic test --out-topic test".split(), config=config)
>>> splitter.start()
>>> [r.dct['id'] for r in splitter.destinations['test-1'].results]
[1]
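
The routing idea can also be sketched without a broker or the library itself. The snippet below is a stand-alone illustration, not the library's implementation: route_by_topic and the in-memory dictionary of lists are hypothetical stand-ins for a Splitter and its destinations. Only the topic-naming pattern is taken from the example above.

```python
from collections import defaultdict


def get_topic(base_topic, dct):
    # Same naming pattern as MySplitter.get_topic above: "<out-topic>-<id>".
    return '{}-{}'.format(base_topic, dct['id'])


def route_by_topic(base_topic, messages):
    """Group messages into per-topic lists (hypothetical in-memory destinations)."""
    destinations = defaultdict(list)
    for dct in messages:
        destinations[get_topic(base_topic, dct)].append(dct)
    return dict(destinations)


routed = route_by_topic('test', [{'id': 1}, {'id': 2}, {'id': 1}])
print(sorted(routed))
```

Each distinct id ends up in its own topic, which is why the example above reads its result from splitter.destinations['test-1'].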

Usage

## Writing a Worker

Subclass Generator, Processor, or Splitter, depending on the kind of worker you need.

## Environment Variables

The application accepts the following environment variables:

| environment variable | command line argument | options             |
|----------------------|-----------------------|---------------------|
| PIPELINE             | --kind                | KAFKA, PULSAR, FILE |
| PULSAR               | --pulsar              | pulsar url          |
| TENANT               | --tenant              | pulsar tenant       |
| NAMESPACE            | --namespace           | pulsar namespace    |
| SUBSCRIPTION         | --subscription        | pulsar subscription |
| KAFKA                | --kafka               | kafka url           |
| GROUPID              | --group-id            | kafka group id      |
| INTOPIC              | --in-topic            | topic to read       |
| OUTTOPIC             | --out-topic           | topic to write to   |
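
One common way to pair environment variables with command-line arguments is to use each variable as the default for its flag. The sketch below uses only the standard library and is not the package's own code; build_parser and ENV_TO_FLAG are hypothetical names, and the precedence (command line overrides environment) is an assumption rather than something stated here.

```python
import argparse
import os

# (environment variable, command-line flag) pairs copied from the table above.
ENV_TO_FLAG = [
    ('PIPELINE', '--kind'),
    ('PULSAR', '--pulsar'),
    ('TENANT', '--tenant'),
    ('NAMESPACE', '--namespace'),
    ('SUBSCRIPTION', '--subscription'),
    ('KAFKA', '--kafka'),
    ('GROUPID', '--group-id'),
    ('INTOPIC', '--in-topic'),
    ('OUTTOPIC', '--out-topic'),
]


def build_parser(env=None):
    """Build a parser whose defaults come from the given environment mapping."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description='sketch of env-backed options')
    for variable, flag in ENV_TO_FLAG:
        # Assumption: an explicit flag overrides the environment value.
        parser.add_argument(flag, default=env.get(variable))
    return parser


fake_env = {'PIPELINE': 'KAFKA', 'INTOPIC': 'raw'}
options = build_parser(fake_env).parse_args(['--in-topic', 'clean'])
print(options.kind, options.in_topic, options.out_topic)
```

Passing a plain dictionary instead of os.environ keeps the sketch easy to test; calling build_parser() with no argument reads the real environment.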

## Custom Code

Define add_arguments to add new command-line arguments to the worker.

Define setup to run initialization code before the worker starts processing messages. setup is called after the command-line arguments have been parsed, so any logic that depends on the options (the parsed arguments) belongs here.
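
The order of these hooks can be illustrated with a stand-alone sketch. SketchWorker below is a hypothetical stand-in written for this example, not the library's base class, and the hook signatures (add_arguments(self, parser), setup(self)) and the options attribute are assumptions. The sketch only shows the sequence described above: register arguments, parse, then run setup.

```python
import argparse


class SketchWorker:
    """Hypothetical stand-in for a worker base class; not the library's code."""

    def __init__(self, name):
        self.name = name
        self.options = None

    def add_arguments(self, parser):
        """Hook: subclasses register extra command-line arguments here."""

    def setup(self):
        """Hook: subclasses run initialization here, after parsing."""

    def parse_args(self, args):
        parser = argparse.ArgumentParser(prog=self.name)
        parser.add_argument('--in-topic')
        parser.add_argument('--out-topic')
        self.add_arguments(parser)           # custom arguments are added first
        self.options = parser.parse_args(args)
        self.setup()                         # runs once options are available


class Tagger(SketchWorker):
    def add_arguments(self, parser):
        parser.add_argument('--tag', default='none')

    def setup(self):
        # Logic based on the parsed options goes here.
        self.prefix = self.options.tag.upper()

    def process(self, dct):
        dct['tag'] = self.prefix             # modify the message in place


worker = Tagger('tagger')
worker.parse_args('--in-topic a --out-topic b --tag news'.split())
message = {'id': 1}
worker.process(message)
print(message)
```

In this sketch setup runs at the end of parse_args; the real worker may call it at a different point, as long as it happens after parsing and before messages are processed.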

## Options

## Errors

The value None returned by process in the examples above is the error value; it is what you should return if dct or dcts is empty. Errors are sent to the topic errors together with worker information.

Credits

Yifan Zhang (yzhang at hbku.edu.qa)
