tanbih-pipeline

A pipeline framework for stream processing

These details have not been verified by PyPI

Project links

Homepage

Project description


Pipeline provides a unified interface for setting up data stream processing systems with Kafka, Pulsar, RabbitMQ, Redis, and many more. The idea is to shield developers from changes in deployment technology, so that a Docker image released for a certain task can be used with Kafka or Redis simply by changing environment variables.

Features

a unified interface from Kafka to Pulsar, from Redis to MongoDB
component connections controlled via the command line or environment variables
file and in-memory backends supported for testing

Requirements

Python 3.7, 3.8

Installation

$ pip install tanbih-pipeline

You can install the required backend dependencies with:

$ pip install tanbih-pipeline[redis]
$ pip install tanbih-pipeline[kafka]
$ pip install tanbih-pipeline[pulsar]
$ pip install tanbih-pipeline[rabbitmq]
$ pip install tanbih-pipeline[elastic]
$ pip install tanbih-pipeline[mongodb]

If you want to support all backends, you can:

$ pip install tanbih-pipeline[full]

Producer

Use Producer when developing a data source for the pipeline. A source produces output without consuming any input; a crawler, for example, can be seen as a producer.

>>> from typing import Generator
>>> from pydantic import BaseModel
>>> from pipeline import Producer as Worker, ProducerSettings as Settings
>>>
>>> class Output(BaseModel):
...     key: int
>>>
>>> class MyProducer(Worker):
...     def generate(self) -> Generator[Output, None, None]:
...         for i in range(10):
...             yield Output(key=i)
>>>
>>> settings = Settings(name='producer', version='0.0.0', description='')
>>> producer = MyProducer(settings, output_class=Output)
>>> producer.parse_args("--out-kind MEM --out-topic test".split())
>>> producer.start()
>>> [r.get('key') for r in producer.destination.results]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Processor

Use Processor to process input. Modification is done in place. A processor can produce one output for each input, or no output.

>>> from pydantic import BaseModel
>>> from pipeline import Processor as Worker, ProcessorSettings as Settings
>>>
>>> class Input(BaseModel):
...     temperature: float
>>>
>>> class Output(BaseModel):
...     is_hot: bool
>>>
>>> class MyProcessor(Worker):
...     def process(self, content, key):
...         is_hot = (content.temperature > 25)
...         return Output(is_hot=is_hot)
>>>
>>> settings = Settings(name='processor', version='0.1.0', description='')
>>> processor = MyProcessor(settings, input_class=Input, output_class=Output)
>>> args = "--in-kind MEM --in-topic test --out-kind MEM --out-topic test".split()
>>> processor.parse_args(args)
>>> processor.start()
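
The "one output for each input, or no output" contract can be illustrated without the library. The sketch below is a plain-Python stand-in, not tanbih-pipeline's implementation: Reading, Alert, process and run are hypothetical names, and it assumes that returning None signals "no output", which the text above does not state explicitly.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Reading:
    temperature: float


@dataclass
class Alert:
    is_hot: bool


def process(content: Reading) -> Optional[Alert]:
    """Return an Alert for hot readings, or None to emit nothing (assumed convention)."""
    if content.temperature > 25:
        return Alert(is_hot=True)
    return None


def run(inputs: Iterable[Reading]) -> List[Alert]:
    """Apply process() to each input and keep only the non-None results."""
    outputs = []
    for item in inputs:
        result = process(item)
        if result is not None:
            outputs.append(result)
    return outputs


# Three inputs, only one of which is above the threshold.
alerts = run([Reading(20.0), Reading(30.5), Reading(25.0)])
print(alerts)
```

Filtering inside the worker this way keeps downstream topics free of records that carry no information.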

Splitter

Use Splitter when writing to multiple outputs. It takes a function that generates the output topic from the message being processed, and uses that topic when writing the output.

>>> from pipeline import Splitter as Worker, SplitterSettings as Settings
>>>
>>> class MySplitter(Worker):
...     def get_topic(self, msg):
...         return '{}-{}'.format(self.destination.topic, msg.get('id'))
>>>
>>> settings = Settings(name='splitter', version='0.1.0', description='')
>>> splitter = MySplitter(settings)
>>> args = "--in-kind MEM --in-topic test --out-kind MEM --out-topic test".split()
>>> splitter.parse_args(args)
>>> splitter.start()
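
The topic-routing idea can be tried in isolation. The following is a minimal sketch that does not use the library: get_topic mirrors the formatting used in MySplitter above, while split and the sample messages are hypothetical and only show how messages end up grouped by the generated topic.

```python
from collections import defaultdict
from typing import Dict, List


def get_topic(base_topic: str, msg: dict) -> str:
    """Build the output topic from the base topic and the message's 'id' field."""
    return "{}-{}".format(base_topic, msg.get("id"))


def split(base_topic: str, messages: List[dict]) -> Dict[str, List[dict]]:
    """Group messages by the topic that get_topic() assigns to each of them."""
    routed = defaultdict(list)
    for msg in messages:
        routed[get_topic(base_topic, msg)].append(msg)
    return dict(routed)


messages = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
routed = split("test", messages)
print(sorted(routed))
```

Note that a message without an id field is routed to a topic ending in "-None" with this formatting, so a real get_topic may want to validate the field first.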

Usage

Choosing backend technology:

kind	description	multi-reader	shared reader	data expire
LREDIS	Redis List	X	X	read
XREDIS	Redis Stream	X	X	limit
KAFKA	Kafka	X	X	read
PULSAR	Pulsar	X	X	ttl
RABBITMQ	RabbitMQ	X		read
ELASTIC	ElasticSearch
MONGODB	MongoDB
FILE*	json,csv
MEM*	memory

* FILE accepts JSONL input from stdin or from a named file; it also accepts CSV files. Both formats can be gzipped.
* MEM reads from and writes to memory; it is designed for unit tests.
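
To prepare test input for the FILE kind, a gzipped JSONL file can be produced with the standard library alone. This is a sketch under assumptions: the helper names and the data.jsonl.gz filename are hypothetical, and how the FILE backend detects compression is not stated here.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(path, records):
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Read the file back, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


records = [{"key": i} for i in range(3)]
path = os.path.join(tempfile.mkdtemp(), "data.jsonl.gz")
write_jsonl_gz(path, records)
loaded = read_jsonl_gz(path)
print(loaded)
```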

# check command line arguments for certain input and output
worker.py --in-kind FILE --help
# or
IN_KIND=FILE worker.py
# or
export IN_KIND=FILE
worker.py --help

# process input from file and output to stdout (--in-content-only is
# needed for this version)
worker.py --in-kind FILE --in-filename data.jsonl --in-content-only \
          --out-kind FILE --out-filename -


# read from file and write to KAFKA
worker.py --in-kind FILE --in-filename data.jsonl --in-content-only \
          --out-kind KAFKA --out-namespace test --out-topic articles \
          --out-kafka kafka_url --out-config kafka_config_json

Arguments

common

debug monitoring

kind namespace topic

input:

FILE

Scripts

pipeline-copy is a script to copy data from a source to a destination. It can be used to inject data from a file to a database, or from a database to another database. It is implemented as a Pipeline worker.

Since the JSON format does not support datetimes, you can provide a model definition via the --model-definition argument so that pipeline-copy treats datetime fields as datetimes instead of strings. An example of such a model definition follows (the class name needs to be Model):

from datetime import datetime
from typing import Optional

from pydantic import BaseModel

class Model(BaseModel):
    hashtag: str
    username: str
    text: str
    tweet_id: str
    location: Optional[str]
    created_at: datetime
    retweet_count: int
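
The reason a model definition is needed can be seen with the standard library alone. In this sketch the sample record is made up, and datetime.fromisoformat stands in for the conversion that a typed created_at field would perform:

```python
import json
from datetime import datetime

# A made-up record; JSON has no datetime type, so the timestamp is text.
raw = '{"tweet_id": "1", "created_at": "2021-12-23T10:15:00"}'
record = json.loads(raw)

# After plain JSON decoding the field is still a string.
print(type(record["created_at"]).__name__)

# An explicit conversion is required to get a real datetime object.
parsed = datetime.fromisoformat(record["created_at"])
print(parsed.year, parsed.month, parsed.day)
```

Without a typed field, a destination such as a database would store the timestamp as text, which rules out date-range queries on it.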

Environment Variables

The application accepts backend-specific environment variables. Note that you need to add a prefix to indicate whether an option applies to input or output: IN_ (or --in- on the command line) for input, and OUT_ (or --out- on the command line) for output. Please refer to the backend documentation for the available arguments and environment variables.
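
The prefix convention can be summarized in a few lines of code. This is only an illustrative sketch: env_name and flag_name are hypothetical helpers, not part of the library, and the mapping is inferred from the IN_KIND / --in-kind pair shown in the Usage section.

```python
def env_name(direction: str, option: str) -> str:
    """Environment variable form, e.g. ('in', 'kind') gives 'IN_KIND'."""
    return "{}_{}".format(direction, option).upper().replace("-", "_")


def flag_name(direction: str, option: str) -> str:
    """Command-line form, e.g. ('in', 'kind') gives '--in-kind'."""
    return "--{}-{}".format(direction, option).lower().replace("_", "-")


# Print both spellings of the same option for input and output.
for direction in ("in", "out"):
    print(env_name(direction, "kind"), flag_name(direction, "kind"))
```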

Customize Settings

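from pydantic import BaseModel, Field
from pipeline import Processor, ProcessorSettings as Settings

class CustomSettings(Settings):
    new_argument: str = Field("", title="a new argument for custom settings")

class CustomProcessor(Processor):
    def __init__(self):
        # pass settings by keyword, as in the Producer and Processor examples
        settings = CustomSettings(name="worker", version="v0.1.0", description="custom processor")
        super().__init__(settings, input_class=BaseModel, output_class=BaseModel)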

Errors

PipelineError is raised when an error occurs.

Contribute

Use pre-commit to run black and flake8

Credits

Yifan Zhang (yzhang at hbku.edu.qa)

