
A system and language for handling any process across multiple workers, supporting some (and eventually most) languages


Pipeline

Pipeline is an asynchronous ETL (Extract, Transform, Load) system that uses a custom scripting language to run code across multiple servers, one step at a time. It's designed for efficient handling of large-scale data processing tasks, particularly those involving APIs with long wait times or I/O-heavy workloads.

Table of Contents

  • Features
  • Installation
  • Quick Start
  • Supported Languages
  • Configuration
  • Usage
  • Performance
  • Plans
  • License

Features

  • Asynchronous execution of code across multiple servers
  • Custom scripting language for defining ETL pipelines
  • Support for Python, SQLite3, and PostgreSQL
  • Efficient handling of APIs with long wait times
  • Optimized for I/O-heavy workloads
  • Scalable architecture for processing large amounts of data

Installation

  1. Clone the repository: git clone https://github.com/yourusername/pipeline.git, then cd pipeline
  2. Install required packages: pip install -r requirements.txt
  3. (Optional) Build Cython files: python build.py (This can give a 3x performance boost)
  4. (Optional) Configure PostgreSQL settings in the .env file.

Quick Start

  1. Run the demo server: python demo.py
  2. In a separate terminal, run the example uploading code: python example.py

Supported Languages

  • Python
  • SQLite3
  • PostgreSQL

Configuration

  • Set up at least 4 servers on a private network (they can be small; you can technically run everything on one server, as demo.py does, but that's not recommended)

  • Create a server running python bucket.py (or, if you built the Cython files, python -c "import c_bucket;c_bucket.main()")

  • Create a server running python pipeline.py (or python -c "import c_pipeline;c_pipeline.main()")

  • Create a server running python worker.py (or python -c "import c_worker;c_worker.main()")

  • Edit the .env on each server so it points to the correct private IPs: on the server running worker.py, set PIPE_WORKER_HOST to the server running pipeline.py; on both the worker.py and pipeline.py servers, set BUCKET_CLIENT_HOST to the server running bucket.py (see the example .env sketch after this list)

  • Add "worker" servers until desired speed

  • Create a server with both private and public network access, and use it to run pipeline.upload_pipe_code_from_file or pipeline.upload_pipe_code to upload the script that will be run.

  • All workers must also have the files necessary to run your code, pip installs and all

  • (Optional) The PIPE_WORKER_SUBPROCESS_JOBS value in the .env file can be set to true or false (anything other than true is treated as false). This setting controls whether your Python code runs in a subprocess or inside the "worker" script itself. Setting it to false gives a very slight performance increase, but requires restarting the server every time you change your project.
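
To tie these settings together, here is a minimal sketch of what the .env on a worker server might contain. Only variables named in this README are shown; the IP addresses, the subprocess setting, and the comma-separated format for PIPE_WORKER_SCOPES are illustrative assumptions, not documented defaults.

# hypothetical private IPs, for illustration only
PIPE_WORKER_HOST=10.0.0.2                          # the server running pipeline.py
BUCKET_CLIENT_HOST=10.0.0.3                        # the server running bucket.py
PIPE_WORKER_SUBPROCESS_JOBS=true                   # run python steps in a subprocess
PIPE_WORKER_SCOPES=production-small,testing-small  # scopes this worker will accept (format assumed)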

Usage

Pipeline uses a custom scripting language to define ETL processes. Here's how to use it:

Basic Structure

A Pipeline script consists of steps and pipes. Each step defines a task, and pipes determine the order of execution.

# Step definition
step_name:
    language
    function_or_table_name
    source_file_or_code

# Pipe definition
pipe_name = step1 | step2 | step3

# Execution
pipe_name()

Supported Languages

  • python: For Python code
  • sqlite3: For SQLite queries
  • postgres: For PostgreSQL queries

Learn by Example

# the default scope is set to `production-small` for all steps (imports)
# setting scopes is how you keep new steps with errors from slowing down
# your servers, by giving them a lower scope.
# It's also how you handle processes that do or do not require big machines to run
$ production-small

# step 1: `accounts`
accounts:
    python  # <-- select the language to be run. currently only python, sqlite3 and postgres are available
    accounts  # define the function or table name that will be used
    example.py  # either provide a file or write code directly using the "`" char (see below example)

request:
    python
    request_report
    example.py

status:
    python
    $ testing-small  # <-- "scope" for a single step. A lower scope is given less priority than higher scopes. See PIPE_WORKER_SCOPES in `.env` file
    get_status
    example.py

download:
    python
    !9  # <-- "priority" higher numbers are more important and run first within their scope.
    get_report
    example.py

manipulate_data:
    sqlite3
    some_table  # *vvvv* see below for writing code directly *vvvv*
    `
SELECT
    *,
    CASE
        WHEN sales = 0
        THEN 0.0
        ELSE spend / sales
    END AS acos
FROM some_table
`

## this one's just to show postgres as well
#manipulate_data_again:
#    postgres
#    another_table
#    `
#select
#    *,
#    case
#        when spend = 0
#        then 0.0
#        else sales / spend
#    end AS roas
#from another_table
#`

upload:
    python
    upload_to_db
    example.py


# these are pipes: they tell the server what order to run the steps in
# and also transfer the returned data between steps.
# each step is run individually and could be run on a different computer each time
accounts_pipe = | accounts  # single pipes currently need a `|` before or after the value
api_pipe = request | status | download | manipulate_data | upload


# currently there are only two syntaxes for "running" pipes.
# either by itself:
# pipe()
#
# or in a loop:
# for value in pipe1():
#     pipe2(value)

# # Another Example:
# v = pipe(accounts_pipe)  # <-- single call
# pipe2(v)

# right now you cannot pass arguments to the pipe used in the for loop.
# in this case `accounts_pipe()` cannot be `accounts_pipe(some_value)`
for account in accounts_pipe():
    api_pipe(account)
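
For reference, the functions named in these steps live in example.py. The file itself isn't reproduced in this README, but since pipes pass each step's return value on to the next step, it presumably looks roughly like the sketch below; the signatures and bodies are illustrative assumptions, not the actual code.

# illustrative sketch of example.py (the real file ships with the repository)

def accounts():
    # step 1: return the accounts to process; the for loop feeds each one into api_pipe
    return [{'account_id': 'A1'}, {'account_id': 'A2'}]

def request_report(account):
    # receives one account from accounts_pipe and requests a report for it
    return {'account_id': account['account_id'], 'report_id': 'R123'}

def get_status(job):
    # check whether the requested report is ready to download
    return job

def get_report(job):
    # download the finished report and return its data for the next step
    return {'account_id': job['account_id'], 'rows': []}

def upload_to_db(data):
    # final step: push the transformed rows into the destination database
    pass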

Scopes and Priorities

Use scopes and priorities to control execution:

$ production  # Set default scope


step_name:
    python
    !9  # Set priority (higher numbers run first within their scope)
    $ testing     # Set a lower priority scope
    function_name
    source_file

Writing Code Directly

For short snippets, you can write code directly in the script:

step_name:
    sqlite3
    table_name
    `
    SELECT * FROM table_name
    WHERE condition = 'value'
    `
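
The comments in the example script above suggest the backtick syntax isn't limited to SQL; a Python step can presumably be written inline the same way. The step below is a hypothetical sketch under that assumption (including the assumption that the inline code defines the function named on the line above it):

say_hello:
    python
    say_hello
    `
def say_hello(data):
    # hypothetical inline function; prints and passes the data through unchanged
    print('hello', data)
    return data
`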

Defining Pipes

Pipes determine the order of step execution:

single_pipe = | step1  # or `step1 |`
normal_pipe = step1 | step2 | step3

Executing Pipes

There are two ways to execute pipes:

Single call

pipe1()
result1 = pipe2()
result2 = pipe3(result1)
pipe4(result2)

pipe5(result1, result2)

# incorrect --> `pipe3(pipe2())`  #  this syntax is currently not supported
# also incorrect, they must be on one line as of now:
# `pipe3(
#   result1
# )`

Looped execution

for item in pipe1():
    pipe2(item)
# incorrect --> `for item in pipe1(result):`  # syntax not supported for now

Running Your Pipeline

  • Save your pipeline script as a .pipe file.
  • Use the Pipeline API to upload and run your script:
# example.py
import pipeline

pipeline.upload_pipe_code_from_file('your_script.pipe')
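
The Configuration section also mentions pipeline.upload_pipe_code, which presumably accepts the script text itself rather than a file path; a small sketch under that assumption:

import pipeline

# read the script yourself and hand its text to the (assumed) string-based uploader
with open('your_script.pipe') as f:
    pipeline.upload_pipe_code(f.read())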

Performance

Pipeline is specifically designed to handle I/O-heavy workloads efficiently. It excels in scenarios such as:

  • Making numerous API calls, especially to services with long processing times
  • Handling large-scale data transfers between different systems
  • Concurrent database operations

For instance, Pipeline is currently being used by an agency to request 30,000 reports daily from the Amazon Ads API, resulting in at least 90,000 API calls per day. This process, which includes pushing the data into a PostgreSQL server holding over 600 GB, completes within a few hours (adding more workers could make it a lot faster). The system delivers this level of performance at a cost of under $100, including database expenses; the servers requesting the data account for only about $25 of that.

The asynchronous nature of Pipeline makes it particularly suited for APIs like Amazon Ads, where there are significant wait times between requesting a report and its availability for download. Traditional synchronous ETL processes struggle with such APIs, especially for agencies with numerous profiles.

Plans

If this project sees some love, or I just find more free time, I'd like to:

  • Support more languages, even compiled ones such as Rust, Go and C++, allowing teams that write in different languages to work on the same program.
  • Turn this project into a pip package.
  • Rewrite it in Rust for performance.

License

  • MIT License
