Datajob

Build and deploy a serverless data pipeline with no effort on AWS.

  • Deploy your code to a Glue job
  • Package your project and make it available on AWS
  • Orchestrate your pipeline using Step Functions, as simply as task1 >> [task2, task3] >> task4

Installation

datajob can be installed using pip. Note that it depends on the AWS CDK CLI, which is installed with npm:

pip install datajob
npm install -g aws-cdk
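
To check that both command line tools are available afterwards (a quick sanity check, assuming the entry points ended up on your PATH):

datajob --help
cdk --version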

Example

A simple data pipeline with 3 Glue Python shell tasks, where task1 and task2 run in parallel and task3 runs after both have finished. See the full example here.

from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow


# the datajob_stack is the instance that will result in a cloudformation stack.
# we inject the datajob_stack object through all the resources that we want to add.
with DataJobStack(stack_name="data-pipeline-simple") as datajob_stack:

    # here we define 3 glue jobs with the datajob_stack object,
    # a name and the relative path to the source code.
    task1 = GlueJob(
        datajob_stack=datajob_stack,
        name="task1",
        path_to_glue_job="data_pipeline_simple/task1.py",
    )
    task2 = GlueJob(
        datajob_stack=datajob_stack,
        name="task2",
        path_to_glue_job="data_pipeline_simple/task2.py",
    )
    task3 = GlueJob(
        datajob_stack=datajob_stack,
        name="task3",
        path_to_glue_job="data_pipeline_simple/task3.py",
    )

    # we instantiate a step functions workflow and add the tasks
    # we want to orchestrate. We borrowed the orchestration idea from
    # airflow: a list runs tasks in parallel and the bitwise
    # operator '>>' chains the tasks in our workflow.
    with StepfunctionsWorkflow(
        datajob_stack=datajob_stack,
        name="data-pipeline-simple",
    ) as sfn:
        [task1, task2] >> task3
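
The chaining syntax also supports mixing sequential and parallel steps, as in the task1 >> [task2, task3] >> task4 example from the feature list. A minimal sketch, continuing inside the same with DataJobStack(...) block as above and assuming a hypothetical fourth task:

    # hypothetical fourth task, defined like task1-task3 above
    task4 = GlueJob(
        datajob_stack=datajob_stack,
        name="task4",
        path_to_glue_job="data_pipeline_simple/task4.py",
    )

    with StepfunctionsWorkflow(
        datajob_stack=datajob_stack,
        name="data-pipeline-mixed",
    ) as sfn:
        # task1 runs first, then task2 and task3 in parallel, then task4
        task1 >> [task2, task3] >> task4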

Deploy and destroy

Deploy your pipeline using a unique identifier passed via --stage, and point to the pipeline's configuration file via --config.

export AWS_DEFAULT_ACCOUNT=my-account-number
export AWS_PROFILE=my-profile
cd examples/data_pipeline_simple
datajob deploy --stage dev --config datajob_stack.py
datajob destroy --stage dev --config datajob_stack.py

Note: When the datajob cli deploys a pipeline, it shells out to aws cdk. You can skip the datajob cli and run cdk directly; the datajob cli prints the commands it runs behind the scenes, so you can reuse them:

cd examples/data_pipeline_simple
cdk deploy --app "python datajob_stack.py" -c stage=dev
cdk destroy --app "python datajob_stack.py" -c stage=dev
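
To see which stacks the app defines before deploying, the standard cdk list command can be used as well (the stack names it prints depend on your stack name and stage):

cdk ls --app "python datajob_stack.py" -c stage=dev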

Ideas

  • trigger a pipeline using the cli, e.g. datajob run --pipeline my-simple-pipeline
  • implement a data bucket that is used by your pipeline.
  • add a time based trigger to the step functions workflow.
  • add an s3 event trigger to the step functions workflow.
  • add a lambda that copies data from one s3 location to another.
  • version your data pipeline.
  • implement sagemaker services
    • processing jobs
    • hyperparameter tuning jobs
    • training jobs
    • create sagemaker model
    • create sagemaker endpoint
    • expose sagemaker endpoint to the internet by leveraging lambda + api gateway

Any suggestions can be shared by starting a discussion.

