Generic ETL Pipeline Framework for Apache Spark
Overview
Goal
spark_etl provides a platform-independent way of building Spark applications.
Benefit
- Your application can be moved to a different Spark platform with little or no change.
Supported platforms
- Local Spark, via the pyspark package.
- Spark clusters, via the Livy interface.
- Oracle Dataflow.
Concepts
Application
An application is the code for a Spark job. It contains:
- A `main.py` file, which contains the application entry point.
- A `manifest.json` file, which specifies the metadata of the application.
- A `requirements.txt` file, which specifies the application's dependencies.
See examples/myapp for an example.
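For illustration, a minimal `main.py` could look like this (it uses the entry-point signature described under "Application entry signature" below; the body is just a placeholder):

```python
def main(spark, input_args, sysops={}):
    # spark is the SparkSession handed in by the framework.
    df = spark.range(100)
    # The return value travels back to the caller via the job submitter.
    return {"count": df.count()}
```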
Build an application
To build an application, run:

```bash
./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>
```
- `<app-dir>` is the directory where your application is located.
- `<build-dir>` is the directory where you want the build to be placed. The build actually lands in `<build-dir>/<version>`, where `<version>` is specified by the application's manifest file.
- The build is mostly platform independent; you only need to depend on the oci-core package if you intend to use Oracle Dataflow.
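Since the build lands in `<build-dir>/<version>`, the version recorded in the manifest tells you where to find it. A small sketch (the manifest field name `version` is an assumption based on the description above):

```python
import json
import os

app_dir = "examples/myapp"
build_dir = "build"

# Read the version from the app's manifest; the build output should then
# sit under <build-dir>/<version>. (Field name "version" is assumed.)
with open(os.path.join(app_dir, "manifest.json")) as f:
    version = json.load(f)["version"]

print(os.path.join(build_dir, version))  # e.g. build/1.0.0
```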
Application entry signature
In your application's `main.py`, you should have a `main` function with the following signature:

```python
def main(spark, input_args, sysops={}):
    # your code here
```

- `spark` is the Spark session object.
- `input_args`, a dict, holds the arguments the user specified when running this job.
- `sysops` holds the system options passed in; it is platform specific.
- Your `main` function's return value is returned from the job submitter to the caller.
See examples/myapp for an example.
Deployer
- spark_etl supports the following deployers:
  - `spark_etl.vendors.local.LocalDeployer`
  - `spark_etl.deployers.HDFSDeployer`
  - `spark_etl.vendors.oracle.DataflowDeployer`
- The `etl.py` command uses the config file to decide which deployer to use.
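The deployer classes can presumably also be driven programmatically. The sketch below is hypothetical, since this page documents only the class names: the zero-argument constructor and the `deploy(build_dir, deploy_dir)` method are assumptions mirroring the CLI options.

```python
from spark_etl.vendors.local import LocalDeployer

# Hypothetical sketch: the constructor and deploy() signature are assumptions,
# mirroring etl.py's --build-dir and --deploy-dir options.
deployer = LocalDeployer()
deployer.deploy("build/1.0.0", "/tmp/deployed/myapp")
```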
Job Submitter
- spark_etl supports the following job submitters:
  - `spark_etl.vendors.local.PySparkJobSubmitter`
  - `spark_etl.job_submitters.livy_job_submitter.LivyJobSubmitter`
  - `spark_etl.vendors.oracle.DataflowJobSubmitter`
- A job submitter's `run` function returns the return value of the job's `main` function.
- The `etl.py` command uses the config file to decide which job submitter to use.
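Similarly, a hedged sketch of calling a job submitter directly: only the fact that `run` returns the job's `main` return value comes from this page; the constructor and `run` arguments shown here are assumptions.

```python
from spark_etl.vendors.local import PySparkJobSubmitter

# Hypothetical sketch: the constructor and run() arguments are assumptions.
submitter = PySparkJobSubmitter(config={})
ret = submitter.run("/tmp/deployed/myapp/1.0.0", args={"alpha": 1})

# Per the docs, run() returns whatever the job's main() returned.
print(ret)
```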
Deploy a job
Using the `etl.py` command (`examples/etl.py`):

```bash
./etl.py -a deploy \
    -c <config-filename> \
    --build-dir <build-dir> \
    --deploy-dir <deploy-dir>
```

- `-c <config-filename>`: specifies the config file to use for the deployment.
- `--build-dir <build-dir>`: specifies where to look for the build bits to deploy.
- `--deploy-dir <deploy-dir>`: specifies the destination for the deployment.
Run a job
```bash
./etl.py -a run \
    -c <config-filename> \
    --deploy-dir <deploy-dir> \
    --version <version> \
    --args <input-json-file>
```

- `-c <config-filename>`: specifies the config file.
- `--deploy-dir <deploy-dir>`: specifies where to look for the deployed bits to run.
- `--version <version>`: specifies which version of the app to run.
- `--args <input-json-file>`: optional input for the job. `<input-json-file>` points to a JSON file; the value of the file is passed to the job's `main` function in the `input_args` parameter. If this option is missing, `input_args` is set to `{}` when the job's `main` function is called.
- The command prints the return value of the job's `main` function.
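To make the `--args` plumbing concrete: the JSON file's value becomes the `input_args` dict seen by `main`. For example:

```python
import json

# Write the input file referenced by --args <input-json-file>.
with open("input.json", "w") as f:
    json.dump({"alpha": 1}, f)

# ./etl.py -a run -c <config-filename> --deploy-dir <deploy-dir> \
#     --version <version> --args input.json
# The job's main() is then called with input_args == {"alpha": 1}.
```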
Examples
See the examples directory (e.g. examples/myapp) for complete samples.