# Generic ETL Pipeline Framework for Apache Spark
See https://stonezhong.github.io/spark_etl/ for more information.
## Overview

### Goal

spark_etl provides a platform-independent way of building Spark applications.

### Benefit

An application deployed and run with spark-etl is Spark-provider agnostic. That means, for example, you can move your application from Azure HDInsight to AWS EMR without changing the application's code.
### Supported platforms

| Platform | Notes |
| --- | --- |
| Your own Apache Spark cluster | You set up and manage the Apache Spark cluster yourself. |
| PySpark (local) | Uses the PySpark package; fully compatible with the other Spark platforms, so you can test your pipeline on a single computer. |
| Databricks | You host your Spark cluster in Databricks. |
| Amazon AWS EMR | You host your Spark cluster in Amazon AWS EMR. |
| Google Cloud | You host your Spark cluster in Google Cloud. |
| Microsoft Azure HDInsight | You host your Spark cluster in Microsoft Azure HDInsight. |
| Oracle Cloud Infrastructure Data Flow | You host your Spark cluster in the OCI Data Flow service. |
| IBM Cloud | You host your Spark cluster in IBM Cloud. |
## APIs

### Application

An application is a PySpark application (only PySpark is supported so far; Java and Scala support will be added later). An application contains:

- a `main.py` file, which contains the application entry point
- a `manifest.json` file, which specifies the application's metadata
- a `requirements.txt` file, which specifies the application's dependencies
The `Application` class:

- You can create an application via `Application(app_location)`.
- You can build an application via `app.build(destination_location)`.
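Here is a minimal sketch, assuming `Application` is importable from the package top level and using illustrative paths (the `.builds` directory name is not a spark_etl requirement):

```python
from spark_etl import Application

# Load the application from the directory that holds main.py,
# manifest.json and requirements.txt.
app = Application("./myapp")

# Write a build of the application to the destination location.
app.build("./myapp/.builds")
```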
### Application entry signature

In your application's `main.py`, you should have a `main` function with the following signature:

- `spark` is the Spark session object.
- `input_args`, a dict, holds the arguments the user specified when running the job.
- `sysops` holds the system options passed in; it is platform specific. The job submitter may inject platform-specific objects into `sysops`.
- Your `main` function's return value is returned by the job submitter to the caller.
```python
def main(spark, input_args, sysops={}):
    # your code here
```

For complete examples, see https://stonezhong.github.io/spark_etl/.
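Below is an illustrative sketch only (the input key, column names, and returned dict are invented for this example and are not spark_etl conventions):

```python
def main(spark, input_args, sysops={}):
    # input_args is the plain dict supplied at submission time;
    # "rows" is just an illustrative key.
    rows = input_args.get("rows", [("alice", 1), ("bob", 2)])

    # Use the provided Spark session like any other PySpark session.
    df = spark.createDataFrame(rows, ["name", "score"])

    # Whatever main returns is handed back to the caller by the job submitter.
    return {"row_count": df.count()}
```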
### Job Deployer

- A job deployer has a method `deploy(build_dir, destination_location)`, which deploys the application build to the destination location.
- spark_etl supports the following deployers:
  - `spark_etl.vendors.local.LocalDeployer`
  - `spark_etl.deployers.HDFSDeployer`
  - `spark_etl.vendors.oracle.DataflowDeployer`
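A minimal sketch with the local deployer, assuming it needs no extra configuration here (the real deployers typically take platform-specific settings; check the documentation) and using illustrative paths:

```python
from spark_etl.vendors.local import LocalDeployer

# Constructor arguments are omitted in this sketch.
deployer = LocalDeployer()

# deploy(build_dir, destination_location): copy the build to the target location.
deployer.deploy("./myapp/.builds/1.0.0", "/tmp/spark-etl/deployments/myapp")
```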
### Job Submitter

- A job submitter has a method `run(deployment_location, options={}, args={}, handlers=[], on_job_submitted=None)`, which submits a deployed job.
- spark_etl supports the following job submitters:
  - `spark_etl.vendors.local.PySparkJobSubmitter`
  - `spark_etl.job_submitters.livy_job_submitter.LivyJobSubmitter`
  - `spark_etl.vendors.oracle.DataflowJobSubmitter`
- The job submitter's `run` function returns the return value of the job's `main` function.
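A minimal sketch with the PySpark submitter, again assuming no extra constructor configuration (the real submitters typically take platform-specific settings) and using illustrative paths and arguments:

```python
from spark_etl.vendors.local import PySparkJobSubmitter

# Constructor arguments are omitted in this sketch.
submitter = PySparkJobSubmitter()

result = submitter.run(
    "/tmp/spark-etl/deployments/myapp",   # deployment location
    args={"rows": [("alice", 1)]},        # becomes input_args in main()
)
print(result)  # whatever main() returned, e.g. {"row_count": 1}
```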
## Tool: etl.py

### Build an application

To build an application, run:

```bash
./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>
```

- `<app-dir>` is the directory where your application is located.
- `<build-dir>` is the directory where the build output is written.
  - Your build is actually located at `<build-dir>/<version>`, where `<version>` is specified by the application's manifest file.
- The build is mostly platform independent. You need to depend on the oci-core package if you intend to use OCI Data Flow.
### Deploy an application

```bash
./etl.py -a deploy \
    -c <config-filename> \
    --build-dir <build-dir> \
    --deploy-dir <deploy-dir>
```

- `-c <config-filename>`: specifies the config file to use for the deployment.
- `--build-dir <build-dir>`: specifies where to look for the build bits to deploy.
- `--deploy-dir <deploy-dir>`: specifies the destination of the deployment.
### Run a job

```bash
./etl.py -a run \
    -c <config-filename> \
    --deploy-dir <deploy-dir> \
    --version <version> \
    --args <input-json-file>
```

- `-c <config-filename>`: specifies the config file.
- `--deploy-dir <deploy-dir>`: specifies where to look for the deployed job to run.
- `--version <version>`: specifies which version of the app to run.
- `--args <input-json-file>`: optional; points to a JSON file whose content is passed to the job's `main` function as the `input_args` parameter. If this option is missing, `input_args` is set to `{}` when the job's `main` function is called.
- The tool prints the return value of the job's `main` function.
## Examples

See https://stonezhong.github.io/spark_etl/ for examples.