Generic ETL Pipeline Framework for Apache Spark
See https://stonezhong.github.io/spark_etl/ for more information.
Overview
Goal
spark_etl provides a platform-independent way of building Spark applications.
Benefit
An application deployed and run with spark_etl is Spark-provider agnostic, which means, for example, you can move it from Azure HDInsight to AWS EMR without changing its code.
Supported platforms
- You set up your own Apache Spark cluster.
- You use the PySpark package; it is fully compatible with the other Spark platforms and lets you test your pipeline on a single computer.
- You host your Spark cluster in Databricks.
- You host your Spark cluster in Amazon AWS EMR.
- You host your Spark cluster in Google Cloud.
- You host your Spark cluster in Microsoft Azure HDInsight.
- You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow Service.
- You host your Spark cluster in IBM Cloud.
APIs
Application
An application is a PySpark application; so far only PySpark is supported, and Java and Scala support will be added later. An application contains:

- A `main.py` file, which contains the application entry point.
- A `manifest.json` file, which specifies the metadata of the application.
- A `requirements.txt` file, which specifies the application's dependencies.
Application class:

- You can create an application via `Application(app_location)`.
- You can build an application via `app.build(destination_location)`.
Application entry signature

In your application's `main.py`, you should have a `main` function with the following signature:

```python
def main(spark, input_args, sysops={}):
    # your code here
```

- `spark` is the Spark session object.
- `input_args`, a dict, contains the arguments the user specified when running this job.
- `sysops` holds the system options passed in; it is platform specific. The job submitter may inject a platform-specific object into `sysops`.
- Your `main` function's return value will be returned from the job submitter to the caller.
Here is an example.
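As an illustration, here is a hypothetical `main.py` that counts the rows of a JSON dataset; the input path, the `input_path` key, and the returned dict are invented for this sketch:

```python
def main(spark, input_args, sysops={}):
    # Read the input path from the user-supplied arguments,
    # falling back to a hypothetical default when none is given.
    input_path = input_args.get("input_path", "/data/events")

    # A trivial job: load a JSON dataset and count its rows.
    df = spark.read.json(input_path)

    # The return value is handed back by the job submitter to the caller.
    return {"row_count": df.count()}
```

Because the job only talks to the `spark` session it is given, the same file runs unchanged on any of the supported platforms.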
Job Deployer
- A job deployer has a method `deploy(build_dir, destination_location)`, which deploys the application to the destination location.
- spark_etl supports the following deployers:
  - `spark_etl.vendors.local.LocalDeployer`
  - `spark_etl.deployers.HDFSDeployer`
  - `spark_etl.vendors.oracle.DataflowDeployer`
Job Submitter
- A job submitter has a method `run(deployment_location, options={}, args={}, handlers=[], on_job_submitted=None)`, which submits a deployed job.
- spark_etl supports the following job submitters:
  - `spark_etl.vendors.local.PySparkJobSubmitter`
  - `spark_etl.job_submitters.livy_job_submitter.LivyJobSubmitter`
  - `spark_etl.vendors.oracle.DataflowJobSubmitter`
- The job submitter's `run` function returns the return value from the job's `main` function.
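Putting the pieces together, a sketch of the programmatic flow on the local platform might look like this. The method signatures follow the documentation above, but the import paths, the top-level `Application` export, and the (omitted) constructor arguments are assumptions; check them against your installed spark_etl version:

```python
def build_deploy_run(app_dir, build_dir, deploy_dir):
    """Build an application, deploy it locally, and submit the job."""
    # Lazy imports; the paths mirror the class names documented above,
    # but verify them in your spark_etl version (assumption).
    from spark_etl import Application
    from spark_etl.vendors.local import LocalDeployer, PySparkJobSubmitter

    app = Application(app_dir)     # app_dir holds main.py, manifest.json, requirements.txt
    app.build(build_dir)           # build lands under <build_dir>/<version>

    deployer = LocalDeployer()     # constructor arguments omitted (assumption)
    deployer.deploy(build_dir, deploy_dir)

    submitter = PySparkJobSubmitter()  # constructor arguments omitted (assumption)
    # run() returns whatever the job's main() returned
    return submitter.run(deploy_dir, args={"input_path": "/data/events"})
```

Because only the deployer and submitter classes name a platform, switching providers means swapping those two objects while the application itself stays untouched.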
Tool: etl.py
Build an application
To build an application, run:

```shell
./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>
```
- `<app-dir>` is the directory where your application is located.
- `<build-dir>` is the directory where you want your build to be placed.
  - Your build is actually located at `<build-dir>/<version>`, where `<version>` is specified by the application's manifest file.
- The build is mostly platform independent; you need to depend on the package oci-core if you intend to use OCI Dataflow.
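Since the build lands at `<build-dir>/<version>`, the versioned path can be resolved from the manifest. The `version` key name below is an assumption about the manifest format, and the paths are invented for illustration:

```python
import json
import os

# Hypothetical manifest.json content; the real file lives in <app-dir>,
# and the "version" key name is an assumption.
manifest = json.loads('{"version": "1.0.0"}')

# The build output for this version lives under <build-dir>/<version>.
build_dir = "/tmp/builds/myapp"
versioned_build = os.path.join(build_dir, manifest["version"])
print(versioned_build)  # /tmp/builds/myapp/1.0.0
```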
Deploy an application
```shell
./etl.py -a deploy \
    -c <config-filename> \
    --build-dir <build-dir> \
    --deploy-dir <deploy-dir>
```
- `-c <config-filename>`: specifies the config file to use for the deployment.
- `--build-dir <build-dir>`: specifies where to look for the build bits to deploy.
- `--deploy-dir <deploy-dir>`: specifies the destination for the deployment.
Run a job
```shell
./etl.py -a run \
    -c <config-filename> \
    --deploy-dir <deploy-dir> \
    --version <version> \
    --args <input-json-file>
```
- `-c <config-filename>`: specifies the config file.
- `--deploy-dir <deploy-dir>`: specifies where to look for the deployed job to run.
- `--version <version>`: specifies which version of the app to run.
- `--args <input-json-file>`: optional parameter holding input for the job. `<input-json-file>` points to a JSON file whose value is passed to the job's `main` function in the `input_args` parameter. If this option is missing, `input_args` is set to `{}` when calling the `main` function of the job.
- The tool prints the return value of the `main` function of the job.
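To make the `--args` flow concrete: the file is plain JSON, and its parsed top-level value is what arrives in `input_args`. The keys and values below are invented for illustration:

```python
import json

# Contents of a hypothetical input.json passed via --args:
raw = '{"input_path": "/data/events", "limit": 100}'

# The tool parses the file and passes the result to the job's
# main() function as the input_args parameter.
input_args = json.loads(raw)
print(input_args["limit"])  # 100
```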
Examples