Generic ETL Pipeline Framework for Apache Spark
Project description
See https://stonezhong.github.io/spark_etl/ for more information
Overview
Goal
Many public clouds provide managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the table below for a complete list.
However, each platform has its own way of launching Spark jobs, and these mechanisms are not compatible with each other.
spark-etl is a Python package that simplifies Spark application management across platforms with 3 uniform steps:
- Build your spark application
- Deploy your spark application
- Run your spark application
Benefit
An application built with spark-etl is Spark provider agnostic. For example, you can move your application from Azure HDInsight to AWS EMR without changing the application's code.
You can also run a down-scaled version of your data lake with PySpark on a laptop, since PySpark is a supported Spark platform. This lets you validate your Spark application locally instead of running it in the cloud, which saves cost.
Supported platforms
| Platform | Description |
| --- | --- |
| Apache Spark | You set up your own Apache Spark cluster. |
| PySpark | Uses the PySpark package; fully compatible with the other Spark platforms, and allows you to test your pipeline on a single computer. |
| Databricks | You host your Spark cluster in Databricks. |
| Amazon AWS EMR | You host your Spark cluster in Amazon AWS EMR. |
| Google Cloud | You host your Spark cluster in Google Cloud. |
| Microsoft Azure HDInsight | You host your Spark cluster in Microsoft Azure HDInsight. |
| Oracle Cloud Infrastructure Data Flow | You host your Spark cluster in Oracle Cloud Infrastructure's Data Flow service. |
| IBM Cloud | You host your Spark cluster in IBM Cloud. |
Deploy and run application
Please see the Demos.
APIs
Application
An application is a PySpark application; so far only PySpark is supported, and Java and Scala support will be added later. An application contains:
- A `main.py` file, which contains the application entry point.
- A `manifest.json` file, which specifies the metadata of the application.
- A `requirements.txt` file, which specifies the application's dependencies.
Application class:
- You can create an application via `Application(app_location)`.
- You can build an application via `app.build(destination_location)`.
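Putting the two calls together, a build script might look like the sketch below; the import path and the directory names are assumptions used for illustration:

```python
# Minimal sketch of building a spark-etl application.
# The directory names below are hypothetical; the import path is an
# assumption about the package layout.
from spark_etl import Application

# "./myapp" would contain main.py, manifest.json and requirements.txt
app = Application("./myapp")

# Produce a deployable build under "./.builds/myapp/<version>"
app.build("./.builds/myapp")
```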
Application entry signature
In your application's `main.py`, you should have a `main` function with the following signature:
- `spark` is the Spark session object.
- `input_args` is a dict containing the arguments the user specified when running this job.
- `sysops` holds the system options passed in; it is platform specific. The job submitter may inject platform-specific objects into the `sysops` object.
- Your `main` function's return value will be returned from the job submitter to the caller.
```python
def main(spark, input_args, sysops={}):
    # your code here
```
Here is an example.
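For instance, a minimal `main` might look like the sketch below; the DataFrame contents and the returned dict are invented for illustration:

```python
# A minimal, illustrative main.py for a spark-etl application.
# The data and column names are invented for this sketch.
def main(spark, input_args, sysops={}):
    # Build a tiny DataFrame instead of reading from a real data lake
    df = spark.createDataFrame(
        [(1, "alice"), (2, "bob")],
        ["id", "name"],
    )

    # input_args is a plain dict supplied by the caller at submit time
    limit = input_args.get("limit", 10)
    rows = df.limit(limit).collect()

    # Whatever main returns is handed back to the caller by the job submitter
    return {"row_count": len(rows)}
```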
Job Deployer
For job deployers, please check the wiki.
Job Submitter
For job submitters, please check the wiki.
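To illustrate how build, deploy, and run fit together, here is a hedged sketch of the flow on an HDFS/Livy-backed cluster; the module paths, class names, config keys, hostnames, and paths are assumptions on my part, so treat the wiki as the authoritative reference:

```python
# Hedged sketch of the deploy-and-run flow. The class names, config keys
# and paths below are assumptions; consult the wiki for the actual API.
from spark_etl import Application
from spark_etl.deployers import HDFSDeployer                              # assumed class
from spark_etl.job_submitters.livy_job_submitter import LivyJobSubmitter  # assumed class

# Build the application (directory names are hypothetical)
app = Application("./myapp")
app.build("./.builds/myapp")

# Deploy the build to the cluster (config keys and locations are hypothetical)
deployer = HDFSDeployer({
    "bridge": "spark-master.example.com",
    "stage_dir": "/tmp/spark-etl-stage",
})
deployer.deploy("./.builds/myapp", "hdfs:///etl/apps/myapp")

# Submit the deployed application via Livy (config keys are hypothetical)
submitter = LivyJobSubmitter({
    "service_url": "http://spark-master.example.com:8998",
})
result = submitter.run("hdfs:///etl/apps/myapp/1.0.0", args={"limit": 10})
print(result)  # whatever the application's main() returned
```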
Tool: etl.py
Build an application
To build an application, run
```
./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>
```
- `<app-dir>` is the directory where your application is located.
- `<build-dir>` is the directory where you want your build to be placed.
  - Your build is actually located at `<build-dir>/<version>`, where `<version>` is specified by the application's manifest file.
- The build is mostly platform independent. You can put platform-related packages in the file `common_requirements.txt`.
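For example, if the application's `manifest.json` specifies version `1.0.0` (a hypothetical value), the build produced by the command above lands in `<build-dir>/1.0.0`.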