Generic ETL Pipeline Framework for Apache Spark
Project description
See https://stonezhong.github.io/spark_etl/ for more information
Overview
Goal
Many public clouds offer managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the table below for a complete list.
However, each platform has its own way of launching Spark jobs, and these launch mechanisms are not compatible with each other.
spark-etl is a Python package that simplifies Spark application management across platforms with 3 uniform steps:
- Build your spark application
- Deploy your spark application
- Run your spark application
Benefit
An application using spark-etl is Spark provider agnostic. For example, you can move your application from Azure HDInsight to AWS EMR without changing its code.
Since PySpark is a supported Spark platform, you can also run a down-scaled version of your data lake with PySpark on a laptop. This lets you validate your Spark application locally instead of running it in the cloud, saving cost.
Supported platforms
| Platform | Description |
|---|---|
| Apache Spark | You set up your own Apache Spark cluster. |
| PySpark | Uses the PySpark package; fully compatible with the other Spark platforms, and allows you to test your pipeline on a single computer. |
| Databricks | You host your Spark cluster in Databricks. |
| Amazon AWS EMR | You host your Spark cluster in Amazon AWS EMR. |
| Google Cloud | You host your Spark cluster in Google Cloud. |
| Microsoft Azure HDInsight | You host your Spark cluster in Microsoft Azure HDInsight. |
| Oracle Cloud Infrastructure Data Flow | You host your Spark cluster in the Oracle Cloud Infrastructure Data Flow service. |
| IBM Cloud | You host your Spark cluster in IBM Cloud. |
Deploy and run application
Please see the Demos.
APIs
Application
An application is a PySpark application; so far only PySpark is supported, and Java and Scala support will be added later. An application contains:
- A `main.py` file, which contains the application entry point.
- A `manifest.json` file, which specifies the metadata of the application.
- A `requirements.txt` file, which specifies the application's dependencies.
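For illustration, a minimal `manifest.json` might look like the following. The only field the build step is documented to rely on is `version` (the build lands under `<build-dir>/<version>`); the `display_name` field here is hypothetical:

```json
{
    "version": "1.0.0",
    "display_name": "my-etl-app"
}
```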
Application class:
- You can create an application via `Application(app_location)`.
- You can build an application via `app.build(destination_location)`.
Application entry signature
In your application's `main.py`, you should have a `main` function with the following signature:

```python
def main(spark, input_args, sysops={}):
    # your code here
```

- `spark` is the Spark session object.
- `input_args` is a dict of the arguments the user specified when running the job.
- `sysops` holds the system options passed in; it is platform specific, and the job submitter may inject platform-specific objects into `sysops`.
- Your `main` function's return value is returned from the job submitter to the caller.
Here is an example.
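As a further illustration, here is a minimal, hypothetical `main.py` following the signature above. The `input_location` argument name and the record-handling logic are made up for this sketch; a real job would use the `spark` session:

```python
def main(spark, input_args, sysops={}):
    # Read a location from the user-supplied arguments.
    # "input_location" is a hypothetical argument name for this example.
    input_location = input_args.get("input_location", "/data/events")

    # A real job would use the spark session here, e.g.:
    #   df = spark.read.parquet(input_location)
    #   count = df.count()
    count = 0  # placeholder so the sketch runs without a cluster

    # The return value is passed back to the caller by the job submitter.
    return {"status": "ok", "input_location": input_location, "count": count}
```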
Job Deployer
For job deployers, please check the wiki.
Job Submitter
For job submitters, please check the wiki.
Tool: etl.py
Build an application
To build an application, run:

```shell
./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>
```

- `<app-dir>` is the directory where your application is located.
- `<build-dir>` is the directory where you want your build to be placed.
  - Your build is actually located at `<build-dir>/<version>`, where `<version>` is specified by the application's manifest file.
- The build is mostly platform independent. You can put platform-related packages in the file `common_requirements.txt`.
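To make the `<build-dir>/<version>` convention concrete, the sketch below simulates it with plain shell commands. The app name, version, and paths are made up, and it does not invoke `etl.py` itself:

```shell
# Create a toy application directory with a manifest declaring its version.
mkdir -p myapp
cat > myapp/manifest.json <<'EOF'
{"version": "1.0.0"}
EOF
echo 'def main(spark, input_args, sysops={}): pass' > myapp/main.py

# The build step places artifacts under <build-dir>/<version>,
# where <version> is read from the manifest.
VERSION=$(python3 -c "import json; print(json.load(open('myapp/manifest.json'))['version'])")
mkdir -p ".builds/$VERSION"
cp myapp/main.py myapp/manifest.json ".builds/$VERSION/"
```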
Download files

- Source Distribution: `spark-etl-0.0.92.tar.gz`
- Built Distribution: `spark_etl-0.0.92-py3-none-any.whl`
File details
Details for the file spark-etl-0.0.92.tar.gz.
File metadata
- Download URL: spark-etl-0.0.92.tar.gz
- Upload date:
- Size: 20.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.6.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bdefc378b8079e99439912acf8b3e4b2e8172fcc20d2df12ac525bfe3da1b852` |
| MD5 | `082f719bf858912fb06c5f9f796e566b` |
| BLAKE2b-256 | `944b136deba375279ebcdbd952f0924e56fa947f9a2ca994871029e6b7b93480` |
File details
Details for the file spark_etl-0.0.92-py3-none-any.whl.
File metadata
- Download URL: spark_etl-0.0.92-py3-none-any.whl
- Upload date:
- Size: 28.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.6.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e9f4da811fed499449610573a2856513e20a9187d9d7aa498fcf2f8926db1e62` |
| MD5 | `f04882a85b4019e3217eda4841d4599d` |
| BLAKE2b-256 | `a224c3622ee1e4cfafdae9b0b7bdc69ec9443fff44a601654d2f9d7594eacc5f` |