Generic ETL Pipeline Framework for Apache Spark
Overview
Goal
spark_etl provides a platform-independent way of building Spark applications.
Benefit
- Your application can be moved to a different Spark platform with little or no change.
Supported platforms
- Local Spark via the pyspark package.
- Spark cluster with the Livy interface.
- Oracle Dataflow
Concepts
Application
An application is the code for a Spark job. It contains:
- A main.py file, which contains the application entry point.
- A manifest.json file, which specifies the metadata of the application.
- A requirements.txt file, which specifies the application's dependencies.

See examples/myapp for an example.
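As an illustration, a minimal manifest.json might look like the sketch below. Only the version field is referenced elsewhere in this document (the build is placed under a directory named after it); treat the exact set of required fields as an assumption to verify against examples/myapp:

```json
{
    "version": "1.0.0"
}
```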
Build an application
To build an application, run:

./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>

- <app-dir> is the directory where your application is located.
- <build-dir> is the directory where you want the build to be placed.
- Your build is actually located at <build-dir>/<version>, where <version> is specified by the application's manifest file.
- The build is mostly platform independent. You need to depend on the package oci-core if you intend to use OCI Dataflow.
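The build-location rule above can be sketched in a few lines of Python; the manifest content and build directory here are made-up values for illustration:

```python
import json

# Assumed manifest.json content; only "version" matters for the build path.
manifest = json.loads('{"version": "1.0.0"}')

build_dir = "/tmp/mybuild"  # hypothetical <build-dir>

# The build for this version ends up at <build-dir>/<version>.
build_location = f"{build_dir}/{manifest['version']}"
```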
Application entry signature
In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

- spark is the Spark session object.
- input_args, a dict, holds the arguments the user specified when running the job.
- sysops is the system options passed; it is platform specific.
- Your main function's return value will be returned from the job submitter to the caller.

See here for an example.
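Here is a minimal, hypothetical main.py following that signature. The job logic (summing a list of numbers taken from input_args) is invented for illustration; only the signature and the return-value behavior come from this document:

```python
def main(spark, input_args, sysops={}):
    # input_args holds the user-supplied arguments (from the --args JSON file).
    numbers = input_args.get("numbers", [])
    # Distribute the numbers across the cluster and add them up.
    rdd = spark.sparkContext.parallelize(numbers)
    total = rdd.sum()
    # The return value travels back through the job submitter to the caller.
    return {"sum": total}
```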
Deployer
spark_etl supports the following deployers:
- spark_etl.vendors.local.LocalDeployer
- spark_etl.deployers.HDFSDeployer
- spark_etl.vendors.oracle.DataflowDeployer

The etl.py command uses the config file to decide which deployer to use.
Job Submitter
spark_etl supports the following job submitters:
- spark_etl.vendors.local.PySparkJobSubmitter
- spark_etl.job_submitters.livy_job_submitter.LivyJobSubmitter
- spark_etl.vendors.oracle.DataflowJobSubmitter

A job submitter's run function returns the return value from the job's main function.
The etl.py command uses the config file to decide which job submitter to use.
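Config-driven selection like this can be pictured roughly as dynamic class loading from a dotted path. This is a generic sketch, not spark_etl's actual implementation, and the config key name is an assumption:

```python
import importlib

def load_class(dotted_path):
    """Import a class from a dotted path such as 'pkg.module.ClassName'."""
    module_name, class_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)

# Hypothetical config: pick the submitter class by its dotted path.
config = {"job_submitter": "spark_etl.vendors.local.PySparkJobSubmitter"}
# submitter_cls = load_class(config["job_submitter"])  # needs spark-etl installed
```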
Deploy a job using the etl.py command (see examples/etl.py):
./etl.py -a deploy \
-c <config-filename> \
--build-dir <build-dir> \
--deploy-dir <deploy-dir>
- -c <config-filename>: specifies the config file to use for the deployment.
- --build-dir <build-dir>: specifies where to look for the build bits to deploy.
- --deploy-dir <deploy-dir>: specifies the destination for the deployment.
Run a job
./etl.py -a run \
-c <config-filename> \
--deploy-dir <deploy-dir> \
--version <version> \
--args <input-json-file>
- -c <config-filename>: specifies the config file.
- --deploy-dir <deploy-dir>: specifies where to look for the deployed bits to run.
- --version <version>: specifies which version of the app to run.
- --args <input-json-file>: optional parameter supplying input for the job. <input-json-file> points to a JSON file whose value is passed to the job's main function in the input_args parameter. If this option is missing, input_args is set to {} when calling the job's main function.
- It prints the return value of the job's main function.
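The --args behavior can be illustrated with plain JSON handling; the file contents here are made up for the example:

```python
import json

# Contents of a hypothetical <input-json-file> passed via --args.
args_text = '{"numbers": [1, 2, 3]}'
input_args = json.loads(args_text)

# Without --args, the job's main() receives an empty dict instead.
default_input_args = {}
```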
Examples