Generic ETL Pipeline Framework for Apache Spark

Overview

Goal

spark_etl provides a platform-independent way of building Spark applications.

Benefit

  • Your application can be moved to a different Spark platform without change, or with very little change.

Supported platforms

  • Local Spark via the pyspark package.
  • Spark clusters via the Livy interface.
  • Oracle Dataflow.

Concepts

Application

An application is the code for a Spark job. It contains:

  • A main.py file, which contains the application entry point.
  • A manifest.json file, which specifies the metadata of the application.
  • A requirements.txt file, which specifies the application's dependencies.

See examples/myapp for an example.
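
For illustration, a minimal manifest.json could look like the sketch below. The version field is the one the build step below relies on; any additional fields should be taken from the repo's examples rather than from this sketch:

{
    "version": "1.0.0"
}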

Build an application

To build an application, run:

./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>
  • <app-dir> is the directory where your application is located.

  • <build-dir> is the directory where you want your build to be placed.

    • Your build is actually located at <build-dir>/<version>, where <version> is specified by the application's manifest file.
  • The build is mostly platform independent. You need to depend on the oci-core package if you intend to use OCI Dataflow.
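
For example, to build the sample application shipped in the repo (the build directory name here is just an illustration):

./etl.py -a build --app-dir examples/myapp --build-dir .builds/myapp

The build output would then land in .builds/myapp/<version>, with <version> taken from examples/myapp's manifest.json.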

Application entry signature

In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

  • spark is the Spark session object.
  • input_args is a dict containing the arguments the user specified when running the job.
  • sysops is the system options passed in; it is platform specific.
  • Your main function's return value will be returned from the job submitter to the caller.

See here for an example.
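
As a concrete sketch, a minimal main.py might look like the following; the DataFrame computation is purely illustrative:

def main(spark, input_args, sysops={}):
    # spark is the SparkSession supplied by the framework.
    # input_args is the dict loaded from the --args JSON file ({} if omitted).
    df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "name"])
    # The return value travels back through the job submitter to the caller.
    return {"row_count": df.count()}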

Deployer

  • spark_etl supports the following deployers:
    • spark_etl.vendors.local.LocalDeployer
    • spark_etl.deployers.HDFSDeployer
    • spark_etl.vendors.oracle.DataflowDeployer

The etl.py command uses the config file to decide which deployer to use.
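
The config file format is defined by the examples in the repo; as a rough, hypothetical sketch, a config selecting the local deployer might name the deployer class along these lines (the field names are assumptions, not a documented schema):

{
    "deployer": {
        "class": "spark_etl.vendors.local.LocalDeployer"
    }
}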

Job Submitter

  • spark_etl supports the following job submitters:
    • spark_etl.vendors.local.PySparkJobSubmitter
    • spark_etl.job_submitters.livy_job_submitter.LivyJobSubmitter
    • spark_etl.vendors.oracle.DataflowJobSubmitter
  • A job submitter's run function returns the return value from the job's main function.

The etl.py command uses the config file to decide which job submitter to use.
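
Continuing the hypothetical config sketch above, the same file would also name the job submitter class (again, the field names are assumptions):

{
    "deployer": {
        "class": "spark_etl.vendors.local.LocalDeployer"
    },
    "job_submitter": {
        "class": "spark_etl.vendors.local.PySparkJobSubmitter"
    }
}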

Deploy a job using the etl.py command (examples/etl.py):

./etl.py -a deploy \
    -c <config-filename> \
    --build-dir <build-dir> \
    --deploy-dir <deploy-dir>
  • -c <config-filename>: specifies the config file to use for the deployment.
  • --build-dir <build-dir>: specifies where to look for the build bits to deploy.
  • --deploy-dir <deploy-dir>: specifies the destination for the deployment.
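
For example, deploying the build produced earlier (the config filename and directory names are illustrative):

./etl.py -a deploy \
    -c config_local.json \
    --build-dir .builds/myapp \
    --deploy-dir /tmp/spark_etl/deployments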

Run a job

./etl.py -a run \
    -c <config-filename> \
    --deploy-dir <deploy-dir> \
    --version <version> \
    --args <input-json-file>
  • -c <config-filename>: specifies the config file.
  • --deploy-dir <deploy-dir>: specifies where to look for the deployed bits to run.
  • --version <version>: specifies which version of the app to run.
  • --args <input-json-file>: an optional parameter that supplies input for the job. <input-json-file> points to a JSON file whose contents are passed to the job's main function via the input_args parameter. If this option is missing, input_args is set to {} when the job's main function is called.
  • The command prints the return value of the job's main function.
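
For example, given a hypothetical input file args.json:

{
    "date": "2021-01-01"
}

the following command runs version 1.0.0 of the deployed app; its main function receives input_args == {"date": "2021-01-01"}, and whatever it returns is printed:

./etl.py -a run \
    -c config_local.json \
    --deploy-dir /tmp/spark_etl/deployments \
    --version 1.0.0 \
    --args args.json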

Examples

See the examples directory in the repository (e.g., examples/myapp and examples/etl.py) for complete, runnable examples.
