
Generic ETL Pipeline Framework for Apache Spark

Overview

Goal

spark_etl provides a platform-independent way of building Spark applications.

Benefit

An application deployed and run with spark-etl is Spark-provider agnostic. This means, for example, you can move your application from Azure HDInsight to AWS EMR without changing your application's code.

Supported platforms

  • You set up your own Apache Spark cluster.
  • You use the PySpark package, fully compatible with the other Spark platforms, which lets you test your pipeline on a single computer.
  • You host your Spark cluster in Databricks.
  • You host your Spark cluster in Amazon AWS EMR.
  • You host your Spark cluster in Google Cloud.
  • You host your Spark cluster in Microsoft Azure HDInsight.
  • You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow Service.
  • You host your Spark cluster in IBM Cloud.

APIs

Application

An application is a PySpark application; for now only PySpark is supported, and Java and Scala support will be added later. An application contains:

  • A main.py file, which contains the application entry point.
  • A manifest.json file, which specifies the metadata of the application.
  • A requirements.txt file, which specifies the application's dependencies.
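For illustration, a minimal manifest.json might look like the following; the version field is the one the build layout uses (the build lands at <build-dir>/<version>), and any additional fields are assumptions:

```json
{
    "version": "0.0.1"
}
```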

Application class:

  • You can create an application via Application(app_location)
  • You can build an application via app.build(destination_location)
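The two calls above can be sketched as follows; the top-level import path for Application is an assumption, and the import is deferred inside the function so the sketch can be read without spark-etl installed:

```python
def build_app(app_dir, build_dir):
    # Assumed import path; deferred so this sketch stands alone.
    from spark_etl import Application

    # app_dir is expected to contain main.py, manifest.json and requirements.txt
    app = Application(app_dir)
    # Writes the build artifacts under build_dir (versioned per manifest.json)
    app.build(build_dir)
```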

Application entry signature

In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

  • spark is the Spark session object.
  • input_args, a dict, holds the arguments the user specified when running this job.
  • sysops holds the system options passed in; it is platform specific, and the job submitter may inject platform-specific objects into it.
  • Your main function's return value is returned by the job submitter to the caller.

Here is an example.
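As an illustration, here is a toy main.py that only echoes its input arguments; the input_args key shown is made up, and a real job would use the Spark session to read and transform data:

```python
# Toy main.py: echoes its inputs, so it can run without a cluster.
# A real job would use `spark` (the live SparkSession), e.g.:
#   df = spark.read.json(input_args["input_path"])
def main(spark, input_args, sysops={}):
    name = input_args.get("name", "world")  # "name" is a hypothetical key
    # The return value travels back through the job submitter to the caller.
    return {"message": f"hello, {name}"}
```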

Job Deployer

  • A job deployer has a method deploy(build_dir, destination_location), which deploys the application to the destination location.
  • spark_etl supports the following deployers:
    • spark_etl.vendors.local.LocalDeployer
    • spark_etl.deployers.HDFSDeployer
    • spark_etl.vendors.oracle.DataflowDeployer
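A hedged sketch of a deployment using LocalDeployer; any constructor configuration the deployer may require is omitted here, and the import is deferred so the sketch stands alone:

```python
def deploy_app(build_dir, deploy_dir):
    from spark_etl.vendors.local import LocalDeployer

    # Constructor arguments (e.g. a config dict), if required, are omitted here.
    deployer = LocalDeployer()
    # Copies the build from build_dir to deploy_dir
    deployer.deploy(build_dir, deploy_dir)
```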

Job Submitter

  • A job submitter has a method run(deployment_location, options={}, args={}, handlers=[], on_job_submitted=None), which submits a deployed job.

  • spark_etl supports the following job submitters:

    • spark_etl.vendors.local.PySparkJobSubmitter
    • spark_etl.job_submitters.livy_job_submitter.LivyJobSubmitter
    • spark_etl.vendors.oracle.DataflowJobSubmitter
  • The job submitter's run function returns the return value of the job's main function.
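A sketch of a local job run; it assumes the deployment location includes the version directory and that PySparkJobSubmitter needs no constructor configuration, and the import is deferred so the sketch stands alone:

```python
def run_app(deploy_dir, version, input_args):
    from spark_etl.vendors.local import PySparkJobSubmitter

    # Constructor configuration, if required, is omitted here.
    submitter = PySparkJobSubmitter()
    # run() returns whatever the job's main() returned
    return submitter.run(f"{deploy_dir}/{version}", args=input_args)
```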

Tool: etl.py

Build an application

To build an application, run

./etl.py -a build --app-dir <app-dir> --build-dir <build-dir>
  • <app-dir> is the directory where your application is located.

  • <build-dir> is the directory where you want your build to be placed.

    • Your build is actually located at <build-dir>/<version>, where <version> is specified by the application's manifest file.
  • The build is mostly platform independent. You need to depend on the oci-core package if you intend to use OCI Data Flow.

Deploy an application

./etl.py -a deploy \
    -c <config-filename> \
    --build-dir <build-dir> \
    --deploy-dir <deploy-dir>
  • -c <config-filename>: specifies the config file to use for the deployment.
  • --build-dir <build-dir>: specifies where to look for the build bits to deploy.
  • --deploy-dir <deploy-dir>: specifies the destination for the deployment.

Run a job

./etl.py -a run \
    -c <config-filename> \
    --deploy-dir <deploy-dir> \
    --version <version> \
    --args <input-json-file>
  • -c <config-filename>: specifies the config file.
  • --deploy-dir <deploy-dir>: specifies where to look for the deployed job to run.
  • --version <version>: specifies which version of the app to run.
  • --args <input-json-file>: optional; <input-json-file> points to a JSON file whose value is passed to the job's main function in the input_args parameter. If this option is missing, input_args is set to {} when the job's main function is called.
  • The tool prints the return value of the job's main function.
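For example, a hypothetical <input-json-file> with made-up keys might look like this; the parsed object arrives in the job's main function as the input_args dict:

```json
{
    "name": "sample-run",
    "input_path": "hdfs:///tmp/input"
}
```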

Examples
