
Data Pipelines with Spark on AWS


Yaetos

Yaetos is a framework for writing ETLs on top of Spark (through its python binding, pyspark) and deploying them to Amazon Web Services (AWS). It can run locally (using local datasets and running the process on your machine) or on AWS (using S3 datasets and running the process on an AWS cluster). The emphasis is on simplicity while giving access to the full power of Spark for processing large datasets. All job input and output definitions live in a human-readable YAML file. Its name stands for "Yet Another ETL Tool on Spark".

  • In the simplest cases, an ETL job can consist of an SQL file only. No programming knowledge is needed for these.
  • In more complex cases, an ETL job can consist of a python file, giving access to Spark dataframes, RDDs and any python library (see the sketch after this list).
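
As an illustration of the python case, a job is typically a small class whose transform method receives the registered inputs as Spark dataframes and returns the output dataframe. The sketch below is a minimal, non-authoritative example assuming the ETL_Base / Commandliner names used in jobs/examples/; see ex1_frameworked_job.py for the exact interface.

from yaetos.etl_utils import ETL_Base, Commandliner  # import path assumed from the bundled examples

class Job(ETL_Base):
    def transform(self, some_events, other_events):
        # 'some_events' and 'other_events' are hypothetical inputs, resolved from the
        # job registry (jobs_metadata yml) and handed to transform() as Spark dataframes.
        df = some_events.join(other_events, on='session_id', how='inner')
        return df  # the framework writes the returned dataframe to the registered output

if __name__ == "__main__":
    Commandliner(Job)  # entry point; the bundled examples may pass extra arguments here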

Some features:

  • Runs locally and on an AWS cluster
  • Supports dependencies across jobs
  • Supports incremental loading and processing
  • Creates an AWS cluster when needed or piggybacks on an existing one
  • ETL code is git-controllable and unit-testable
  • Integrates with any python library or Spark ML to build machine learning applications and more

To try it

Run the installation instructions (see below), then run this SQL example with:

python yaetos/sql_job.py  --sql_file=jobs/examples/ex1_full_sql_job.sql

It will run locally, taking the inputs defined in the job registry file (conf/jobs_metadata_local.yml), transforming them with ex1_full_sql_job.sql through the Spark SQL engine, and dumping the output to the path defined in that same registry file. To run the same SQL example on an AWS cluster, add --deploy=EMR to the command line above. In that case, inputs and outputs are taken from the S3 locations defined in the jobs_metadata file. If you don't have a cluster available, it will create one and terminate it after the job is finished. You can follow the job's progress in the "Steps" tab of your AWS EMR web page.
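
Under the hood, running an SQL job amounts to loading each registered input as a dataframe, exposing it as a temporary view, and executing the query with the Spark SQL engine. The snippet below is a standalone illustration of that mechanism, not the framework's actual code; the input path, view name and query are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_job_illustration").getOrCreate()

# Load a registered input and expose it to SQL. In Yaetos, the path and the
# table name come from the jobs_metadata yml file; these ones are hypothetical.
events = spark.read.csv("data/some_events.csv", header=True, inferSchema=True)
events.createOrReplaceTempView("some_events")

# Run the job's SQL (here inlined) against the registered views and inspect the output.
output = spark.sql("SELECT session_id, count(*) AS n_events FROM some_events GROUP BY session_id")
output.show()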

To run an ETL that showcases manipulation of Spark dataframes, more flexible than the SQL example above, run the frameworked pyspark example ex1_frameworked_job.py with:

python jobs/examples/ex1_frameworked_job.py

To try an example with job dependencies, run ex4_dependency4_job.py with:

python jobs/examples/ex4_dependency4_job.py --dependencies

It will run all 3 dependencies defined in the jobs_metadata registry. There are other examples in jobs/examples/.
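
Conceptually, the --dependencies flag walks the job graph declared in the registry and runs upstream jobs before the job that needs them. The toy snippet below only illustrates that ordering idea with a hand-written graph (job names are hypothetical); it is not the framework's implementation.

from graphlib import TopologicalSorter  # standard library, python 3.9+

# Hypothetical dependency graph: each job lists the jobs it depends on.
job_graph = {
    "ex4_dependency2_job": ["ex4_dependency1_job"],
    "ex4_dependency3_job": ["ex4_dependency2_job"],
    "ex4_dependency4_job": ["ex4_dependency3_job"],
}

for job in TopologicalSorter(job_graph).static_order():
    print("would run:", job)  # prints jobs in execution order, upstream first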

Development Flow

To write a new ETL, create a new file in the jobs/ folder (or any subfolder), either a .sql file or a .py file, following the examples from that same folder. Register that job, its inputs, and its output path in conf/jobs_metadata.yml to run on an AWS cluster, or in conf/jobs_metadata_local.yml to run locally. To run the jobs, execute command lines following the same patterns as above:

python yaetos/sql_job.py  --sql_file=jobs/examples/some_sql_file.sql
# or
python jobs/examples/ex1_frameworked_job.py

Add --deploy=EMR to deploy and run on an AWS cluster.

You can specify dependencies in the job registry, for local jobs or jobs on an AWS cluster.

Jobs can be unit-tested with py.test. For a given job, create a corresponding test file in the tests/jobs/ folder and add tests covering the business logic specific to that job. See tests/jobs/ex1_frameworked_job_test.py for an example.
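
A test usually builds small input dataframes, runs the job's transformation logic on them, and asserts on the result. The sketch below is a generic pytest + pyspark pattern under that assumption (fixture, data and logic are hypothetical stand-ins); see tests/jobs/ex1_frameworked_job_test.py for the project's actual conventions.

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local Spark session, enough for unit tests on small dataframes.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_one_row_per_session(spark):
    # Hypothetical business-logic check: build a tiny input dataframe ...
    events = spark.createDataFrame(
        [("s1", "click"), ("s1", "view"), ("s2", "click")],
        ["session_id", "action"])
    # ... apply the logic under test (a stand-in aggregation here) ...
    result = events.groupBy("session_id").count()
    # ... and assert on the output.
    assert result.count() == 2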

Unit-testing

... is done using py.test. Run them with:

py.test tests/*  # for all tests
py.test tests/jobs/examples/ex1_frameworked_job.py  # for tests for a specific file

Installation instructions

To avoid installing dependencies on your machine manually, you can run the jobs from a docker container, with Spark and the required python libraries already set up. The docker setup is included.

pip install yaetos
cd /path/to/an/empty/folder/that/will/contain/pipeline/code
yaetos setup  # to create sub-folders and setup framework files.
yaetos launch_env # to launch the docker container
# From inside the docker container, try a test pipeline with
python jobs/examples/ex1_frameworked_job.py --dependencies

The docker container is set up to share the current folder with the host, so ETL jobs can be written from your host machine, using any IDE, and run from the container directly.

To get jobs executed and/or scheduled on AWS, you need to:

  • fill in the AWS parameters in conf/config.cfg.
  • have the ~/.aws/ folder set up with your AWS credentials. If it isn't, run pip install awscli, then aws configure (a quick sanity check is sketched below).
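
As a quick sanity check that the credentials are picked up, you can call AWS from python with boto3 (pip install boto3 if needed); the calls below only read your account identity and the EMR clusters visible to it.

import boto3

# Confirms that ~/.aws/ (or environment variables) provide working credentials.
identity = boto3.client("sts").get_caller_identity()
print("Authenticated as account:", identity["Account"])

# Lists the active EMR clusters, i.e. what a --deploy=EMR run could piggyback on.
clusters = boto3.client("emr").list_clusters(ClusterStates=["WAITING", "RUNNING"])
print("Active clusters:", [c["Name"] for c in clusters["Clusters"]])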

Potential improvements

  • more unit-testing
  • integration with other scheduling tools (airflow...)
  • integration with other resource provisioning tools (kubernetes...)
  • automatic pulling/pushing of data from S3 to local (sampled) for local development
  • easier dataset reconciliation
  • ...

Lots of room for improvement. Contributions welcome.
