Data Pipelines with Spark on AWS

Yaetos

TODO: update

This is a framework for writing ETLs on top of Spark (through its Python binding, PySpark) and deploying them to Amazon Web Services (AWS). Jobs can run locally (using local datasets and running the process on your machine) or on AWS (using S3 datasets and running the process on an AWS cluster). The emphasis is on simplicity while giving access to the full power of Spark for processing large datasets. All job input and output definitions are in a human-readable YAML file. Its name stands for "Yet Another ETL Tool on Spark".

  • In the simplest cases, an ETL job can consist of a single SQL file. No programming knowledge is needed for these.
  • In more complex cases, an ETL job can consist of a Python file, giving access to Spark DataFrames, RDDs, and any Python library (a sketch follows this list).
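To give a feel for the second case, here is a minimal, hypothetical sketch of the kind of logic such a .py job holds. It uses plain PySpark only, not the framework's job interface (see jobs/examples/ex1_frameworked_job.py for a real job), and the toy dataset stands in for inputs the registry would normally provide:

# Hypothetical sketch, plain PySpark, not the framework's job interface.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()

# Toy input standing in for a dataset defined in the job registry.
events = spark.createDataFrame(
    [("2021-01-01", "signup"), ("2021-01-01", "click"), ("2021-01-02", "click")],
    ["day", "action"])

# DataFrame manipulation: count actions per day.
counts = events.groupBy("day", "action").agg(F.count("*").alias("n"))
counts.show()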

Some features:

  • Runs locally and on an AWS cluster
  • Supports dependencies across jobs
  • Supports incremental loading and processing
  • Creates an AWS cluster when needed, or piggybacks on an existing one
  • ETL code is git-controllable and unit-testable
  • Integrates with any Python library or Spark ML to build machine learning applications and more

To try it

Follow the installation instructions (see below) and run this SQL example with:

python yaetos/sql_job.py --sql_file=jobs/examples/ex1_full_sql_job.sql

It will run locally, taking its inputs from the job registry file (conf/jobs_metadata_local.yml), transforming them with ex1_full_sql_job.sql using the Spark SQL engine, and dumping the output to the location defined in that same registry. To run the same SQL example on an AWS cluster, add the -d (TODO: check) argument to the command line above. In that case, inputs and outputs are taken from the S3 locations defined in the jobs_metadata file. If you don't have a cluster available, one will be created and then terminated once the job is finished. You can follow the job's progress in the "Steps" tab of the AWS EMR web console.
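Conceptually, a local run of such a SQL-only job boils down to the sketch below. This is a simplified, hypothetical illustration: the actual input names, paths and output location come from the registry, and the framework handles them for you.

# Simplified sketch of a local SQL-only job run; paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql_sketch").getOrCreate()

# Load an input defined in the registry and expose it as a SQL table.
events = spark.read.csv("data/some_events.csv", header=True, inferSchema=True)
events.createOrReplaceTempView("some_events")

# Run the job's SQL file with the Spark SQL engine.
query = open("jobs/examples/ex1_full_sql_job.sql").read()
output = spark.sql(query)

# Dump the output to the location defined in the registry (S3 when deployed).
output.write.mode("overwrite").csv("data/output/ex1_full_sql_job")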

To run an ETL that showcases manipulation of Spark DataFrames, which is more flexible than the SQL example above, run the frameworked PySpark example ex1_frameworked_job.py with this:

python jobs/examples/ex1_frameworked_job.py

To try an example with job dependencies, run ex4_dependency4_job.py with this:

python jobs/examples/ex4_dependency4_job.py -x

It will run all 3 dependencies defined in the jobs_metadata registry (run order sketched below). There are other examples in jobs/examples/.
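The sketch below illustrates what running with dependencies means: upstream jobs execute first, in order. It is not the framework's actual scheduler, and the upstream job names are hypothetical.

# Illustration only: resolve a run order from a job-to-dependencies mapping.
job_dependencies = {
    "ex4_dependency1_job": [],
    "ex4_dependency2_job": ["ex4_dependency1_job"],
    "ex4_dependency3_job": ["ex4_dependency2_job"],
    "ex4_dependency4_job": ["ex4_dependency3_job"],
}

def run_order(target, deps):
    """Return the jobs to run, upstream dependencies first."""
    order = []
    def visit(job):
        if job in order:
            return
        for upstream in deps.get(job, []):
            visit(upstream)
        order.append(job)
    visit(target)
    return order

print(run_order("ex4_dependency4_job", job_dependencies))
# -> the 3 upstream jobs, then ex4_dependency4_job itself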

Development Flow

To write a new ETL, create a new file in the jobs/ folder (or any subfolder), either a .sql file or a .py file, following the examples in that same folder. Then register the job, its inputs, and its output path location in conf/jobs_metadata.yml to run on an AWS cluster, or in conf/jobs_metadata_local.yml to run locally. To run the jobs, execute the command lines following the same patterns as above:

python yaetos/sql_job.py --sql_file=jobs/examples/same_sql_file.sql
# or
python jobs/examples/ex1_frameworked_job.py

And add the -d (TODO: check) argument to deploy and run on an AWS cluster.

You can specify dependencies in the job registry, for jobs run locally or on an AWS cluster.

Jobs can be unit-tested using py.test. For a given job, create a corresponding test file in the tests/jobs/ folder and add tests that cover the business logic specific to that job. See tests/jobs/ex1_frameworked_job_test.py for an example.

Unit-testing

... is done using py.test. Run them with:

py.test tests/*  # for all tests
py.test tests/jobs/examples/ex1_frameworked_job.py  # for tests for a specific file
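As a hypothetical illustration (the repo's actual test setup in tests/ may differ), a job test typically spins up a local SparkSession and checks the transformation logic on a small, hand-built DataFrame. In a real test you would call the job's transform logic rather than the inline aggregation used here:

# Hypothetical pytest sketch with a local SparkSession.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_counts_clicks_per_day(spark):
    events = spark.createDataFrame(
        [("2021-01-01", "click"), ("2021-01-01", "click")], ["day", "action"])
    counts = events.groupBy("day").count().collect()
    assert counts[0]["count"] == 2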

Installation instructions

To avoid installing dependencies on your machine manually, you can run the jobs from a Docker container, with Spark and the required Python libraries already set up. A Dockerfile is included to build this container.

pip install ...
cd ~/path/to/repo/
docker build -t spark_container .  # the '.' matters
docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -v /absolute/path/to/pyspark_aws_etl:/mnt/pyspark_aws_etl -v ~/.aws:/root/.aws -h spark spark_container  # remove "-v ~/.aws:/root/.aws" if you don't intend to send jobs to AWS

It will bring you inside the container's bash terminal, from where you can run the jobs. The container is set up to mount the repository from your host, so you can write ETL jobs on your host machine and run them from within the container.

To send jobs to an AWS cluster, you also need to copy the config file conf/config.cfg.example, save it as conf/config.cfg, and fill in your AWS setup. You should also have your ~/.aws folder set up (via the AWS command line) with the corresponding AWS account information and secret keys.

If you want to run the example jobs, you need to run scripts/setup_examples.sh, again either from your host machine or from within the Docker container. It will download some small input datasets to your computer and push them to Amazon S3 storage. Note that this involves manually creating a bucket in your S3 account and setting its name at the top of scripts/setup_examples.sh. (TODO: update)
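For reference, pushing a dataset to S3 amounts to something like the snippet below. This is not the setup script itself (scripts/setup_examples.sh is a shell script); the bucket name and paths are placeholders for the ones you configure:

# Rough illustration of the S3 upload the setup script performs; names are placeholders.
import boto3

s3 = boto3.client("s3")  # picks up credentials from ~/.aws
s3.upload_file(
    Filename="data/some_events.csv",    # small dataset downloaded locally
    Bucket="your-bucket-name",          # the bucket you created manually
    Key="pyspark_aws_etl/inputs/some_events.csv")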

Potential improvements

  • more unit-testing
  • integration with other scheduling tools (Airflow, ...)
  • automatic pulling/pushing of data from S3 to local (sampled) for local development
  • easier dataset reconciliation
  • ...

Lots of room for improvement. Contributions welcome.

