Skip to main content

An opinionated framework for ETL built on top of Airflow

Project description

gusty

Versions PyPi build coverage

gusty allows you to manage your Airflow DAGs and tasks with greater ease. Instead of writing your DAGs, tasks, and dependencies in a .py file, you can instead specify DAGs and tasks in .yml files, and designate task dependencies within a task's .yml file, as well.

In addition to parsing .yml files, gusty also parses YAML front matter in .ipynb and .Rmd files, allowing you to include Python and R notebook formats in your data pipeline.

gusty works with both Airflow 1.x and Airflow 2.x.

Hello World

Tasks

Instead of importing and calling a BashOperator directly, you can specify the operator and the command parameter (which is a required field for Airflow's BashOperator) in a .yml:

operator: airflow.operators.bash.BashOperator
bash_command: echo hello world

gusty takes the above .yml and turns it into a task based on its file name. If this file was called hello_world.yml, the resulting task would show up in your DAG as hello_world.

You can also set dependencies between jobs in .yml as well. Here is another task, goodbye_world. that depends on hello_world.

operator: airflow.operators.bash.BashOperator
bash_command: echo goodbye world
dependencies:
  - hello_world

This will automatically set the goodbye_world task downstream of the hello_world task.

External dependencies can also be set using the format:

external_dependencies:
  - dag: task

To wait for an entire external DAG to run successfully just use dag: all instead.

DAGs

Your DAGs can also be represented as .yml files. Specifically, DAGs should be represented in a file called METADATA.yml. Similar to the basic Airflow tutorial, our DAG might look something like this:

description: "A Gusty version of the DAG described by this Airflow tutorial: https://airflow.apache.org/docs/stable/tutorial.html"
schedule_interval: "1 0 * * *"
default_args:
    owner: airflow
    depends_on_past: False
    start_date: !days_ago 1
    email: airflow@example.com
    email_on_failure: False
    email_on_retry: False
    retries: 1
    retry_delay: !timedelta 'minutes: 5'
#   queue: bash_queue
#   pool: backfill
#   priority_weight: 10
#   end_date: !datetime [2016, 1, 1]
#   wait_for_downstream: false
#   sla: !timedelta 'hours: 2'
#   trigger_rule: all_success

By default, gusty will create a latest_only DAG, where every job in the DAG will only run for the most recent run date, regardless of if a backfill is called. You can disable this behavior by adding latest_only: False to the default_args block above.

GustyDAG

To have gusty generate a DAG, you can use the GustyDAG class, which just needs an (absolute) path to a directory that contains a METADATA.yml for the DAG and .yml files for the tasks. You must also import airflow. An example of the entire .py file that generates your DAG looks like this:

import airflow
from gusty import GustyDAG

dag = GustyDAG('/usr/local/airflow/dags/hello_world')

The resulting DAG will be named after the directory, in this case, hello_world.

Operators

Airflow Operators

gusty will take parameterized .yml for any operator located in airflow.operators and airflow.contrib.operators. In theory, if it's available in these modules, you can use a .yml to define it.

Custom Operators

gusty will also work with any of your custom operators, so long as those operators are located in an operators directory in your designated AIRFLOW_HOME.

In order for your local operators to import properly, they must follow the pattern of having a snake_case file name and a CamelCase operator name, for example the filename of an operator called YourOperator must be called your_operator.py.

Just as the BashOperator above was accessed via with full module path prepended, airflow.operators.bash.BashOperator, your local operators are accessed via the local keyword, e.g. local.YourOperator.

Demo

You can use a containerized demo of gusty and Airflow over at the gusty-demo, and see all of the above in practice.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gusty-0.2.1.tar.gz (8.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gusty-0.2.1-py3-none-any.whl (15.4 kB view details)

Uploaded Python 3

File details

Details for the file gusty-0.2.1.tar.gz.

File metadata

  • Download URL: gusty-0.2.1.tar.gz
  • Upload date:
  • Size: 8.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5

File hashes

Hashes for gusty-0.2.1.tar.gz
Algorithm Hash digest
SHA256 fa28043dd0fc7fd2db4a029e07fbaad08b9774feb8fd00e54a42b4a8152fb9c1
MD5 309ee99886f0f9e3b5fc2560ceeae305
BLAKE2b-256 3f6d51bfe8666b4c0e33c5e1833d093d64692e8dcae994c61a74ca19a90b843d

See more details on using hashes here.

File details

Details for the file gusty-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: gusty-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 15.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.8.5

File hashes

Hashes for gusty-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 4e948844c8b25cc387fc99dc9dd3f839f1394d73ba3a3901842c4d1e2fb4f0b9
MD5 02ac9c54f999be9d1ca3740e9cad7ff6
BLAKE2b-256 c3b73ee820013cf8bb5eaa961ca7eed691b90378a72bd5ece613d72410e5cca9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page