Skip to main content

Runtime Operator Isolation in Airflow

Project description

Astronomer Isolation Provider Logo

Isolation Provider

Runtime Operator Isolation in Airflow.

Created with ❤️ by the CSE Team @ Astronomer

Summary

The Isolation Provider provides the IsolatedOperator and the isolationctl CLI. It provides the capacity to run any Airflow Operator in an isolated fashion.

Why use IsolatedOperator?

  • Run a different version of an underlying library, between separate teams on the same Airflow instance, even between separate tasks in the same DAG
  • Run entirely separate versions of Python for a task
  • Keep "heavy" dependencies separate from Airflow
  • Run an Airflow task in a completely separate environment - or on a server all the way across the world.
  • Run a task "safely" - separate from the Airflow instance
  • Run a task with dependencies that conflict more easily
  • Do all of the above while having unmodified access to (almost) all of 'normal' Airflow - operators, XCOMs, logs, deferring, callbacks.

What does the isolationctl provide?

  • the isolationctl gives you an easy way to manage "environments" for your IsolatedOperator

When shouldn't you use IsolatedOperator?

  • If you can use un-isolated Airflow Operators, you still should use un-isolated Airflow Operators.
  • 'Talking back' to Airflow is no longer possible in an IsolatedOperator. You cannot do Variable.set within an IsolatedOperator(operator=PythonOperator), nor can you query the Airflow Metadata Database.

Quickstart

This quickstart utilizes a local Kubernetes cluster, and a local image registry to host isolated environments.

Pre-requisites:

Steps

  1. Download and install the isolationctl CLI via

    pip install apache-airflow-providers-isolation[cli]
    
  2. Set up the project:

    isolationctl init --example --local --local-registry --astro --git --dependency
    
    • --example adds environments/example
    • --local uses kube config view to add a KUBERNETES_DEFAULT Airflow Connection in .env
    • --local-registry runs a docker image registry at localhost:5000
    • --astro runs astro dev init
    • --git runs git init
    • --dependency adds apache-airflow-providers-isolation[kubernetes] to requirements.txt
  3. Add an 'older' version of Pandas to environments/example/requirements.txt

    echo "\npandas==1.4.2" >> environments/example/requirements.txt
    echo "\nbuild-essential" >> environments/example/packages.txt
    
  4. Add the http provider to environments/example/requirements.txt

    pushd environments/example/
    astro registry provider add http
    popd
    
  5. Build the example environment and deploy it to the local registry

    isolationctl deploy --local-registry
    
  6. Add the Kubernetes Provider to the Astro project (not required - it is a transitive dependency - but always good to be explicit)

    astro registry provider add kubernetes
    
  7. Disable OpenLineage just to clean up the logging, locally, for the example. Then start the Airflow Project.

    echo "\nOPENLINEAGE_DISABLED=true" >> .env
    astro dev start --no-browser
    
  8. Run the example DAG

    astro dev run dags test isolation_provider_example_dag
    
  9. 🎉🎉🎉

Installation

pip install apache-airflow-providers-isolation[kubernetes]

CLI

To install the isolationctl CLI

pip install apache-airflow-providers-isolation[cli]

Usage

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from isolation.operators.isolation import IsolatedOperator

with DAG(
    "isolation_provider_example_dag",
    schedule=None,
    start_date=datetime(1970, 1, 1),
):
    IsolatedOperator(
        task_id="echo_with_bash_operator",
        operator=BashOperator,
        environment="example",
        bash_command="echo {{ params.hi }}",
    )

How

A diagram documenting the flow from IsolatedOperator to KubernetesPodOperator to a pod to PostIsolationHook to OtherOperator

What it is

The IsolatedOperator is a wrapper on top of underlying operator, such as KubernetesPodOperator or potentially other operators, to run some OtherOperator in an isolated environment without any need to know or understand how the operator underlying the IsolatedOperator works - beyond what is absolutely required to run.

It also has the isolationctl CLI which aids in the creation and deployment of isolated environments to run in.

The steps that the IsolatedOperator takes, in more detail:

  1. Airflow initializes an underlying base operator (such as KubernetesPodOperator) based on the IsolatedOperator.__init__
    • Some amount of initial state (e.g. the Airflow, Connections, Variables) will be provided to the Isolated environment via special ENV vars
  2. The underlying operator (such as KubernetesPodOperator) executes as normal
  3. The isolated environment runs within the underlying operator, and is bootstrapped via the PostIsolationHook.
  • State is re-establish via passed in ENV vars
  • In an otherwise un-initialized Airflow, via an internal hok, the OtherOperator is executed
  • This isolated environment's Airflow has no knowledge of the parent Airflow that launched it, it should have no access to the parents' Metadata Database, and it should not be able to communicate back to the parent Airflow.

Most Airflow functionality should work out-of-the-box, simply due to the reliance on the underlying operators to do most of the "heavy lifting" - e.g. XCOMs and Logs

What it isn't

  • If you can use un-isolated Airflow Operators, you still should use un-isolated Airflow Operators. This won't make an operator that would run normally any easier to run.
  • Anything that requires communicating back to the parent Airflow at runtime is unsupported or impossible - e.g. Variable.set(...)
  • Anything that requires querying or modifying the parent Airflow's state at runtime is unsupported or impossible. Examples include @provide_session, or TaskInstance.query(...), or ExternalTaskSensor, or TriggerDagRunOperator
  • It's possible other less-traditional parts of Airflow may not yet be supported, due to development effort - e.g. @task annotations or Airflow 2.7 Setup and Teardowns - depending on how precisely they are invoked.
  • It is possible that things like on_failure_callbacks or lineage data may not work - depending on how exactly they are invoked - but if these things work with via the underlying operator and are set on the underlying operator, then they should work with this.

Requirements

CLI

  • Python ">=3.8,<3.12"
  • You must have write access to a Container Registry, read more at CLI Requirements

Host Airflow

  • Python ">=3.8,<3.12"
  • Airflow >=2.3
  • Must have access to create containers in the target Kubernetes Environment

Target Isolated Environment

  • Python ">=3.8,<3.12"
  • Airflow >=2.3
    • Note: Airflow 2.2 doesn't have test_mode for executing tasks - which is currently utilized to bypass setup of the isolated environment's Airflow
  • Must be a Kubernetes environment

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

apache-airflow-providers-isolation-1.0.1.tar.gz (28.9 kB view hashes)

Uploaded Source

Built Distribution

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page