# Isolation Provider

Runtime Operator Isolation in Airflow.

Created with ❤️ by the CSE Team @ Astronomer
## Summary

The Isolation Provider provides the `IsolatedOperator` and the `isolationctl` CLI. It provides the capacity to run any Airflow Operator in an isolated fashion.
## Why use `IsolatedOperator`?

- Run a different version of an underlying library between separate teams on the same Airflow instance, or even between separate tasks in the same DAG
- Run an entirely separate version of Python for a task
- Keep "heavy" dependencies separate from Airflow
- Run an Airflow task in a completely separate environment - or on a server all the way across the world
- Run a task "safely" - separate from the Airflow instance
- Run a task whose dependencies would otherwise conflict with Airflow's, more easily
- Do all of the above while retaining unmodified access to (almost) all of "normal" Airflow - operators, XComs, logs, deferring, callbacks
## What does `isolationctl` provide?

- The `isolationctl` CLI gives you an easy way to manage "environments" for your `IsolatedOperator`
## When shouldn't you use `IsolatedOperator`?

- If you can use un-isolated Airflow Operators, you still should use un-isolated Airflow Operators.
- 'Talking back' to Airflow is no longer possible from an `IsolatedOperator`. You cannot call `Variable.set` within an `IsolatedOperator(operator=PythonOperator)`, nor can you query the Airflow Metadata Database.
## Quickstart

This quickstart utilizes a local Kubernetes cluster and a local image registry to host isolated environments.

Pre-requisites:

- Docker is installed
- A local Kubernetes cluster (e.g. Docker Desktop, Minikube, etc.) is running
- The `astro` CLI is installed

### Steps
1. Download and install the `isolationctl` CLI:

   ```shell
   pip install apache-airflow-providers-isolation[cli]
   ```
2. Set up the project:

   ```shell
   isolationctl init --example --local --local-registry --astro --git --dependency
   ```

   - `--example` adds `environments/example`
   - `--local` uses `kubectl config view` to add a `KUBERNETES_DEFAULT` Airflow Connection in `.env`
   - `--local-registry` runs a Docker image registry at `localhost:5000`
   - `--astro` runs `astro dev init`
   - `--git` runs `git init`
   - `--dependency` adds `apache-airflow-providers-isolation[kubernetes]` to `requirements.txt`
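The `KUBERNETES_DEFAULT` connection written by `--local` takes the form of a standard `AIRFLOW_CONN_*` entry in `.env`. The value below is a hypothetical sketch only; the real one is generated from your local kube config and will differ:

```shell
# Hypothetical example - the generated value depends on your local cluster
AIRFLOW_CONN_KUBERNETES_DEFAULT='kubernetes://?extra__kubernetes__in_cluster=False'
```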
3. Add an 'older' version of Pandas to `environments/example/requirements.txt`, and `build-essential` to `environments/example/packages.txt`:

   ```shell
   echo "\npandas==1.4.2" >> environments/example/requirements.txt
   echo "\nbuild-essential" >> environments/example/packages.txt
   ```
4. Add the HTTP provider to `environments/example/requirements.txt`:

   ```shell
   pushd environments/example/
   astro registry provider add http
   popd
   ```
5. Build the `example` environment and deploy it to the local registry:

   ```shell
   isolationctl deploy --local-registry
   ```
6. Add the Kubernetes provider to the Astro project (not required - it is a transitive dependency - but it is always good to be explicit):

   ```shell
   astro registry provider add kubernetes
   ```
7. Disable OpenLineage locally, just to clean up the logging for the example, then start the Airflow project:

   ```shell
   echo "\nOPENLINEAGE_DISABLED=true" >> .env
   astro dev start --no-browser
   ```
8. Run the example DAG:

   ```shell
   astro dev run dags test isolation_provider_example_dag
   ```
9. 🎉🎉🎉
## Installation

```shell
pip install apache-airflow-providers-isolation[kubernetes]
```

### CLI

To install the `isolationctl` CLI:

```shell
pip install apache-airflow-providers-isolation[cli]
```
## Usage

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

from isolation.operators.isolation import IsolatedOperator

with DAG(
    "isolation_provider_example_dag",
    schedule=None,
    start_date=datetime(1970, 1, 1),
    params={"hi": "hello"},  # illustrative value for the templated param below
):
    IsolatedOperator(
        task_id="echo_with_bash_operator",
        operator=BashOperator,
        environment="example",
        bash_command="echo {{ params.hi }}",
    )
```
## How

### What it is

The `IsolatedOperator` is a wrapper on top of an underlying operator, such as the `KubernetesPodOperator` (or potentially other operators), that runs some `OtherOperator` in an isolated environment, without any need to know or understand how the operator underlying the `IsolatedOperator` works - beyond what is absolutely required to run it. The provider also ships the `isolationctl` CLI, which aids in the creation and deployment of the isolated environments these tasks run in.

The steps that the `IsolatedOperator` takes, in more detail:
- Airflow initializes an underlying base operator (such as the `KubernetesPodOperator`) based on the arguments to `IsolatedOperator.__init__`
- Some amount of initial state (e.g. the Airflow context, Connections, Variables) is provided to the isolated environment via special ENV vars
- The underlying operator (such as the `KubernetesPodOperator`) executes as normal
- The isolated environment runs within the underlying operator and is bootstrapped via the `PostIsolationHook`:
  - State is re-established via the passed-in ENV vars
  - In an otherwise un-initialized Airflow, the `OtherOperator` is executed via an internal hook
  - This isolated environment's Airflow has no knowledge of the parent Airflow that launched it; it should have no access to the parent's Metadata Database, and it should not be able to communicate back to the parent Airflow

Most Airflow functionality should work out-of-the-box, simply due to the reliance on the underlying operators to do most of the "heavy lifting" - e.g. XComs and logs.
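The state-passing described above can be sketched in plain Python. The ENV var name and JSON layout below are assumptions for illustration only, not the provider's actual internal scheme:

```python
import json

# Hypothetical env var name - the provider's real scheme is internal
STATE_VAR = "ISOLATED_STATE_JSON"


def pack_state(connections: dict, variables: dict) -> dict:
    """Parent side: serialize a subset of Airflow state into a single env var."""
    return {STATE_VAR: json.dumps({"connections": connections, "variables": variables})}


def unpack_state(environ: dict) -> tuple:
    """Isolated side: re-establish state from the env var before the task runs."""
    state = json.loads(environ[STATE_VAR])
    return state["connections"], state["variables"]


env = pack_state({"my_conn": "postgres://host/db"}, {"my_var": "42"})
connections, variables = unpack_state(env)
print(connections["my_conn"])  # postgres://host/db
```

Because the child process only ever sees these serialized copies, it has no handle on the parent's Metadata Database, which is the isolation property described above.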
### What it isn't

- If you can use un-isolated Airflow Operators, you still should use un-isolated Airflow Operators. This won't make an operator that would run normally any easier to run.
- Anything that requires communicating back to the parent Airflow at runtime is unsupported or impossible - e.g. `Variable.set(...)`
- Anything that requires querying or modifying the parent Airflow's state at runtime is unsupported or impossible. Examples include `@provide_session`, `TaskInstance.query(...)`, `ExternalTaskSensor`, and `TriggerDagRunOperator`
- It's possible that other, less-traditional parts of Airflow may not yet be supported, due to development effort - e.g. `@task` annotations or Airflow 2.7 Setups and Teardowns - depending on how precisely they are invoked
- It is possible that things like `on_failure_callback`s or lineage data may not work, depending on how exactly they are invoked - but if these things work via the underlying operator and are set on the underlying operator, then they should work here as well
## Requirements

### CLI

- Python `>=3.8,<3.12`
- You must have write access to a Container Registry; read more at CLI Requirements

### Host Airflow

- Python `>=3.8,<3.12`
- Airflow `>=2.3`
- Must have access to create containers in the target Kubernetes environment

### Target Isolated Environment

- Python `>=3.8,<3.12`
- Airflow `>=2.3`
  - Note: Airflow 2.2 doesn't have `test_mode` for executing tasks, which is currently utilized to bypass setup of the isolated environment's Airflow
- Must be a Kubernetes environment
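The shared Python requirement above (`>=3.8,<3.12`) can be expressed as a small, purely illustrative sanity-check helper:

```python
import sys


def python_supported(version=None) -> bool:
    """Return True if (major, minor) falls in the documented range >=3.8,<3.12."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 8) <= (major, minor) < (3, 12)


print(python_supported((3, 10, 0)))  # True
print(python_supported((3, 12, 0)))  # False
```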
## File details

Details for the file `apache-airflow-providers-isolation-1.0.1.tar.gz`.

- Size: 28.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5

| Algorithm | Hash digest |
|---|---|
| SHA256 | fcd647b19a9194e56adb9f1958d306f4e051b51191239a109a24cc10a961170d |
| MD5 | 5fff65aa96834155a9c59da40076c704 |
| BLAKE2b-256 | 184e4b2dc6fbfa3764c7d8079e168aaae9a13986833f468d3d373646e250650c |
Details for the file `apache_airflow_providers_isolation-1.0.1-py3-none-any.whl`.

- Size: 32.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5

| Algorithm | Hash digest |
|---|---|
| SHA256 | ec65cc850cd37da4d03c853d2e843115dcb37b0f5e0737b57632612ce2feea07 |
| MD5 | 1077672706e0392f0e5d00507cb7aeca |
| BLAKE2b-256 | 719589b045ba00ca01d3d570d3142c2ea141c2cbce31ac764427686481f9e36f |
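To check a downloaded file against the published digests, a standard-library sketch (the expected value is the sdist's SHA256 from the table above):

```python
import hashlib

# Published SHA256 for apache-airflow-providers-isolation-1.0.1.tar.gz
EXPECTED_SHA256 = "fcd647b19a9194e56adb9f1958d306f4e051b51191239a109a24cc10a961170d"


def sha256_of(path: str) -> str:
    """Stream a file through SHA256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Run next to the downloaded sdist:
# assert sha256_of("apache-airflow-providers-isolation-1.0.1.tar.gz") == EXPECTED_SHA256
```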