
Easily create and use Python Virtualenvs in Apache Airflow



Apache Airflow virtual envs made easy

Makes it easy to run Airflow tasks in isolated Python virtual environments (venvs) that are built as part of your Dockerfile. Maintained with ❤️ by Astronomer.

Let's say you want to be able to run an Airflow task against Snowflake's Snowpark -- which requires Python 3.8.

With the addition of the ExternalPythonOperator in Airflow 2.4 this is possible, but managing the build process to get clean, quick Docker builds can take a lot of plumbing.

This repo provides a packaged solution that plays nicely with Docker image caching.

Synopsis

Create a requirements.txt file

For example, snowpark-requirements.txt

snowflake-snowpark-python[pandas]

# To get credentials out of a connection we need these in the venv too sadly
apache-airflow
psycopg2-binary
apache-airflow-providers-snowflake

Use our custom Docker build frontend

# syntax=quay.io/astronomer/airflow-extensions:v1

FROM quay.io/astronomer/astro-runtime:7.2.0-base

PYENV 3.8 snowpark snowpark-requirements.txt

Note: That first # syntax= comment is important, don't leave it out!

Read more about the new PYENV instruction

Use it in a DAG

from __future__ import annotations

import sys

from airflow import DAG
from airflow.decorators import task
from airflow.utils.timezone import datetime

with DAG(
    dag_id="astro_snowpark",
    schedule=None,
    start_date=datetime(2022, 1, 1),
    catchup=False,
    tags=["example"],
) as dag:

    @task
    def print_python():
        print(f"My python version is {sys.version}")

    @task.venv("snowpark")
    def snowpark_task():
        from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
        from snowflake.snowpark import Session

        print(f"My python version is {sys.version}")

        hook = SnowflakeHook("snowflake_default")
        conn_params = hook._get_conn_params()
        session = Session.builder.configs(conn_params).create()
        tables = session.sql("show tables").collect()
        print(tables)

        df_table = session.table("sample_product_data")
        print(df_table.show())
        return df_table.to_pandas()

    @task
    def analyze(df):
        print(f"My python version is {sys.version}")
        print(df.head(2))

    print_python() >> analyze(snowpark_task())

Requirements

This needs Apache Airflow 2.4+ for the ExternalPythonOperator to work.

Requirements for building Docker images

This needs the BuildKit backend for Docker.

It is enabled by default for Docker Desktop users; Linux users will need to enable it, either for a single build or by default.

To enable BuildKit for a single build, set the DOCKER_BUILDKIT environment variable when running the docker build command:

DOCKER_BUILDKIT=1 docker build .

To enable BuildKit by default, set the buildkit feature to true in the daemon configuration file /etc/docker/daemon.json. If the file doesn't exist, create it with the following content:

{
  "features": {
    "buildkit" : true
  }
}

Then restart the Docker daemon.
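If a daemon.json already exists with other settings, the buildkit flag has to be merged in rather than overwriting the file. The following is a minimal sketch of that merge in Python; the helper name `enable_buildkit` is hypothetical and not part of this project, and it only transforms the JSON text (writing the file and restarting the daemon are left to you).

```python
import json


def enable_buildkit(daemon_json_text: str) -> str:
    """Return daemon.json content with features.buildkit set to true.

    Hypothetical helper: existing keys are preserved, and an empty
    string is treated as "no daemon.json exists yet".
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    features = config.setdefault("features", {})
    features["buildkit"] = True
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Example: an existing config that only sets the log driver.
    print(enable_buildkit('{"log-driver": "json-file"}'))
```

The output can be written back to /etc/docker/daemon.json (usually as root) before restarting the daemon.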

The syntax extension also currently expects to find a packages.txt and requirements.txt in the Docker context directory (these can be empty by default).
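Since both files may be empty, one way to satisfy this before a build is to create whichever is missing. A small sketch, assuming the build context is a local directory; the function name `ensure_context_files` is made up for illustration:

```python
from pathlib import Path


def ensure_context_files(context_dir, names=("packages.txt", "requirements.txt")):
    """Create any missing (empty) files in the Docker context directory.

    Existing files are left untouched. Returns the names that were created.
    """
    created = []
    for name in names:
        path = Path(context_dir) / name
        if not path.exists():
            path.touch()  # creates an empty file
            created.append(name)
    return created


if __name__ == "__main__":
    # Run from the directory that contains your Dockerfile.
    print(ensure_context_files("."))
```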

Reference

PYENV Docker instruction

The PYENV instruction adds a Python virtual environment, running on the specified Python version, to the Docker image, and optionally installs packages from a requirements file.

It has the following syntax:

PYENV <python-version> <venv-name> [<reqs-file>]

The requirements file is optional, so one can install a bare Python environment with something like:

PYENV 3.10 venv1
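To make the argument order concrete, here is a rough Python sketch that splits a PYENV line into its parts. It is only an illustration of the syntax above, not how the build frontend itself parses Dockerfiles; `PyenvInstruction` and `parse_pyenv` are hypothetical names, and the sketch assumes simple whitespace-separated arguments.

```python
from typing import NamedTuple, Optional


class PyenvInstruction(NamedTuple):
    python_version: str
    venv_name: str
    reqs_file: Optional[str]  # None when no requirements file is given


def parse_pyenv(line: str) -> PyenvInstruction:
    """Parse 'PYENV <python-version> <venv-name> [<reqs-file>]'."""
    parts = line.split()
    if not parts or parts[0] != "PYENV":
        raise ValueError(f"not a PYENV instruction: {line!r}")
    args = parts[1:]
    if len(args) not in (2, 3):
        raise ValueError("expected: PYENV <python-version> <venv-name> [<reqs-file>]")
    reqs_file = args[2] if len(args) == 3 else None
    return PyenvInstruction(args[0], args[1], reqs_file)


if __name__ == "__main__":
    print(parse_pyenv("PYENV 3.8 snowpark snowpark-requirements.txt"))
    print(parse_pyenv("PYENV 3.10 venv1"))
```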

@task.venv decorator

TODO! Write the decorator, then fill out docs!

In This Repo

buildkit/

This contains the custom Docker BuildKit frontend (see this blog for details) that adds a new custom command, PYENV, that can be used inside Dockerfiles to install new Python versions and virtual environments with custom dependencies.

provider/

This contains an Apache Airflow provider that provides the @task.venv decorator.

The Gory Details

a.k.a. How do I do this all manually?

The # syntax line tells BuildKit to use our build frontend to process the Dockerfile into instructions.

The example Dockerfile above gets converted into roughly the following instructions:

USER root
COPY --link --from=python:3.8-slim /usr/local/bin/*3.8* /usr/local/bin/
COPY --link --from=python:3.8-slim /usr/local/include/python3.8* /usr/local/include/python3.8
COPY --link --from=python:3.8-slim /usr/local/lib/pkgconfig/*3.8* /usr/local/lib/pkgconfig/
COPY --link --from=python:3.8-slim /usr/local/lib/*3.8*.so* /usr/local/lib/
COPY --link --from=python:3.8-slim /usr/local/lib/python3.8 /usr/local/lib/python3.8
RUN /sbin/ldconfig /usr/local/lib
RUN ln -s /usr/local/include/python3.8 /usr/local/include/python3.8m

USER astro
RUN mkdir -p /home/astro/.venv/snowpark
COPY snowpark-requirements.txt /home/astro/.venv/snowpark/requirements.txt
RUN /usr/local/bin/python3.8 -m venv --system-site-packages /home/astro/.venv/snowpark
ENV ASTRO_PYENV_snowpark /home/astro/.venv/snowpark/bin/python
RUN --mount=type=cache,target=/home/astro/.cache/pip /home/astro/.venv/snowpark/bin/pip --cache-dir=/home/astro/.cache/pip install -r /home/astro/.venv/snowpark/requirements.txt
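To see how the Python version, venv name and requirements file each show up in that listing, the sketch below renders the same pattern from three parameters. It is a hypothetical Python rendering of the listing, not the frontend's actual code. The `astro` user default and the choice to skip the COPY and pip steps when no requirements file is given are assumptions.

```python
def expand_pyenv(version, name, reqs_file=None, user="astro"):
    """Render Dockerfile-style lines mirroring the listing above (illustrative only)."""
    image = f"python:{version}-slim"
    venv = f"/home/{user}/.venv/{name}"
    cache = f"/home/{user}/.cache/pip"
    lines = [
        # Copy the interpreter and its libraries out of the official slim image.
        "USER root",
        f"COPY --link --from={image} /usr/local/bin/*{version}* /usr/local/bin/",
        f"COPY --link --from={image} /usr/local/include/python{version}* /usr/local/include/python{version}",
        f"COPY --link --from={image} /usr/local/lib/pkgconfig/*{version}* /usr/local/lib/pkgconfig/",
        f"COPY --link --from={image} /usr/local/lib/*{version}*.so* /usr/local/lib/",
        f"COPY --link --from={image} /usr/local/lib/python{version} /usr/local/lib/python{version}",
        "RUN /sbin/ldconfig /usr/local/lib",
        f"RUN ln -s /usr/local/include/python{version} /usr/local/include/python{version}m",
        # Create the venv as the unprivileged user.
        f"USER {user}",
        f"RUN mkdir -p {venv}",
    ]
    if reqs_file:
        lines.append(f"COPY {reqs_file} {venv}/requirements.txt")
    lines.append(f"RUN /usr/local/bin/python{version} -m venv --system-site-packages {venv}")
    # The environment variable that the Airflow side reads later.
    lines.append(f"ENV ASTRO_PYENV_{name} {venv}/bin/python")
    if reqs_file:
        lines.append(
            f"RUN --mount=type=cache,target={cache} {venv}/bin/pip "
            f"--cache-dir={cache} install -r {venv}/requirements.txt"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(expand_pyenv("3.8", "snowpark", "snowpark-requirements.txt")))
```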

The final piece of the puzzle is on the Airflow side: the operator looks up the path to the Python interpreter in the created venv using the ASTRO_PYENV_* environment variable:

import os

@task.external_python(python=os.environ["ASTRO_PYENV_snowpark"])
def snowpark_task():
    ...
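The lookup itself is a name-to-path mapping through the environment. A minimal sketch of it follows; `venv_python` is a hypothetical helper rather than the provider's API, and it raises a descriptive error when the variable is missing, which usually means the image was not built with a matching PYENV line.

```python
import os


def venv_python(name, environ=None):
    """Return the interpreter path for a venv created by a PYENV instruction.

    Hypothetical helper: reads ASTRO_PYENV_<name> from the environment
    (or from the mapping passed as `environ`, which is handy for testing).
    """
    env = os.environ if environ is None else environ
    key = f"ASTRO_PYENV_{name}"
    try:
        return env[key]
    except KeyError:
        raise LookupError(
            f"{key} is not set; was the image built with 'PYENV <version> {name}'?"
        ) from None


if __name__ == "__main__":
    fake_env = {"ASTRO_PYENV_snowpark": "/home/astro/.venv/snowpark/bin/python"}
    print(venv_python("snowpark", fake_env))
```

The returned path is what you would pass as the `python` argument of `@task.external_python`.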
