Easily create and use Python Virtualenvs in Apache Airflow

Apache Airflow virtual envs made easy

Making it easy to run Airflow tasks in isolated Python virtual environments (venvs) built in your Dockerfile. Maintained with ❤️ by Astronomer.

Let's say you want to be able to run an Airflow task against Snowflake's Snowpark -- which requires Python 3.8.

With the addition of the ExternalPythonOperator in Airflow 2.4 this is possible, but managing the build process to get clean, quick Docker builds can take a lot of plumbing.

This repo provides a packaged solution that plays nicely with Docker image caching.

Synopsis

Create a requirements.txt file

For example, snowpark-requirements.txt

snowflake-snowpark-python[pandas]

# To get credentials out of a connection we need these in the venv too, sadly
apache-airflow
psycopg2-binary
apache-airflow-providers-snowflake

Use our custom Docker build frontend

# syntax=quay.io/astronomer/airflow-extensions:v1

FROM quay.io/astronomer/astro-runtime:7.2.0-base

PYENV 3.8 snowpark snowpark-requirements.txt

Note: that first # syntax= comment is important -- don't leave it out!

Read more about the new PYENV instruction

Use it in a DAG

from __future__ import annotations

import sys

from airflow import DAG
from airflow.decorators import task
from airflow.utils.timezone import datetime

with DAG(
    dag_id="astro_snowpark",
    schedule=None,
    start_date=datetime(2022, 1, 1),
    catchup=False,
    tags=["example"],
) as dag:

    @task
    def print_python():
        print(f"My python version is {sys.version}")

    @task.venv("snowpark")
    def snowpark_task():
        # This function runs under the venv's own interpreter, so all
        # imports (including sys) must happen inside the function body.
        import sys

        from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
        from snowflake.snowpark import Session

        print(f"My python version is {sys.version}")

        hook = SnowflakeHook("snowflake_default")
        conn_params = hook._get_conn_params()
        session = Session.builder.configs(conn_params).create()
        tables = session.sql("show tables").collect()
        print(tables)

        df_table = session.table("sample_product_data")
        df_table.show()  # show() prints the table itself and returns None
        return df_table.to_pandas()

    @task
    def analyze(df):
        print(f"My python version is {sys.version}")
        print(df.head(2))

    print_python() >> analyze(snowpark_task())

Requirements

This needs Apache Airflow 2.4+ for the ExternalPythonOperator to work.
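
To install the provider alongside Airflow, something like the following works (astro-provider-venv is this project's distribution name on PyPI):

pip install 'apache-airflow>=2.4' astro-provider-venv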

Requirements for building Docker images

This needs the BuildKit backend for Docker.

It is enabled by default for Docker Desktop users; Linux users will need to enable it:

To set the BuildKit environment variable when running the docker build command, run:

DOCKER_BUILDKIT=1 docker build .

To enable BuildKit by default, set the buildkit feature to true in the daemon configuration at /etc/docker/daemon.json. If the daemon.json file doesn't exist, create it and add the following:

{
  "features": {
    "buildkit" : true
  }
}

Then restart the Docker daemon.

The syntax extension also currently expects to find a packages.txt and a requirements.txt in the Docker context directory (these can be empty).
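
If your project doesn't have them yet, empty placeholder files are enough:

touch packages.txt requirements.txt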

Reference

PYENV Docker instruction

The PYENV instruction adds a Python virtual environment, running the specified Python version, to the Docker image, and optionally installs packages from a requirements file.

It has the following syntax:

PYENV <python-version> <venv-name> [<reqs-file>]

The requirements file is optional, so one can install a bare Python environment with something like:

PYENV 3.10 venv1
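
And with a requirements file, as in the synopsis above:

PYENV 3.8 snowpark snowpark-requirements.txt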

@task.venv decorator

TODO: fill out docs for this decorator!
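
Until those docs are written, here is a minimal sketch of the usage shown in the DAG above (the "snowpark" name is an assumption carried over from the synopsis; it must match the <venv-name> given to PYENV, and the provider must be installed so that @task.venv is registered):

from airflow.decorators import task

@task.venv("snowpark")
def my_venv_task():
    # The decorated function executes under the venv's own interpreter,
    # so all imports must live inside the function body.
    import sys
    print(sys.version)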

In This Repo

buildkit/

This contains the custom Docker BuildKit frontend (see this blog post for details) that adds the new PYENV instruction, which can be used inside Dockerfiles to install new Python versions and virtual environments with custom dependencies.

provider/

This contains an Apache Airflow provider that provides the @task.venv decorator.

The Gory Details

a.k.a. How do I do this all manually?

The # syntax line tells BuildKit to use our build frontend to process the Dockerfile into instructions.

The example Dockerfile above gets converted into roughly the following instructions:

USER root
COPY --link --from=python:3.8-slim /usr/local/bin/*3.8* /usr/local/bin/
COPY --link --from=python:3.8-slim /usr/local/include/python3.8* /usr/local/include/python3.8
COPY --link --from=python:3.8-slim /usr/local/lib/pkgconfig/*3.8* /usr/local/lib/pkgconfig/
COPY --link --from=python:3.8-slim /usr/local/lib/*3.8*.so* /usr/local/lib/
COPY --link --from=python:3.8-slim /usr/local/lib/python3.8 /usr/local/lib/python3.8
RUN /sbin/ldconfig /usr/local/lib
RUN ln -s /usr/local/include/python3.8 /usr/local/include/python3.8m

USER astro
RUN mkdir -p /home/astro/.venv/snowpark
COPY reqs/venv1.txt /home/astro/.venv/snowpark/requirements.txt
RUN /usr/local/bin/python3.8 -m venv --system-site-packages /home/astro/.venv/snowpark
ENV ASTRO_PYENV_snowpark /home/astro/.venv/snowpark/bin/python
RUN --mount=type=cache,target=/home/astro/.cache/pip /home/astro/.venv/snowpark/bin/pip --cache-dir=/home/astro/.cache/pip install -r /home/astro/.venv/snowpark/requirements.txt

The final piece of the puzzle is for the Airflow operator to look up the path to python in the created venv via the ASTRO_PYENV_* environment variable:

@task.external_python(python=os.environ["ASTRO_PYENV_snowpark"])
def snowpark_task():
    ...
