Skip to main content

Spark-agnostic Airflow task group for running jobs on AWS Glue, Databricks, or Wherobots from a single DAG entry point.

Project description

overture-airflow-provider

An Apache Airflow provider exposing a Spark-agnostic task group that runs PySpark or Scala/Spark jobs on AWS Glue, Databricks, or Wherobots Cloud — from a single, unified DAG-level API.

Write your DAG once, target any of the supported engines by switching a single spark_impl argument. Cluster shape, Iceberg catalog wiring, JAR / wheel distribution, and per-platform cluster init are all handled for you.

This project is OSS and intentionally unopinionated: every environment-specific value (S3 buckets, IAM roles, catalog endpoints, package registries) is passed in via typed config dataclasses. No defaults are baked in for any one organization.

Status: 0.1.0 — initial MVP. Unit + mock test coverage only; live-platform E2E tests are tracked as a follow-up.

Install

pip install overture-airflow-provider

Optional extras for platforms that need extra SDKs:

pip install "overture-airflow-provider[databricks]"
pip install "overture-airflow-provider[wherobots]"
pip install "overture-airflow-provider[all]"

Requires Python >=3.11 and Apache Airflow >=2.11.

Supported versions

Provider requirements

Minimum Also tested
Python 3.11 3.12, 3.13
Apache Airflow 2.11 3.x

Spark platform matrix

Pass one of these names as spark_impl_name:

spark_impl_name Platform Spark Scala Python runtime
GLUE_v4 AWS Glue 4.0 3.3.0 2.12 3.10
GLUE_v5 AWS Glue 5.0 3.5.2 2.12 3.11
DATABRICKS_v14 Databricks Runtime 14.3 LTS 3.5.0 2.12 3.10.12
DATABRICKS_v15 Databricks Runtime 15.4 LTS 3.5.0 2.12 3.11.0
WHEROBOTS_v1_5_0 Wherobots Cloud 1.5.0 3.5.0 2.12 3.11

SYNAPSE_v3_3_1 / SYNAPSE_v3_4_1 are defined but not yet active (Azure Synapse support reserved).

Apache Sedona (optional)

Sedona JARs are resolved from Maven Central at runtime. Tested pairings and the minimum Spark version required:

Sedona geotools-wrapper Min Spark
1.5.3 28.2 3.3
1.6.1 28.2 3.3
1.7.0 28.5 3.3
1.7.1 28.5 3.3
1.7.2 28.5 3.3
1.8.1 33.1 3.4 (Spark 3.3 dropped)
1.9.0 33.5 3.4 (Spark 3.3 dropped)

Quick start

from datetime import datetime

from airflow import DAG

from overture_airflow_provider import (
    ArtifactStoreConfig,
    AwsGlueClusterSize,
    GlueConfig,
    IcebergConfig,
    PackageRegistryConfig,
    spark_agnostic_task_group,
)

with DAG(
    dag_id="example_spark_agnostic",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    spark_agnostic_task_group(
        group_id="my_spark_job",
        spark_impl_name="GLUE_v5",
        sedona_version="1.7.0",
        module_name="my_pkg.jobs",
        class_name="MyJob",
        python_packages="my-pkg==1.0.0",
        parameters={"s3_input": "s3://example-bucket/in/", "s3_output": "s3://example-bucket/out/"},
        spark_cluster_size=AwsGlueClusterSize.G_2X.name,
        artifact_store=ArtifactStoreConfig(
            s3_bucket="example-bucket",
            s3_root="spark-agnostic-operator",
        ),
        package_registry=PackageRegistryConfig(
            domain="my-domain",
            domain_owner="123456789012",
            repository="my-pypi",
            region="us-east-1",
        ),
        glue_config=GlueConfig(iam_role_name="AWSGlueServiceRole"),
        iceberg_config=IcebergConfig(spark_config="{}"),
    )

Switching to Databricks or Wherobots is just a different spark_impl_name plus the corresponding DatabricksConfig / WherobotsConfig dataclass — the surrounding DAG code does not change.

See examples/example_dag.py for a runnable DAG that targets all three platforms.

See SPEC.md for the full architecture.

Local rendering (testing without Airflow)

The overture_airflow_provider.render module produces the exact platform submission payload that the task group would emit, without importing or executing any Airflow operators. Use it to drive real cloud resources from the CLI, or to snapshot-test payload shape in CI.

# Render the payload to stdout as JSON.
uv run python -m overture_airflow_provider.render \
    --spark-impl GLUE_v5 --module-name my_module --class-name MyJob

# Render to a directory and emit an executable cli.sh.
uv run python -m overture_airflow_provider.render \
    --spark-impl GLUE_v5 --module-name my_module --class-name MyJob \
    --out ./rendered/

bash ./rendered/cli.sh   # invokes aws glue create-job / start-job-run

You can also drive it programmatically:

from overture_airflow_provider import render_spark_job

result = render_spark_job(
    spark_impl_name="DATABRICKS_v15",
    module_name="my_module",
    class_name="MyJob",
    parameters={"date": "2024-01-01"},
)
print(result.submit_payload)        # equivalent to `databricks jobs submit --json`
print(result.operator_kwargs)       # what the Airflow operator would receive
result.write_to("./out/")           # dump JSON payloads + cli.sh

Pass pre_resolved_package_info= / pre_resolved_jar_info= with real S3 URIs from a previous download_python_packages_* / download_jars_* run if you want to skip the s3://.../REPLACE-ME.whl placeholders.

Development

uv sync --all-extras --group dev
uv run pytest -v
uv run ruff check .
uv run ruff format --check .

Airflow version: supports Airflow 2.11.x and 3.x via a tiny compat shim (_airflow_compat.py) that re-exports DAG, task, task_group, and BaseHook from whichever location exists on the installed Airflow. When dropping 2.x support, simplify the shim to just the airflow.sdk imports (or inline them).

Windows note: Apache Airflow does not officially support Windows (warning emitted at import time). Tests, lint, and the render module all work, but production deployments should run on Linux or macOS.

See CONTRIBUTING.md.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

overture_airflow_provider-0.1.0.tar.gz (54.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

overture_airflow_provider-0.1.0-py3-none-any.whl (68.6 kB view details)

Uploaded Python 3

File details

Details for the file overture_airflow_provider-0.1.0.tar.gz.

File metadata

File hashes

Hashes for overture_airflow_provider-0.1.0.tar.gz
Algorithm Hash digest
SHA256 eca4d9b84b8f034a576cd1e404cb075bdfa08676cfc26c784e3455f0270f82f5
MD5 c8eccae600d9a315d733a4cecf9ace27
BLAKE2b-256 c46a03d6f6f1db69f092a6b22aec8d917fc04ed12eabd165eabe94e4a6711ec9

See more details on using hashes here.

Provenance

The following attestation bundles were made for overture_airflow_provider-0.1.0.tar.gz:

Publisher: publish-pypi.yml on OvertureMaps/overture-airflow-provider

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file overture_airflow_provider-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for overture_airflow_provider-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f1f947f85165e8ddd270a1bb16f68dc17e6922c6f79aae5ce3e6b15ae3adc57a
MD5 c24ddb94f43532f6f3167d634f55b36e
BLAKE2b-256 637371e8e8c8800e77df9a3667cf4b3942d0969ba1ae06092944d94421bfaf7d

See more details on using hashes here.

Provenance

The following attestation bundles were made for overture_airflow_provider-0.1.0-py3-none-any.whl:

Publisher: publish-pypi.yml on OvertureMaps/overture-airflow-provider

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page