Spark-agnostic Airflow task group for running jobs on AWS Glue, Databricks, or Wherobots from a single DAG entry point.

overture-airflow-provider

An Apache Airflow provider exposing a Spark-agnostic task group that runs PySpark or Scala/Spark jobs on AWS Glue, Databricks, or Wherobots Cloud — from a single, unified DAG-level API.

Write your DAG once, then target any of the supported engines by switching the spark_impl_name argument and supplying the matching platform config dataclass. Cluster shape, Iceberg catalog wiring, JAR / wheel distribution, and per-platform cluster init are all handled for you.

This project is OSS and intentionally unopinionated: every environment-specific value (S3 buckets, IAM roles, catalog endpoints, package registries) is passed in via typed config dataclasses. No defaults are baked in for any one organization.

Status: 0.1.0 — initial MVP. Unit + mock test coverage only; live-platform E2E tests are tracked as a follow-up.

Install

pip install overture-airflow-provider

Optional extras for platforms that need extra SDKs:

pip install "overture-airflow-provider[databricks]"
pip install "overture-airflow-provider[wherobots]"
pip install "overture-airflow-provider[all]"

Requires Python >=3.11 and Apache Airflow >=2.11.

Supported versions

Provider requirements

Requirement     Minimum  Also tested
Python          3.11     3.12, 3.13
Apache Airflow  2.11     3.x

Spark platform matrix

Pass one of these names as spark_impl_name:

spark_impl_name   Platform                     Spark  Scala  Python runtime
GLUE_v4           AWS Glue 4.0                 3.3.0  2.12   3.10
GLUE_v5           AWS Glue 5.0                 3.5.2  2.12   3.11
DATABRICKS_v14    Databricks Runtime 14.3 LTS  3.5.0  2.12   3.10.12
DATABRICKS_v15    Databricks Runtime 15.4 LTS  3.5.0  2.12   3.11.0
WHEROBOTS_v1_5_0  Wherobots Cloud 1.5.0        3.5.0  2.12   3.11

SYNAPSE_v3_3_1 / SYNAPSE_v3_4_1 are defined but not yet active (Azure Synapse support reserved).
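
The matrix can also be kept as data, which is handy for validating a spark_impl_name before it reaches a DAG. The following is a minimal sketch that only mirrors the active rows of the table; SparkImpl, SPARK_IMPLS, and lookup_impl are hypothetical names chosen for illustration and are not part of the provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkImpl:
    """One row of the platform matrix (hypothetical type, for illustration only)."""

    platform: str
    spark: str
    scala: str
    python: str


# Values copied from the platform matrix in this README.
SPARK_IMPLS = {
    "GLUE_v4": SparkImpl("AWS Glue 4.0", "3.3.0", "2.12", "3.10"),
    "GLUE_v5": SparkImpl("AWS Glue 5.0", "3.5.2", "2.12", "3.11"),
    "DATABRICKS_v14": SparkImpl("Databricks Runtime 14.3 LTS", "3.5.0", "2.12", "3.10.12"),
    "DATABRICKS_v15": SparkImpl("Databricks Runtime 15.4 LTS", "3.5.0", "2.12", "3.11.0"),
    "WHEROBOTS_v1_5_0": SparkImpl("Wherobots Cloud 1.5.0", "3.5.0", "2.12", "3.11"),
}


def lookup_impl(name: str) -> SparkImpl:
    """Return the matrix row for `name`, or raise ValueError listing the valid names."""
    try:
        return SPARK_IMPLS[name]
    except KeyError:
        valid = ", ".join(sorted(SPARK_IMPLS))
        raise ValueError(
            f"unknown spark_impl_name {name!r}; expected one of: {valid}"
        ) from None
```

The reserved SYNAPSE_* names are deliberately left out of the sketch, so looking one up raises ValueError.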

Apache Sedona (optional)

Sedona JARs are resolved from Maven Central at runtime. Tested pairings and the minimum Spark version required:

Sedona  geotools-wrapper  Min Spark
1.5.3   28.2              3.3
1.6.1   28.2              3.3
1.7.0   28.5              3.3
1.7.1   28.5              3.3
1.7.2   28.5              3.3
1.8.1   33.1              3.4 (Spark 3.3 dropped)
1.9.0   33.5              3.4 (Spark 3.3 dropped)
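
To make the minimum-Spark column concrete, here is a small sketch that checks whether a Sedona version from the table is usable on a given Spark version. SEDONA_PAIRINGS and sedona_supports_spark are hypothetical names for illustration; the provider may perform this check differently, or not at all.

```python
# Tested pairings from the table above: Sedona -> (geotools-wrapper, minimum Spark).
SEDONA_PAIRINGS = {
    "1.5.3": ("28.2", "3.3"),
    "1.6.1": ("28.2", "3.3"),
    "1.7.0": ("28.5", "3.3"),
    "1.7.1": ("28.5", "3.3"),
    "1.7.2": ("28.5", "3.3"),
    "1.8.1": ("33.1", "3.4"),
    "1.9.0": ("33.5", "3.4"),
}


def _version_tuple(version: str) -> tuple[int, ...]:
    """Turn '3.5.2' into (3, 5, 2) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def sedona_supports_spark(sedona_version: str, spark_version: str) -> bool:
    """Return True if the Spark major.minor version meets the table's minimum."""
    _, min_spark = SEDONA_PAIRINGS[sedona_version]
    return _version_tuple(spark_version)[:2] >= _version_tuple(min_spark)
```

For example, GLUE_v4 runs Spark 3.3.0, so under this table it can be paired with Sedona up to 1.7.2 but not with 1.8.1 or later.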

Quick start

from datetime import datetime

from airflow import DAG

from overture_airflow_provider import (
    ArtifactStoreConfig,
    AwsGlueClusterSize,
    GlueConfig,
    IcebergConfig,
    PackageRegistryConfig,
    spark_agnostic_task_group,
)

with DAG(
    dag_id="example_spark_agnostic",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    spark_agnostic_task_group(
        group_id="my_spark_job",
        spark_impl_name="GLUE_v5",
        sedona_version="1.7.0",
        module_name="my_pkg.jobs",
        class_name="MyJob",
        python_packages="my-pkg==1.0.0",
        parameters={"s3_input": "s3://example-bucket/in/", "s3_output": "s3://example-bucket/out/"},
        spark_cluster_size=AwsGlueClusterSize.G_2X.name,
        artifact_store=ArtifactStoreConfig(
            s3_bucket="example-bucket",
            s3_root="spark-agnostic-operator",
        ),
        package_registry=PackageRegistryConfig(
            domain="my-domain",
            domain_owner="123456789012",
            repository="my-pypi",
            region="us-east-1",
        ),
        glue_config=GlueConfig(iam_role_name="AWSGlueServiceRole"),
        iceberg_config=IcebergConfig(spark_config="{}"),
    )

Switching to Databricks or Wherobots is just a different spark_impl_name plus the corresponding DatabricksConfig / WherobotsConfig dataclass — the surrounding DAG code does not change.
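
As a rough illustration of which config dataclass goes with which name, the sketch below maps a spark_impl_name to the config class named in this README. It assumes the platform is always the text before the first underscore, which holds for the names in the matrix; CONFIG_FOR_PLATFORM and config_class_for are hypothetical helpers, not provider functions.

```python
# Platform prefix -> name of the config dataclass mentioned in this README.
CONFIG_FOR_PLATFORM = {
    "GLUE": "GlueConfig",
    "DATABRICKS": "DatabricksConfig",
    "WHEROBOTS": "WherobotsConfig",
}


def config_class_for(spark_impl_name: str) -> str:
    """Return the config dataclass name that pairs with `spark_impl_name`."""
    # Assumption: the platform is the part before the first underscore.
    platform = spark_impl_name.split("_", 1)[0]
    try:
        return CONFIG_FOR_PLATFORM[platform]
    except KeyError:
        raise ValueError(
            f"no config class known for platform {platform!r}"
        ) from None
```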

See examples/example_dag.py for a runnable DAG that targets all three platforms.

See SPEC.md for the full architecture.

Local rendering (testing without Airflow)

The overture_airflow_provider.render module produces the exact platform submission payload that the task group would emit, without importing or executing any Airflow operators. Use it to drive real cloud resources from the CLI, or to snapshot-test payload shape in CI.

# Render the payload to stdout as JSON.
uv run python -m overture_airflow_provider.render \
    --spark-impl GLUE_v5 --module-name my_module --class-name MyJob

# Render to a directory and emit an executable cli.sh.
uv run python -m overture_airflow_provider.render \
    --spark-impl GLUE_v5 --module-name my_module --class-name MyJob \
    --out ./rendered/

bash ./rendered/cli.sh   # invokes aws glue create-job / start-job-run

You can also drive it programmatically:

from overture_airflow_provider import render_spark_job

result = render_spark_job(
    spark_impl_name="DATABRICKS_v15",
    module_name="my_module",
    class_name="MyJob",
    parameters={"date": "2024-01-01"},
)
print(result.submit_payload)        # equivalent to `databricks jobs submit --json`
print(result.operator_kwargs)       # what the Airflow operator would receive
result.write_to("./out/")           # dump JSON payloads + cli.sh

Pass pre_resolved_package_info= / pre_resolved_jar_info= with real S3 URIs from a previous download_python_packages_* / download_jars_* run if you want to skip the s3://.../REPLACE-ME.whl placeholders.
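
For the CI snapshot-testing use case, one possible approach is to serialize the rendered payload deterministically and compare it with a stored file. The sketch below uses only the standard library and a stand-in dictionary with made-up keys; in real use the payload would come from result.submit_payload, which is assumed here to be JSON-serializable. canonical_json and check_snapshot are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path


def canonical_json(payload: dict) -> str:
    """Serialize with sorted keys so the snapshot text is stable across runs."""
    return json.dumps(payload, indent=2, sort_keys=True) + "\n"


def check_snapshot(payload: dict, snapshot_path: Path) -> bool:
    """Write the snapshot on first use; afterwards report whether payload matches it."""
    rendered = canonical_json(payload)
    if not snapshot_path.exists():
        snapshot_path.write_text(rendered, encoding="utf-8")
        return True
    return snapshot_path.read_text(encoding="utf-8") == rendered


# Stand-in payload with made-up keys; NOT the real Databricks payload shape.
example_payload = {"run_name": "my_spark_job", "parameters": {"date": "2024-01-01"}}

with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "databricks_v15.json"
    first = check_snapshot(example_payload, snapshot)  # no file yet: writes it
    same = check_snapshot(example_payload, snapshot)  # identical payload: matches
    changed = check_snapshot({**example_payload, "run_name": "renamed"}, snapshot)
```

In a test suite the snapshot file would live in the repository rather than a temporary directory, so that a change in payload shape shows up as a failing comparison and a reviewable diff.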

Development

uv sync --all-extras --group dev
uv run pytest -v
uv run ruff check .
uv run ruff format --check .

Airflow version: supports Airflow 2.11.x and 3.x via a tiny compat shim (_airflow_compat.py) that re-exports DAG, task, task_group, and BaseHook from whichever location exists on the installed Airflow. When dropping 2.x support, simplify the shim to just the airflow.sdk imports (or inline them).
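
The general idea behind such a shim is a fallback import: try the newer location first and fall back to the older one. The sketch below shows that pattern with a hypothetical import_first helper and demonstrates it on standard-library modules so it runs without Airflow installed. It is not the contents of _airflow_compat.py.

```python
import importlib
from typing import Any


def import_first(name: str, module_paths: list[str]) -> Any:
    """Return attribute `name` from the first module in `module_paths` that has it."""
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # module missing on this installation; try the next location
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of: {', '.join(module_paths)}")


# Stdlib demonstration: the first module does not exist, so the lookup
# falls through to json.
dumps = import_first("dumps", ["not_a_real_module_xyz", "json"])

# With Airflow installed, a shim could resolve names along these lines
# (shown as a comment because it needs Airflow to run):
# task = import_first("task", ["airflow.sdk", "airflow.decorators"])
```

A plain try/except ImportError around two import statements achieves the same result; the helper form just keeps the list of candidate locations in one place.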

Windows note: Apache Airflow does not officially support Windows, and it emits a warning at import time there. Tests, lint, and the render module all work on Windows, but production deployments should run on Linux or macOS.

See CONTRIBUTING.md.

License

MIT — see LICENSE.
