Spark-agnostic Airflow task group for running jobs on AWS Glue, Databricks, or Wherobots from a single DAG entry point.
overture-airflow-provider
An Apache Airflow provider exposing a Spark-agnostic task group that runs PySpark or Scala/Spark jobs on AWS Glue, Databricks, or Wherobots Cloud — from a single, unified DAG-level API.
Write your DAG once, then target any of the supported engines by switching a single
spark_impl_name argument. Cluster shape, Iceberg catalog wiring, JAR / wheel
distribution, and per-platform cluster init are all handled for you.
This project is OSS and intentionally unopinionated: every environment-specific value (S3 buckets, IAM roles, catalog endpoints, package registries) is passed in via typed config dataclasses. No defaults are baked in for any one organization.
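To illustrate what "no baked-in defaults" can look like in practice, here is a minimal sketch of a typed config dataclass in that style. `ExampleStoreConfig`, its validation, and its `uri` helper are hypothetical stand-ins, not the provider's actual `ArtifactStoreConfig` definition; only the field names `s3_bucket` and `s3_root` are borrowed from the quick start.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleStoreConfig:
    """Hypothetical config: every field is required, so nothing is org-specific by default."""

    s3_bucket: str
    s3_root: str

    def __post_init__(self) -> None:
        # Fail early on empty values instead of at job-submission time.
        for field_name in ("s3_bucket", "s3_root"):
            if not getattr(self, field_name):
                raise ValueError(f"{field_name} must be a non-empty string")

    def uri(self, *parts: str) -> str:
        """Join the bucket, root prefix, and extra path parts into an s3:// URI."""
        return "/".join([f"s3://{self.s3_bucket}", self.s3_root.strip("/"), *parts])


config = ExampleStoreConfig(s3_bucket="example-bucket", s3_root="spark-agnostic-operator")
print(config.uri("jars", "job.jar"))
```

Because no field has a default, omitting a value raises a `TypeError` at construction time, which is one way to keep environment-specific settings explicit in the DAG file.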
Status: 0.1.0, initial MVP. Unit + mock test coverage only; live-platform E2E tests are tracked as a follow-up.
Install
pip install overture-airflow-provider
Optional extras for platforms that need extra SDKs:
pip install "overture-airflow-provider[databricks]"
pip install "overture-airflow-provider[wherobots]"
pip install "overture-airflow-provider[all]"
Requires Python >=3.11 and Apache Airflow >=2.11.
Supported versions
Provider requirements
| | Minimum | Also tested |
|---|---|---|
| Python | 3.11 | 3.12, 3.13 |
| Apache Airflow | 2.11 | 3.x |
Spark platform matrix
Pass one of these names as spark_impl_name:
| spark_impl_name | Platform | Spark | Scala | Python runtime |
|---|---|---|---|---|
| GLUE_v4 | AWS Glue 4.0 | 3.3.0 | 2.12 | 3.10 |
| GLUE_v5 | AWS Glue 5.0 | 3.5.2 | 2.12 | 3.11 |
| DATABRICKS_v14 | Databricks Runtime 14.3 LTS | 3.5.0 | 2.12 | 3.10.12 |
| DATABRICKS_v15 | Databricks Runtime 15.4 LTS | 3.5.0 | 2.12 | 3.11.0 |
| WHEROBOTS_v1_5_0 | Wherobots Cloud 1.5.0 | 3.5.0 | 2.12 | 3.11 |
SYNAPSE_v3_3_1 / SYNAPSE_v3_4_1 are defined but not yet active (Azure Synapse support reserved).
Apache Sedona (optional)
Sedona JARs are resolved from Maven Central at runtime. Tested pairings and the minimum Spark version required:
| Sedona | geotools-wrapper | Min Spark |
|---|---|---|
| 1.5.3 | 28.2 | 3.3 |
| 1.6.1 | 28.2 | 3.3 |
| 1.7.0 | 28.5 | 3.3 |
| 1.7.1 | 28.5 | 3.3 |
| 1.7.2 | 28.5 | 3.3 |
| 1.8.1 | 33.1 | 3.4 (Spark 3.3 dropped) |
| 1.9.0 | 33.5 | 3.4 (Spark 3.3 dropped) |
Quick start
from datetime import datetime

from airflow import DAG
from overture_airflow_provider import (
    ArtifactStoreConfig,
    AwsGlueClusterSize,
    GlueConfig,
    IcebergConfig,
    PackageRegistryConfig,
    spark_agnostic_task_group,
)

with DAG(
    dag_id="example_spark_agnostic",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    spark_agnostic_task_group(
        group_id="my_spark_job",
        spark_impl_name="GLUE_v5",
        sedona_version="1.7.0",
        module_name="my_pkg.jobs",
        class_name="MyJob",
        python_packages="my-pkg==1.0.0",
        parameters={"s3_input": "s3://example-bucket/in/", "s3_output": "s3://example-bucket/out/"},
        spark_cluster_size=AwsGlueClusterSize.G_2X.name,
        artifact_store=ArtifactStoreConfig(
            s3_bucket="example-bucket",
            s3_root="spark-agnostic-operator",
        ),
        package_registry=PackageRegistryConfig(
            domain="my-domain",
            domain_owner="123456789012",
            repository="my-pypi",
            region="us-east-1",
        ),
        glue_config=GlueConfig(iam_role_name="AWSGlueServiceRole"),
        iceberg_config=IcebergConfig(spark_config="{}"),
    )
Switching to Databricks or Wherobots is just a different spark_impl_name plus
the corresponding DatabricksConfig / WherobotsConfig dataclass — the
surrounding DAG code does not change.
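One way to keep the platform switch in a single place is to build the platform-specific keyword arguments separately from the shared ones. The sketch below shows that pattern in plain Python. Only `glue_config` appears in the quick start; the keyword names `databricks_config` and `wherobots_config`, and the helper `platform_kwargs`, are assumptions made by analogy and should be checked against the provider before use.

```python
# Only `glue_config` appears in the quick start; the other two keyword names
# are assumptions made by analogy and should be checked against the provider.
CONFIG_KWARG_BY_PREFIX = {
    "GLUE": "glue_config",
    "DATABRICKS": "databricks_config",  # assumed keyword name
    "WHEROBOTS": "wherobots_config",  # assumed keyword name
}


def platform_kwargs(spark_impl_name: str, platform_config: object) -> dict[str, object]:
    """Build the platform-specific slice of the task-group keyword arguments."""
    prefix = spark_impl_name.split("_", 1)[0]
    try:
        config_kwarg = CONFIG_KWARG_BY_PREFIX[prefix]
    except KeyError:
        raise ValueError(f"unsupported spark_impl_name: {spark_impl_name!r}") from None
    return {"spark_impl_name": spark_impl_name, config_kwarg: platform_config}


# Arguments shared by every platform stay in one place.
common = {"group_id": "my_spark_job", "module_name": "my_pkg.jobs", "class_name": "MyJob"}

# A placeholder object stands in for GlueConfig(...) so the sketch runs without Airflow.
glue_stub = object()
kwargs = {**common, **platform_kwargs("GLUE_v5", glue_stub)}
print(sorted(kwargs))
```

In a real DAG the merged dictionary would be unpacked into `spark_agnostic_task_group(**kwargs)`, with the stub replaced by an actual config dataclass instance.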
See examples/example_dag.py for a runnable DAG
that targets all three platforms.
See SPEC.md for the full architecture.
Local rendering (testing without Airflow)
The overture_airflow_provider.render module produces the exact platform
submission payload that the task group would emit, without importing or
executing any Airflow operators. Use it to drive real cloud resources from
the CLI, or to snapshot-test payload shape in CI.
# Render the payload to stdout as JSON.
uv run python -m overture_airflow_provider.render \
    --spark-impl GLUE_v5 --module-name my_module --class-name MyJob

# Render to a directory and emit an executable cli.sh.
uv run python -m overture_airflow_provider.render \
    --spark-impl GLUE_v5 --module-name my_module --class-name MyJob \
    --out ./rendered/
bash ./rendered/cli.sh  # invokes aws glue create-job / start-job-run
You can also drive it programmatically:
from overture_airflow_provider import render_spark_job
result = render_spark_job(
spark_impl_name="DATABRICKS_v15",
module_name="my_module",
class_name="MyJob",
parameters={"date": "2024-01-01"},
)
print(result.submit_payload) # equivalent to `databricks jobs submit --json`
print(result.operator_kwargs) # what the Airflow operator would receive
result.write_to("./out/") # dump JSON payloads + cli.sh
Pass pre_resolved_package_info= / pre_resolved_jar_info= with real S3
URIs from a previous download_python_packages_* / download_jars_* run if
you want to skip the s3://.../REPLACE-ME.whl placeholders.
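For the snapshot-testing use case mentioned above, a small helper along these lines may be enough. This is a sketch under assumptions: `check_snapshot` is a hypothetical helper, the payload dictionary is an invented stand-in for `render_spark_job(...).submit_payload`, and it assumes the rendered payload is a JSON-serializable dict.

```python
import json
import tempfile
from pathlib import Path


def check_snapshot(payload: dict, snapshot_path: Path) -> bool:
    """Compare payload to the stored snapshot; create the snapshot if missing.

    Returns True when the payload matches (or a new snapshot was written).
    """
    rendered = json.dumps(payload, indent=2, sort_keys=True)
    if not snapshot_path.exists():
        snapshot_path.parent.mkdir(parents=True, exist_ok=True)
        snapshot_path.write_text(rendered, encoding="utf-8")
        return True
    return snapshot_path.read_text(encoding="utf-8") == rendered


# Stand-in for `render_spark_job(...).submit_payload`; the keys are invented.
payload = {"run_name": "my_job", "parameters": {"date": "2024-01-01"}}

with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "snapshots" / "databricks_v15.json"
    first = check_snapshot(payload, snapshot)  # writes the snapshot
    same = check_snapshot(payload, snapshot)  # matches the stored copy
    changed = check_snapshot({**payload, "run_name": "other"}, snapshot)
    print(first, same, changed)
```

Serializing with `sort_keys=True` keeps the stored snapshot stable across runs, so a failing comparison points at a real change in payload shape rather than key ordering.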
Development
uv sync --all-extras --group dev
uv run pytest -v
uv run ruff check .
uv run ruff format --check .
Airflow version: supports Airflow 2.11.x and 3.x via a tiny compat shim (_airflow_compat.py) that re-exports DAG, task, task_group, and BaseHook from whichever location exists on the installed Airflow. When dropping 2.x support, simplify the shim to just the airflow.sdk imports (or inline them).
Windows note: Apache Airflow does not officially support Windows (a warning is emitted at import time). Tests, lint, and the render module all work, but production deployments should run on Linux or macOS.
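The compat-shim idea described above (re-export a name from whichever location exists) can be sketched generically as follows. This is not the contents of `_airflow_compat.py`; `first_importable` is a hypothetical helper, and the demo uses standard-library modules so that it runs without Airflow installed.

```python
import importlib
from typing import Any


def first_importable(name: str, modules: list[str]) -> Any:
    """Return attribute `name` from the first module in `modules` that provides it."""
    for module_path in modules:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # Module is absent on this installation; try the next candidate.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of {modules}")


# Demo with stdlib modules: the first candidate does not exist, the second does.
loads = first_importable("loads", ["no_such_json_lib_xyz", "json"])
print(loads("[1, 2]"))

# For Airflow, a call might look like
#   DAG = first_importable("DAG", ["airflow.sdk", "airflow"])
# where the candidate module order is an assumption, not taken from the provider.
```

Listing the newer location first means the fallback is only consulted on older installations, which keeps the later cleanup a matter of deleting the second candidate.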
See CONTRIBUTING.md.
License
MIT — see LICENSE.