Run Airflow DAGs locally and execute Dataproc/Spark jobs in local Docker instead of creating GCP clusters. Generic, zero DAG edits, pluggable test-data providers.
save-gcp-local
Stop paying for Dataproc clusters just to test your Spark jobs. Run them locally in Docker or Podman instead — same code, zero cloud cost, no DAG changes.
Why this exists
Testing Spark jobs on GCP Dataproc is slow and expensive. Every small code change means:
- Trigger the DAG
- Wait for a cluster to spin up (1–3 min)
- Run the job on full data (often 30–40 min)
- Tear the cluster down
- Find a bug -> repeat — and pay for all of it
The cluster minutes add up fast, especially across a whole team iterating all day.
save-gcp-local removes the cluster entirely. It intercepts the Dataproc steps in your local Airflow and runs the same Spark job in a local container. You iterate in seconds for free, then do one real Dataproc run at the end to confirm scale.
Can you run Dataproc itself locally? No — Dataproc is GCP infrastructure. But your job is plain Apache Spark, which has a built-in local mode. This tool no-ops the cluster steps and runs your job locally. That is the whole trick, and it is enough to save the money.
What you save
| Step | On Dataproc | Locally |
|---|---|---|
| Cluster create | 1–3 min + $ | skipped, $0 |
| Job run | 30–40 min + $ | seconds–min, $0 |
| Cluster delete | ~1 min + $ | skipped, $0 |
| Per iteration | ~40 min + cluster cost | ~minutes, free |
Key features
- Zero DAG edits: works by patching Dataproc operators at runtime
- Generic: any Dataproc operator, PySpark or Scala/Java JARs, any project layout
- Docker or Podman (or a local spark-submit): auto-detected
- Jobs anywhere: in the Airflow repo, a subfolder, a JAR, or a separate repo
- Test data your way: none / real-data sample / synthetic / your own provider
- Two entry points: a CLI and an auto-loading Airflow plugin
- One switch to go back to GCP: DPL_ENABLED=false
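To make the switch and the auto-detection concrete, here is a minimal sketch of how they might be evaluated. The function names, the set of values treated as "off", the default of "on", and the detection order are all assumptions for illustration; the package's real logic may differ.

```python
import os
import shutil

# Assumption: these spellings count as "off"; the real tool may accept others.
_FALSEY = {"0", "false", "no", "off"}


def local_mode_enabled(env=None):
    """Return True unless DPL_ENABLED is set to a falsey value (assumed default: on)."""
    env = os.environ if env is None else env
    return env.get("DPL_ENABLED", "true").strip().lower() not in _FALSEY


def detect_runtime(which=shutil.which):
    """Return the first runtime found on PATH, or None if none is available.

    The order docker -> podman -> spark-submit is an assumption.
    """
    for candidate in ("docker", "podman", "spark-submit"):
        if which(candidate):
            return candidate
    return None
```

Passing `env` and `which` as parameters keeps the sketch easy to test without touching the real environment.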
Install
pip install "save-gcp-local[all]" # from PyPI (when published)
# or from source:
git clone https://github.com/EshwarCVS/save-gcp-local
cd save-gcp-local && pip install -e ".[all]"
60-second start
# 1. Point at your test data (jobs inside the Airflow repo are auto-found)
export DPL_DATA_DIR=./data
# 2. (optional) make test data — pick ONE
save-gcp-local gen-data --provider sample --input prod.csv --output ./data/events.csv --pct 1
save-gcp-local gen-data --provider synthetic --input prod.csv --output ./data/events.csv --rows 200000
# 3. run your DAG locally — Dataproc steps run in a container
save-gcp-local run --dags ./dags --dag my_pipeline --execution-date 2024-06-01
Prefer the UI? Drop a one-liner into $AIRFLOW_HOME/plugins/ and boot Airflow normally — see QUICKSTART.md.
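For intuition about step 2, a "sample" provider could be as simple as the sketch below: keep the header and roughly `pct` percent of the data rows. `sample_csv` and its `seed` parameter are hypothetical names chosen for this example, not the package's API.

```python
import csv
import random


def sample_csv(src, dst, pct, seed=0):
    """Copy the header and roughly `pct` percent of the data rows from src to dst.

    Returns the number of data rows written.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    kept = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)
        for row in reader:
            # Each row is kept independently with probability pct / 100.
            if rng.random() * 100 < pct:
                writer.writerow(row)
                kept += 1
    return kept
```

Row-level random sampling streams the file once, so it works on inputs too large to load into memory, at the cost of an approximate rather than exact sample size.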
Documentation
- QUICKSTART.md — 5-minute setup
- SETUP.md — full guide: install options, config, both entry points, test-data strategies, troubleshooting
- CONTRIBUTING.md — dev setup, tests, how to add a data provider
How it works
         +--------------- your local Airflow ----------------+
         |                                                   |
DAG --->   CreateCluster -> SubmitJob -> DeleteCluster       |
         |    (no-op)        |              (no-op)          |
         |                   +-- runs in Docker/Podman --+   |
         +-------------------+---------------------------+---+
                             v
                 spark-submit --master local[*]
                 with /data, /jobs, /output mounted in
Cluster lifecycle operators become no-ops. Job-submit operators run your Spark code in a local container with your job files and test data mounted in.
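The container invocation described above can be sketched as a function that assembles the command line. This is an illustration only: `build_container_command` is a hypothetical helper, `my-spark-image` is a placeholder for an image assumed to have `spark-submit` on its PATH, and the real tool may pass different flags.

```python
from pathlib import Path


def build_container_command(runtime, image, job, data_dir, jobs_dir, output_dir, job_args=()):
    """Build a `docker run` / `podman run` argv that runs a job with local-mode Spark."""
    mounts = []
    for host_dir, target in ((data_dir, "/data"), (jobs_dir, "/jobs"), (output_dir, "/output")):
        # Bind-mount each host directory at a fixed path inside the container.
        mounts += ["-v", f"{Path(host_dir).resolve()}:{target}"]
    return [
        runtime, "run", "--rm", *mounts, image,
        "spark-submit", "--master", "local[*]", f"/jobs/{job}", *job_args,
    ]


cmd = build_container_command(
    "docker", "my-spark-image", "etl.py",
    "./data", "./jobs", "./output",
    job_args=["--input", "/data/events.csv"],
)
print(" ".join(cmd))
```

Returning an argv list rather than a shell string means the result can be handed straight to `subprocess.run` without quoting concerns.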
Supported operators
Cluster lifecycle (no-op): DataprocCreateClusterOperator, DataprocDeleteClusterOperator, DataprocUpdate/Start/StopClusterOperator, workflow-template operators.
Job submission (runs locally): DataprocSubmitJobOperator, DataprocCreateBatchOperator, and legacy DataprocSubmitPySparkJobOperator / SparkJobOperator / SparkSqlJobOperator / HadoopJobOperator.
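The two categories correspond to two kinds of runtime patch. The sketch below shows the general technique on stand-in classes so that it runs without Airflow installed; `patch_noop` and `patch_run_locally` are hypothetical names, and the package's actual patching code may be structured differently.

```python
# Stand-ins so the sketch runs without Airflow; the real operators live in
# airflow.providers.google.cloud.operators.dataproc.
class DataprocCreateClusterOperator:
    def execute(self, context):
        raise RuntimeError("would create a real cluster")


class DataprocSubmitJobOperator:
    def __init__(self, job):
        self.job = job

    def execute(self, context):
        raise RuntimeError("would submit to a real cluster")


def patch_noop(operator_cls):
    """Replace execute() with a function that does nothing and returns None."""
    def execute(self, context):
        return None
    operator_cls.execute = execute


def patch_run_locally(operator_cls, runner):
    """Replace execute() so the job spec goes to a local runner instead of GCP."""
    def execute(self, context):
        return runner(self.job)
    operator_cls.execute = execute


submitted = []  # stands in for "run this job in a container"
patch_noop(DataprocCreateClusterOperator)
patch_run_locally(DataprocSubmitJobOperator, submitted.append)

# The DAG code is unchanged: it still instantiates and executes the same classes.
DataprocCreateClusterOperator().execute(context={})
DataprocSubmitJobOperator({"pyspark_job": {"main_python_file_uri": "jobs/etl.py"}}).execute(context={})
```

Because the patch replaces the method on the class itself, every DAG that imports the operator picks up the new behavior without any edit to the DAG file.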
Limitations (be honest with your team)
- Local Spark is a single machine — validate logic locally, scale on GCP once.
- Absolute row counts / huge-shuffle behavior will not match production.
- If a job hardcodes gs:// or BigQuery paths inside the code (not as an argument), parameterize the input so it can point at /data.
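As an illustration of the last point, a job can take its input and output locations as arguments instead of hardcoding them. The argument names and the bucket name below are placeholders for this example, not something the tool requires.

```python
import argparse


def parse_args(argv=None):
    """Parse job arguments; paths come from the caller, never from the code."""
    parser = argparse.ArgumentParser(description="Example job entry point")
    parser.add_argument("--input", required=True, help="input path or URI")
    parser.add_argument("--output", required=True, help="output path or URI")
    return parser.parse_args(argv)


# Local run: paths under the mounted directories.
local = parse_args(["--input", "/data/events.csv", "--output", "/output/result"])

# Production run: the same job, pointed at cloud storage (placeholder bucket).
prod = parse_args(["--input", "gs://my-bucket/events.csv", "--output", "gs://my-bucket/result"])

print(local.input, prod.input)
```

The job body then reads from `args.input` and writes to `args.output`, so switching between local and Dataproc runs changes only the arguments passed in.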
License
MIT — see LICENSE.