Run Airflow DAGs locally and execute Dataproc/Spark jobs in local Docker instead of creating GCP clusters. Generic, zero DAG edits, pluggable test-data providers.

Maintainers

ethedla

Project description

save-gcp-local

Stop paying for Dataproc clusters just to test your Spark jobs. Run them locally in Docker or Podman instead — same code, zero cloud cost, no DAG changes.

Why this exists

Testing Spark jobs on GCP Dataproc is slow and expensive. Every small code change means:

1. Trigger the DAG
2. Wait for a cluster to spin up (1–3 min)
3. Run the job on full data (often 30–40 min)
4. Tear the cluster down
5. Find a bug, repeat, and pay for all of it

The cluster minutes add up fast, especially across a whole team iterating all day.

save-gcp-local removes the cluster entirely. It intercepts the Dataproc steps in your local Airflow and runs the same Spark job in a local container. You iterate in seconds for free, then do one real Dataproc run at the end to confirm scale.

Can you run Dataproc itself locally? No — Dataproc is GCP infrastructure. But your job is plain Apache Spark, which has a built-in local mode. This tool no-ops the cluster steps and runs your job locally. That is the whole trick, and it is enough to save the money.
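
The trick described above can be sketched in a few lines. This is only an illustration of runtime patching, not the package's actual code: the `Fake*` classes stand in for Airflow's Dataproc operators so the snippet runs without Airflow installed, and `patch_for_local` and `run_locally` are hypothetical names.

```python
"""Sketch: turn cluster operators into no-ops and reroute job submission."""


class FakeCreateCluster:
    """Stand-in for a cluster-lifecycle operator (hypothetical)."""

    def execute(self, context):
        raise RuntimeError("would create a real cluster")


class FakeSubmitJob:
    """Stand-in for a job-submit operator (hypothetical)."""

    def __init__(self, main_file):
        self.main_file = main_file

    def execute(self, context):
        raise RuntimeError("would submit to a real cluster")


def patch_for_local(noop_classes, submit_classes, run_locally):
    """Replace `execute` on each class; the DAG file itself is untouched."""

    def noop_execute(self, context):
        return None  # cluster steps do nothing locally

    def local_execute(self, context):
        return run_locally(self.main_file)  # hand the job to a local runner

    for cls in noop_classes:
        cls.execute = noop_execute
    for cls in submit_classes:
        cls.execute = local_execute


launched = []
patch_for_local(
    noop_classes=[FakeCreateCluster],
    submit_classes=[FakeSubmitJob],
    # A real runner would start a container; here we only record the job path.
    run_locally=lambda path: launched.append(path) or "ok",
)

create_result = FakeCreateCluster().execute(context={})
submit_result = FakeSubmitJob("jobs/etl.py").execute(context={})
```

Because the methods are swapped on the classes, any operator instance the DAG creates picks up the local behavior, which is why no DAG edits are needed in this kind of design.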

What you save

Step	On Dataproc	Locally
Cluster create	1–3 min + $	skipped, $0
Job run	30–40 min + $	seconds–min, $0
Cluster delete	~1 min + $	skipped, $0
Per iteration	~40 min + cluster cost	~minutes, free

Key features

Zero DAG edits — works by patching Dataproc operators at runtime
Generic — any Dataproc operator, PySpark or Scala/Java JARs, any project layout
Docker or Podman (or a local spark-submit) — auto-detected, daemon health checked
Jobs anywhere — in the Airflow repo, a subfolder, a JAR, or a separate repo
Test data your way — none / real-data sample / synthetic / your own provider
Custom operator subclasses — patch internal wrappers via DPL_EXTRA_*_OPERATORS
Airflow 2.x and 3.x — plugin for 2.x, early-patch .pth for 3.x
Missing google provider — installs mock stubs so DAGs still import and parse
One switch to go back to GCP — DPL_ENABLED=false
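
Runtime auto-detection, as mentioned in the feature list, could look roughly like the sketch below. The preference order and the function name `pick_runtime` are assumptions for illustration; the package may detect runtimes differently, and the daemon health check is not shown here.

```python
import shutil


def pick_runtime(candidates=("docker", "podman", "spark-submit"), which=shutil.which):
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


# Simulate a machine that only has Podman by injecting a fake PATH lookup.
def fake_which(name):
    return "/usr/bin/podman" if name == "podman" else None


chosen = pick_runtime(which=fake_which)
nothing = pick_runtime(which=lambda name: None)
```

Calling `pick_runtime()` with no arguments checks the real PATH. A health check could then run the chosen tool's `info` subcommand (for example `docker info`) and treat a non-zero exit code as "daemon not running".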

Install

pip install "save-gcp-local[all]"        # from PyPI
# or from source:
git clone https://github.com/EshwarCVS/save-gcp-local
cd save-gcp-local && pip install -e ".[all]"

60-second start

# 1. Point at your test data (jobs inside the Airflow repo are auto-found)
export DPL_DATA_DIR=./data

# 2. (optional) make test data — pick ONE
save-gcp-local gen-data --provider sample    --input prod.csv --output ./data/events.csv --pct 1
save-gcp-local gen-data --provider synthetic --input prod.csv --output ./data/events.csv --rows 200000

# 3. run your DAG locally — Dataproc steps run in a container
save-gcp-local run --dags ./dags --dag my_pipeline --execution-date 2024-06-01

Prefer the UI? Drop a one-liner into $AIRFLOW_HOME/plugins/ and boot Airflow normally — see QUICKSTART.md.
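
To give a feel for what a percentage-based sample of a production CSV involves (step 2 above), here is a minimal sketch. It is not the `sample` provider's implementation; `sample_csv` and its parameters are hypothetical, and the in-memory data is made up for the demonstration.

```python
import csv
import io
import random


def sample_csv(src, dst, pct, seed=0):
    """Copy the header and roughly `pct` percent of the data rows from src to dst."""
    rng = random.Random(seed)  # seeded so repeated runs pick the same rows
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader, None)
    if header is None:
        return 0  # empty input: nothing to write
    writer.writerow(header)
    kept = 0
    for row in reader:
        # Keep each row independently with probability pct / 100.
        if rng.random() * 100 < pct:
            writer.writerow(row)
            kept += 1
    return kept


# Demonstration on 1000 made-up rows held in memory.
source = io.StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000)))
target = io.StringIO()
kept_rows = sample_csv(source, target, pct=10)
```

Row-by-row sampling keeps memory use flat, so the same approach works on files too large to load at once. The sample size is approximate rather than exact.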

Documentation

QUICKSTART.md — 5-minute setup
SETUP.md — full guide: install options, config, both entry points, test-data strategies, troubleshooting
CICD.md — CI/CD pipeline, release process, branch protection
CONTRIBUTING.md — dev setup, tests, how to add a data provider
Docs site — full documentation website

How it works

            +--------------- your local Airflow ---------------+
            |                                                   |
  DAG --->  CreateCluster -> SubmitJob -> DeleteCluster         |
            |   (no-op)         |            (no-op)            |
            |                   +-- runs in Docker/Podman --+   |
            +-------------------+--------------------------+----+
                                v
                       spark-submit --master local[*]
                       with /data, /jobs, /output mounted in

Cluster lifecycle operators become no-ops. Job-submit operators run your Spark code in a local container with your job files and test data mounted in.
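
The container invocation in the diagram can be approximated as follows. This is a sketch under assumptions: the image name `apache/spark:3.5.1`, the `/opt/spark/bin/spark-submit` path inside it, and the helper `build_container_command` are illustrative choices, not values confirmed by this package.

```python
from pathlib import Path


def build_container_command(runtime, image, job, data_dir, jobs_dir, output_dir, job_args=()):
    """Return an argv list that runs `job` with Spark local mode in a container."""
    mounts = [(data_dir, "/data"), (jobs_dir, "/jobs"), (output_dir, "/output")]
    cmd = [runtime, "run", "--rm"]
    for host_path, container_path in mounts:
        # -v host:container bind-mounts a host directory into the container.
        cmd += ["-v", f"{Path(host_path).resolve()}:{container_path}"]
    cmd += [
        image,
        "/opt/spark/bin/spark-submit",  # assumed location inside the image
        "--master", "local[*]",  # Spark local mode using all available cores
        f"/jobs/{job}",
        *job_args,
    ]
    return cmd


cmd = build_container_command(
    runtime="docker",
    image="apache/spark:3.5.1",  # assumed image, swap in your own
    job="etl.py",
    data_dir="./data",
    jobs_dir="./jobs",
    output_dir="./output",
    job_args=["--input", "/data/events.csv"],
)
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would start the job. Building the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.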

Supported operators

Cluster lifecycle (no-op): DataprocCreateClusterOperator, DataprocDeleteClusterOperator, DataprocUpdateClusterOperator, DataprocStartClusterOperator, DataprocStopClusterOperator, workflow-template operators, DataprocSubmitHiveJobOperator.

Job submission (runs locally): DataprocSubmitJobOperator, DataprocCreateBatchOperator, and the legacy DataprocSubmitPySparkJobOperator, DataprocSubmitSparkJobOperator, DataprocSubmitSparkSqlJobOperator, and DataprocSubmitHadoopJobOperator.

Custom operator subclasses (e.g. internal wrappers that extend the base operators) can be patched via DPL_EXTRA_NOOP_OPERATORS and DPL_EXTRA_SUBMIT_OPERATORS — see SETUP.md §7.
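
As a rough picture of how such environment switches might be read, consider the sketch below. The value format is an assumption: it treats the `DPL_EXTRA_*` variables as comma-separated class names and `DPL_ENABLED` as on unless set to `false`. SETUP.md is the authority on the real format. `read_settings` and the `Acme*` class names are hypothetical.

```python
def read_settings(env):
    """Read the DPL_* switches from a mapping such as os.environ."""

    def names(key):
        # ASSUMPTION: comma-separated class names; check SETUP.md for the real format.
        return [part.strip() for part in env.get(key, "").split(",") if part.strip()]

    return {
        # ASSUMPTION: patching is on unless DPL_ENABLED is explicitly "false".
        "enabled": env.get("DPL_ENABLED", "true").strip().lower() != "false",
        "extra_noop": names("DPL_EXTRA_NOOP_OPERATORS"),
        "extra_submit": names("DPL_EXTRA_SUBMIT_OPERATORS"),
    }


settings = read_settings(
    {
        "DPL_EXTRA_NOOP_OPERATORS": "AcmeCreateCluster, AcmeDeleteCluster",
        "DPL_EXTRA_SUBMIT_OPERATORS": "AcmeSubmitJob",
    }
)
disabled = read_settings({"DPL_ENABLED": "false"})
```

In real use the function would be called with `os.environ`. Taking the mapping as a parameter keeps the parsing easy to test.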

Limitations (be honest with your team)

Local Spark is a single machine — validate logic locally, scale on GCP once.
Absolute row counts / huge-shuffle behavior will not match production.
If a job hardcodes gs:// or BigQuery paths inside the code (not as an argument), parameterize the input so it can point at /data.
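
Parameterizing the input can be as small as the sketch below. The bucket paths and the function name `parse_job_args` are made up for illustration; only the argument parsing is shown so the snippet runs without Spark installed.

```python
import argparse


def parse_job_args(argv=None):
    """Parse job arguments; defaults point at (hypothetical) production paths."""
    parser = argparse.ArgumentParser(description="Example Spark job arguments")
    parser.add_argument("--input", default="gs://my-bucket/events/", help="input path or URI")
    parser.add_argument("--output", default="gs://my-bucket/out/", help="output path or URI")
    return parser.parse_args(argv)


# Production run: no arguments, so the gs:// defaults apply.
prod = parse_job_args([])

# Local run: the same job pointed at the mounted /data and /output folders.
local = parse_job_args(["--input", "/data/events.csv", "--output", "/output/result"])
```

Inside the job, the parsed value would then feed the reader, for example `spark.read.csv(args.input, header=True)`, so the same code runs against either location.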

License

MIT — see LICENSE.

Release history

0.2.1 (this version): Jun 4, 2026
0.2.0: Jun 4, 2026
0.1.0: Jun 4, 2026

Download files

Source Distribution

save_gcp_local-0.2.1.tar.gz (38.0 kB)

Uploaded Jun 4, 2026 Source

Built Distribution

save_gcp_local-0.2.1-py3-none-any.whl (24.7 kB)

Uploaded Jun 4, 2026 Python 3

File details

Details for the file save_gcp_local-0.2.1.tar.gz.

File metadata

Download URL: save_gcp_local-0.2.1.tar.gz
Upload date: Jun 4, 2026
Size: 38.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for save_gcp_local-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`b3db5a79d3dbb9bddf9ed257a3965377fcb9c1edade0e5b7dccaf32c521e1f64`
MD5	`7e38b3fbc95eb7108138058006cded9d`
BLAKE2b-256	`3e982ac2b21245f4f6f6aefed97b96e25af460b769a9ee180440bbd9eb02630f`
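
A downloaded file can be checked against the SHA256 digest in the table with the standard library alone. The helper names `sha256_of` and `matches` are illustrative, and the self-check at the end uses a throwaway temp file rather than the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-check on a throwaway file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_ok = matches(tmp.name, hashlib.sha256(b"hello").hexdigest())
os.unlink(tmp.name)
```

For the source distribution, the call would be `matches("save_gcp_local-0.2.1.tar.gz", "b3db5a79d3dbb9bddf9ed257a3965377fcb9c1edade0e5b7dccaf32c521e1f64")`, using the SHA256 value from the table above.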

Provenance

The following attestation bundles were made for save_gcp_local-0.2.1.tar.gz:

Publisher: deploy.yml on EshwarCVS/save-gcp-local

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: save_gcp_local-0.2.1.tar.gz
- Subject digest: b3db5a79d3dbb9bddf9ed257a3965377fcb9c1edade0e5b7dccaf32c521e1f64
- Sigstore transparency entry: 1726165893
- Sigstore integration time: Jun 4, 2026
Source repository:
- Permalink: EshwarCVS/save-gcp-local@b2056fd1b9375a0f349457708509bb36b7abb9d8
- Branch / Tag: refs/heads/master
- Owner: https://github.com/EshwarCVS
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: deploy.yml@b2056fd1b9375a0f349457708509bb36b7abb9d8
- Trigger Event: push

File details

Details for the file save_gcp_local-0.2.1-py3-none-any.whl.

File metadata

Download URL: save_gcp_local-0.2.1-py3-none-any.whl
Upload date: Jun 4, 2026
Size: 24.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for save_gcp_local-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c5959d70706df8b6508863a973eafef9cc910460b499b4c201f3aa0a6ddfa7e5`
MD5	`616fa1c730256a757f60be454674197a`
BLAKE2b-256	`c382729c167dbf988b26a5d618c563ffcfcd376cadfcd4a540f511ca89b67883`

Provenance

The following attestation bundles were made for save_gcp_local-0.2.1-py3-none-any.whl:

Publisher: deploy.yml on EshwarCVS/save-gcp-local

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: save_gcp_local-0.2.1-py3-none-any.whl
- Subject digest: c5959d70706df8b6508863a973eafef9cc910460b499b4c201f3aa0a6ddfa7e5
- Sigstore transparency entry: 1726166128
- Sigstore integration time: Jun 4, 2026
Source repository:
- Permalink: EshwarCVS/save-gcp-local@b2056fd1b9375a0f349457708509bb36b7abb9d8
- Branch / Tag: refs/heads/master
- Owner: https://github.com/EshwarCVS
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: deploy.yml@b2056fd1b9375a0f349457708509bb36b7abb9d8
- Trigger Event: push
