Capture Spark plans, config, and table metadata for Cluster Yield analysis
Project description
cluster-yield-snapshot
Passive Spark plan capture for Cluster Yield analysis. Drop two lines into any notebook — no refactoring, no query registration, no code changes.
Works on Databricks (serverless + classic), EMR, Dataproc, and open-source Spark.
Install
pip install cluster-yield-snapshot
# In a Databricks notebook
%pip install cluster-yield-snapshot
How it works
Two lines at the top. Two lines at the bottom. Everything in between is untouched:
# Cell 1 — start capture
from cluster_yield_snapshot import CYSnapshot
cy = CYSnapshot(spark).start()
# ═══════════════════════════════════════════
# Rest of the notebook — completely unchanged
# ═══════════════════════════════════════════
df = spark.sql("SELECT * FROM orders WHERE date > '2024-01-01'")
users = spark.table("analytics.users")
enriched = df.join(users, "user_id").groupBy("region").agg(sum("amount"))
enriched.write.parquet("s3://output/regional_revenue")
# Last cell — harvest
cy.stop().save()
That's it. Every spark.sql() call, every .collect(), every .write.parquet() in between is silently captured with its full physical plan. On stop(), catalog stats (table sizes, partitions, file counts) are automatically gathered for every table that appeared in the plans.
What it captures
start() hooks into three places:
| Hook | What it catches | Plan timing |
|---|---|---|
spark.sql() |
Every SQL query | At creation (pre-AQE) |
DataFrame actions (.collect(), .show(), .count(), .toPandas(), etc.) |
Execution results | Post-AQE (final plan) |
Write methods (.write.parquet(), .save(), .saveAsTable(), etc.) |
Data output | Post-AQE (final plan) |
When the same query is captured at both spark.sql() time and action time, the action-time plan (post-AQE) replaces the earlier one. You get the plan Spark actually executed, not just the plan it intended to execute.
stop() then collects catalog metadata:
| Data | Source |
|---|---|
| Table size (bytes) | DESCRIBE DETAIL / Catalyst stats |
| Row count | Table properties / Catalyst stats |
| File count, avg file size | DESCRIBE DETAIL |
| Partition columns | DESCRIBE EXTENDED |
| Spark config + drift | sparkContext.getConf() / SET -v |
| Environment | Platform detection (Databricks / YARN / K8s) |
Upload to Cluster Yield
The server analyzes on ingest — runs detectors, estimates costs, diffs against your last snapshot:
cy = CYSnapshot(spark, api_key="cy_...", environment="prod-analytics").start()
# ... notebook ...
cy.stop().upload()
Install with upload support: pip install cluster-yield-snapshot[upload]
Context manager
with CYSnapshot(spark) as cy:
df = spark.sql("SELECT ...")
df.show()
cy.save()
Manual capture (edge cases)
For queries you can't run through start()/stop() (e.g. building a snapshot from known queries without executing them):
cy = CYSnapshot(spark)
cy.query("daily_revenue", "SELECT region, SUM(amount) FROM orders GROUP BY region")
cy.df("enriched", some_existing_dataframe)
cy.save()
Safety
The capture hooks are read-only and wrapped in try/except:
- They only read
queryExecution.executedPlan— no writes, no modifications - If our code fails for any reason, the user's code continues normally
stop()cleanly restores all original methods- A re-entrancy guard prevents our internal Spark calls (catalog stats) from being captured
- The notebook behaves identically with or without capture running
Snapshot JSON envelope
{
"snapshot": { "version": "0.3.0", "capturedAt": "...", "snapshotType": "environment" },
"environment": { "sparkVersion": "3.5.1", "platform": "databricks", ... },
"config": { "all": {}, "optimizerRelevant": {}, "nonDefault": {} },
"catalog": { "tables": { "default.orders": { "sizeInBytes": 85899345920, ... } } },
"plans": [
{
"label": "sql-1-SELECT * FROM orders WHERE ...",
"fingerprint": "a1b2c3d4...",
"plan": [...],
"sql": "SELECT * FROM orders WHERE date > '2024-01-01'",
"trigger": "action.collect"
}
],
"errors": null
}
Compatible with the Cluster Yield Scala analysis engine, the JVM PlanCaptureListener, and the PlanExtractor — the analyzer is agnostic to capture method.
Module structure
cluster_yield_snapshot/
├── __init__.py # Public API: CYSnapshot, snapshot_capture
├── snapshot.py # Orchestrator: start/stop/save/upload
├── _capture.py # Passive capture engine (monkey-patching)
├── plans.py # Plan extraction, operator parsing, fingerprinting
├── catalog.py # Table stats (DESCRIBE DETAIL/EXTENDED/Catalyst)
├── config.py # Spark config capture + drift detection
├── environment.py # Platform detection (Databricks, YARN, K8s)
├── upload.py # HTTP upload to SaaS backend
├── quick_scan.py # Lightweight teaser findings
├── formatting.py # Terminal summary + Databricks HTML
├── _compat.py # Classic PySpark vs Spark Connect abstraction
└── _util.py # Shared utilities
Spark Connect / Serverless
On Spark Connect, the JVM is not accessible. Plan capture falls back to text explain. Catalog stats fall back to DESCRIBE DETAIL and DESCRIBE EXTENDED (no Catalyst stats). The text plan parser runs server-side for full analysis.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file cluster_yield_snapshot-0.3.11.tar.gz.
File metadata
- Download URL: cluster_yield_snapshot-0.3.11.tar.gz
- Upload date:
- Size: 51.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c2dda45fda0214c6655971f4abb476e4a54fee3db6fbf399004191255398c8e2
|
|
| MD5 |
431874dc8a6da59bec0ebf8775cdd6a6
|
|
| BLAKE2b-256 |
5fa1c1720028059db05185a45008bb8317fcd25f1aa1f5f27aed570aeb1b267d
|
File details
Details for the file cluster_yield_snapshot-0.3.11-py3-none-any.whl.
File metadata
- Download URL: cluster_yield_snapshot-0.3.11-py3-none-any.whl
- Upload date:
- Size: 45.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2ce48b6e1a23d2192b9f13aedf261c1ad4c410b542f22f850c246d4e0047d5e4
|
|
| MD5 |
afb1878994f36d0fd9375e7b29231827
|
|
| BLAKE2b-256 |
b746aea0aa2d7e94a0c25eeb3fe350da2570f1d7cddfe87de46bd8616fd0391b
|