Production-grade PySpark auditing, optimization, and governance framework — static + dynamic analysis with local AI assistance
Project description
SparkGuardian
Production-grade PySpark auditing, optimization, and governance framework.
SparkGuardian detects PySpark anti-patterns, analyzes execution plans, generates prioritized remediation plans, and integrates into CI/CD pipelines — all 100% offline, no cloud APIs, no source code exfiltration.
Features
| Feature | Description |
|---|---|
| Static AST Analysis | Deterministic analysis via libcst — never executes user code |
| Execution Plan Analysis | Parses df.explain() output, detects shuffles / CartesianProduct / AQE |
| Safe Auto-Refactoring | Rewrites code with backup, dry-run preview, interactive mode |
| Local AI Assistant | Ollama-powered explanations, structured analysis, remediation plans |
| AI Chat Session | Interactive multi-turn chat about findings with history management |
| RAG Knowledge Base | Keyword-based Spark knowledge retrieval — no external vector DB |
| Performance Scoring | 0–100 score across reliability, cost, and clarity dimensions |
| CI/CD Integration | Exit codes 0/1/2 — GitHub Actions, GitLab CI, Jenkins |
| Rule Governance | Enable/disable rules, override severity, per-project config (TOML/YAML) |
| Ignore System | # cy:ignore CY008 inline and block suppression with audit trail |
| Rich Reporting | Terminal (colored), JSON, Markdown, HTML (dark theme), SARIF v2.1.0 |
| Cloud Cost Estimator | Translates anti-patterns into monthly/yearly $$$ waste on AWS EMR / Databricks / GCP / Azure |
| Audit History | Longitudinal tracking with regression detection + CSV export for academic analysis |
Installation
# Core (PyPI)
pip install sparkguardian
# With local AI (Ollama integration)
pip install "sparkguardian[ai]"
# Development install (from source)
git clone https://github.com/sparkguardian/sparkguardian
cd sparkguardian
pip install -e ".[dev]"
Verify install:
sparkguardian --version # SparkGuardian v0.1.0
sparkguardian --help # list 9 commands
sparkguardian analyze app.py
Requirements: Python 3.11 or higher. Tested on Python 3.11, 3.12, 3.13.
Build from source
pip install build
python -m build # creates dist/*.whl + *.tar.gz
twine check dist/* # validate before upload
pip install dist/sparkguardian-0.1.0-py3-none-any.whl
Quick Start
# Analyze a file
sparkguardian analyze app.py
# Analyze a directory
sparkguardian analyze src/
# Get JSON output (for CI integration)
sparkguardian analyze app.py --json
# Generate HTML report
sparkguardian analyze app.py --html report.html
# Preview auto-fixes (no file changes)
sparkguardian fix app.py --dry-run
# Apply safe auto-fixes (with backup)
sparkguardian fix app.py
# Analyze Spark execution plan
sparkguardian explain-plan plan.txt
# List all available rules
sparkguardian list-rules
# AI-powered remediation plan
sparkguardian remediation app.py
# Interactive AI chat about findings
sparkguardian chat app.py
# Check AI (Ollama) availability
sparkguardian ai-status
# Cloud cost estimation — translate issues to $/month
sparkguardian cost app.py
sparkguardian cost app.py --data-tb 5 --nodes 30 --cloud databricks
# SARIF v2.1.0 output for GitHub Code Scanning
sparkguardian analyze app.py --sarif report.sarif
# Audit history — track score evolution over time
sparkguardian analyze app.py --history # record snapshot
sparkguardian history # view evolution
sparkguardian history --csv evolution.csv # academic analysis export
Exit Codes
| Code | Meaning | CI behavior |
|---|---|---|
0 |
No issues | Build passes |
1 |
CRITICAL issue found | Build fails |
2 |
Warnings only | Build passes (configurable) |
Rules
Static Rules — AST Analysis (16 rules)
| Rule | Title | Severity | Impact | Autofix |
|---|---|---|---|---|
| CY001 | .collect() — driver memory risk |
HIGH | RELIABILITY | |
| CY002 | .toPandas() — large local materialization |
HIGH | RELIABILITY | |
| CY003 | .count() abuse — full scan triggered |
MEDIUM | COST | |
| CY004 | Spark action inside loop | CRITICAL | RELIABILITY | |
| CY005 | .show() in production code |
LOW | RELIABILITY | |
| CY008 | repartition(1) detected |
HIGH | COST | ✓ |
| CY009 | Chained withColumn() (5+) |
MEDIUM | COST | |
| CY010 | Implicit join — missing how= |
LOW | CLARITY | |
| CY011 | select("*") — broad column selection |
LOW | CLARITY | |
| CY012 | cache()/persist() — verify necessity |
LOW | COST | |
| CY013 | Repeated scan of same data source | HIGH | COST | |
| CY014 | filter() after groupBy() — reorder |
MEDIUM | COST | |
| CY016 | crossJoin() — explicit CartesianProduct |
HIGH | COST | |
| CY017 | distinct() — full shuffle |
MEDIUM | COST | |
| CY018 | orderBy() without limit() — global sort |
MEDIUM | COST | |
| CY020 | Python UDF — use native SQL functions | HIGH | COST |
Dynamic Rules — Execution Plan (5 rules)
| Rule | Title | Severity | Impact |
|---|---|---|---|
| CY050 | CartesianProduct node detected |
CRITICAL | COST |
| CY051 | Excessive shuffle (Exchange) nodes |
HIGH | COST |
| CY052 | SortMergeJoin — broadcast opportunity |
MEDIUM | COST |
| CY053 | Window function without partitionBy |
HIGH | COST |
| CY054 | AQE (AdaptiveSparkPlan) not active |
MEDIUM | COST |
Ignore System
Suppress rules inline or by block, with full audit trail in reports.
# Inline — suppress on this line only
df.repartition(1) # cy:ignore CY008
# Block — suppress across multiple lines
# cy:ignore-start CY008
df_single = df.repartition(1)
df_single.write.parquet("output/")
# cy:ignore-end CY008
Ignored findings still appear in reports as suppressed (use --show-ignored).
Configuration
Create .sparkguardian.toml at project root:
[sparkguardian]
fail_on_critical = true
fail_on_high = false
warning_threshold = 10
ignore_paths = ["tests/", "examples/"]
[rules.CY003]
enabled = true
severity = "MEDIUM"
[rules.CY008]
enabled = true
severity = "HIGH"
# Disable a rule project-wide
[rules.CY005]
enabled = false
YAML format also supported (.sparkguardian.yaml).
Local AI Setup
All inference runs locally via Ollama. No data leaves your machine.
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a supported model
ollama pull deepseek-coder:6.7b # recommended
# or: ollama pull codellama:7b
# or: ollama pull llama3:8b
# or: ollama pull mistral:7b
# Start Ollama server
ollama serve
# Verify setup
sparkguardian ai-status
AI Commands
# AI explanations with each finding
sparkguardian analyze app.py --ai
# AI-powered prioritized remediation plan
sparkguardian remediation app.py
sparkguardian remediation app.py --json # machine-readable
# Interactive chat about findings
sparkguardian chat app.py
AI Architecture
LocalLLM
├── LRU response cache (128 entries, hit-rate tracking)
├── System prompt injection (role anchor)
├── Structured JSON output parsing (AIAnalysis dataclass)
├── Streaming support (token-by-token terminal output)
└── Auto model selection (deepseek-coder → codellama → llama3 → mistral)
AISession (multi-turn chat)
├── Conversation history with automatic trimming
├── Context injection (RAG knowledge base)
└── Reset / turn counting
RemediationPlanner
├── LLM-driven: priority_order, effort_estimate, quick_wins, blockers
└── Heuristic fallback (deterministic, works without Ollama)
SparkKnowledgeBase (RAG)
├── 24 entries from official Apache Spark 4.1.1 documentation
│ Sources: tuning.html · sql-performance-tuning.html · rdd-programming-guide.html
├── query(question, rule_id) — lookup by rule_id or keyword scoring
├── query_with_config(rule_id) → (content, config_params[]) — includes spark.* params
├── get_config_params(rule_id) → list[str] — exact config keys with defaults
└── sources() → list[str] — traceable documentation URLs per entry
Cloud Cost Estimator
Translates detected anti-patterns into estimated monthly/yearly USD waste on major cloud providers. Cost models are derived from official pricing (AWS EMR, Databricks, GCP Dataproc, Azure Synapse) and published Spark performance research.
sparkguardian cost examples/bad_pipeline.py
# 💸 Estimated waste: $2,407/month ($28,884/year)
sparkguardian cost prod_pipeline/ --data-tb 50 --nodes 100 --cloud databricks
# 💸 Estimated waste: $156,420/month ($1,877,040/year)
sparkguardian cost app.py --json # machine-readable for FinOps integration
Pricing methodology (2025-2026 baseline):
- AWS EMR on-demand: ~$0.063/vCPU-hour
- Databricks DBU: $0.40/DBU-hour (Jobs Compute) — ×1.3 vs EMR
- GCP Dataproc: ×0.9 vs EMR
- Azure Synapse: ×1.1 vs EMR
- S3: $0.023/GB stored + $0.005/1k GET requests
- Cross-AZ egress: $0.09/GB
Each rule has a calibrated cost model factoring in severity, occurrence count (sub-linear), and workload scale.
Audit History (Longitudinal Analysis)
Track code quality evolution across commits, releases, or daily runs. Supports CSV export for academic and research analysis.
# Record a snapshot after analysis
sparkguardian analyze . --history
# View evolution
sparkguardian history
# ↘️ Latest delta: -8.5
# ⚠️ REGRESSION DETECTED (delta > -5)
# Export to CSV for plotting / longitudinal study
sparkguardian history --csv evolution.csv
# Filter by file
sparkguardian history --target src/pipeline.py
Snapshots stored in .sparkguardian-history.json with git commit/branch metadata for traceability.
SARIF Output (GitHub Code Scanning)
Native integration with GitHub Code Scanning, GitLab SAST, Azure DevOps, and SonarQube:
sparkguardian analyze . --sarif report.sarif
In .github/workflows/sparkguardian.yml:
- name: SparkGuardian
run: sparkguardian analyze . --sarif sparkguardian.sarif
- uses: github/codeql-action/upload-sarif@v3
with:
sarif_file: sparkguardian.sarif
Findings appear directly in the Security tab of GitHub with severity-graded annotations on pull requests.
Architecture
sparkguardian/
├── cli.py # Typer CLI (9 commands)
├── config.py # TOML/YAML loader
├── constants.py # Exit codes, defaults
│
├── parser/
│ ├── ast_analyzer.py # libcst static analysis
│ ├── plan_analyzer.py # Execution plan analysis
│ ├── execution_parser.py # Plan parser
│ ├── ignore_handler.py # cy:ignore engine
│ └── node_extractors.py # NodeType classification
│
├── rules/
│ ├── base.py # BaseRule ABC
│ ├── registry.py # @register decorator system
│ ├── static_rules.py # 16 AST rules
│ ├── dynamic_rules.py # 5 plan rules
│ ├── categories.py
│ └── severity.py
│
├── models/
│ ├── issue.py # Issue, Location, Severity, ImpactType
│ ├── report.py # AnalysisReport
│ └── execution_node.py # ExecutionNode, NodeType
│
├── rewriter/
│ ├── safe_rewriter.py # SafeRewriter (backup + apply)
│ ├── transformer.py # libcst CSTTransformers
│ └── diff_generator.py # Unified diff output
│
├── ai/
│ ├── __init__.py # Public exports: LocalLLM, AISession, SparkKnowledgeBase…
│ ├── local_llm.py # LocalLLM (cache + streaming + structured output)
│ ├── session.py # AISession (multi-turn history, context injection)
│ ├── remediation.py # RemediationPlanner (LLM + heuristic fallback)
│ ├── structured.py # AIAnalysis, RemediationPlan dataclasses + JSON parser
│ ├── cache.py # LRU AIResponseCache (128 entries, hit-rate tracking)
│ ├── rag.py # SparkKnowledgeBase (24 entries, official Spark 4.1.1 docs)
│ └── prompts.py # Versioned prompt library (system, structured, chat, remediation)
│
├── reporting/
│ ├── terminal_reporter.py # Rich colored output
│ ├── json_reporter.py
│ ├── markdown_reporter.py
│ ├── html_reporter.py # Dark-theme HTML
│ └── sarif_reporter.py # SARIF v2.1.0 (GitHub Code Scanning)
│
├── scoring/
│ └── score_engine.py # 0–100 performance score
│
├── cost/ # NEW
│ └── estimator.py # Cloud cost models (EMR/Databricks/GCP/Azure)
│
├── history/ # NEW
│ └── tracker.py # Longitudinal audit snapshots + CSV export
│
├── ci/
│ └── exit_codes.py # resolve_exit_code()
│
└── utils/
├── logger.py
└── file_utils.py # backup_file, collect_python_files
Design for Future Rust Acceleration
The parser/ layer is intentionally isolated for future rewrite in Rust via PyO3 / maturin. The rule engine and reporting layers have no parser dependency, enabling drop-in acceleration without breaking the public API.
CI/CD Integration
GitHub Actions
- name: SparkGuardian Audit
run: sparkguardian analyze . --json > report.json
# Exit 1 on CRITICAL → build fails automatically
See .github/workflows/sparkguardian.yml for full example.
GitLab CI
sparkguardian:
script: sparkguardian analyze . --json > report.json
See ci/gitlab-ci.yml.
Jenkins
See ci/Jenkinsfile for a pipeline with HTML report archiving.
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_ast.py -v
# With coverage
pytest tests/ --cov=sparkguardian --cov-report=term-missing
373 tests | 0 failures | 0 warnings | 0 ruff errors | ~0.5s
| Test file | Covers |
|---|---|
test_ai.py |
LLM cache, session, structured output, remediation, prompts |
test_ast.py |
All 16 static rules, analyzer edge cases, ignore system |
test_ci.py |
Exit codes (0/1/2) |
test_config.py |
TOML, YAML, defaults, rule overrides |
test_file_utils.py |
backup_file, collect_python_files |
test_ignore.py |
cy:ignore inline and block directives |
test_models.py |
Issue, Location, AnalysisReport, ExecutionNode |
test_node_extractors.py |
NodeType classification, plan parser |
test_plan.py |
Dynamic rules, execution plan analysis |
test_rag.py |
Knowledge base entries, keyword search, config params, sources |
test_registry.py |
Rule registry, @register decorator |
test_reporting.py |
JSON, Markdown, HTML reporters |
test_rewriter.py |
CST transformer, diff generator |
test_safe_rewriter.py |
SafeRewriter dry-run, apply, backup |
test_scoring.py |
0–100 score engine, penalty matrix |
test_cost.py |
Cloud cost estimator (4 clouds, workload scaling, severity weights) |
test_sarif.py |
SARIF v2.1.0 format compliance, severity mapping |
test_history.py |
Audit history, CSV export, regression detection, persistence |
Supported Models (Local AI)
| Model | Pull command | Notes |
|---|---|---|
| DeepSeek-Coder 6.7B | ollama pull deepseek-coder:6.7b |
Recommended |
| CodeLlama 7B | ollama pull codellama:7b |
Good code quality |
| Llama 3 8B | ollama pull llama3:8b |
General purpose |
| Mistral 7B | ollama pull mistral:7b |
Fast inference |
SparkGuardian auto-selects the first available model from this list.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sparkguardian-0.1.0.tar.gz.
File metadata
- Download URL: sparkguardian-0.1.0.tar.gz
- Upload date:
- Size: 77.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ffecd4968a184d3c6bafbfdde21f9a62fe68b9b735aadebf51382b654dc5121e
|
|
| MD5 |
7da7ee6c6596c072cfde8fd6d7b99788
|
|
| BLAKE2b-256 |
f4c2b2b04966590ec18aaabfb7213491684069bfa645abcaef5f4a6c272489c0
|
File details
Details for the file sparkguardian-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sparkguardian-0.1.0-py3-none-any.whl
- Upload date:
- Size: 69.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5209e1a201c3200cfe78b88d74b4225c44b8b5bfdbb3ad9d8bbaaea2e1085f2a
|
|
| MD5 |
16f3c38c5a7a62f2eba087a01a54ac28
|
|
| BLAKE2b-256 |
21e5c964eca629c851bc7a99c79c09c4e69ebb08d884fab7028f32319741a632
|