Skip to main content

Production-grade PySpark auditing, optimization, and governance framework — static + dynamic analysis with local AI assistance

Project description

SparkGuardian

Production-grade PySpark auditing, optimization, and governance framework.

Tests Rules Lint SARIF Python License

SparkGuardian detects PySpark anti-patterns, analyzes execution plans, generates prioritized remediation plans, and integrates into CI/CD pipelines — all 100% offline, no cloud APIs, no source code exfiltration.


Features

Feature Description
Static AST Analysis Deterministic analysis via libcst — never executes user code
Execution Plan Analysis Parses df.explain() output, detects shuffles / CartesianProduct / AQE
Safe Auto-Refactoring Rewrites code with backup, dry-run preview, interactive mode
Local AI Assistant Ollama-powered explanations, structured analysis, remediation plans
AI Chat Session Interactive multi-turn chat about findings with history management
RAG Knowledge Base Keyword-based Spark knowledge retrieval — no external vector DB
Performance Scoring 0–100 score across reliability, cost, and clarity dimensions
CI/CD Integration Exit codes 0/1/2 — GitHub Actions, GitLab CI, Jenkins
Rule Governance Enable/disable rules, override severity, per-project config (TOML/YAML)
Ignore System # cy:ignore CY008 inline and block suppression with audit trail
Rich Reporting Terminal (colored), JSON, Markdown, HTML (dark theme), SARIF v2.1.0
Cloud Cost Estimator Translates anti-patterns into monthly/yearly $$$ waste on AWS EMR / Databricks / GCP / Azure
Audit History Longitudinal tracking with regression detection + CSV export for academic analysis

Installation

# Core (PyPI)
pip install sparkguardian

# With local AI (Ollama integration)
pip install "sparkguardian[ai]"

# Development install (from source)
git clone https://github.com/MohamedDataX/sparkguardian
cd sparkguardian
pip install -e ".[dev]"

Verify install:

sparkguardian --version          # SparkGuardian v0.1.1
sparkguardian --help             # list 9 commands
sparkguardian analyze app.py

Requirements: Python 3.11 or higher. Tested on Python 3.11, 3.12, 3.13.

Build from source

pip install build
python -m build                  # creates dist/*.whl + *.tar.gz
twine check dist/*               # validate before upload
pip install dist/sparkguardian-0.1.1-py3-none-any.whl

Quick Start

# Analyze a file
sparkguardian analyze app.py

# Analyze a directory
sparkguardian analyze src/

# Get JSON output (for CI integration)
sparkguardian analyze app.py --json

# Generate HTML report
sparkguardian analyze app.py --html report.html

# Preview auto-fixes (no file changes)
sparkguardian fix app.py --dry-run

# Apply safe auto-fixes (with backup)
sparkguardian fix app.py

# Analyze Spark execution plan
sparkguardian explain-plan plan.txt

# List all available rules
sparkguardian list-rules

# AI-powered remediation plan
sparkguardian remediation app.py

# Interactive AI chat about findings
sparkguardian chat app.py

# Check AI (Ollama) availability
sparkguardian ai-status

# Cloud cost estimation — translate issues to $/month
sparkguardian cost app.py
sparkguardian cost app.py --data-tb 5 --nodes 30 --cloud databricks

# SARIF v2.1.0 output for GitHub Code Scanning
sparkguardian analyze app.py --sarif report.sarif

# Audit history — track score evolution over time
sparkguardian analyze app.py --history          # record snapshot
sparkguardian history                            # view evolution
sparkguardian history --csv evolution.csv       # academic analysis export

Exit Codes

Code Meaning CI behavior
0 No issues Build passes
1 CRITICAL issue found Build fails
2 Warnings only Build passes (configurable)

Rules

Static Rules — AST Analysis (16 rules)

Rule Title Severity Impact Autofix
CY001 .collect() — driver memory risk HIGH RELIABILITY
CY002 .toPandas() — large local materialization HIGH RELIABILITY
CY003 .count() abuse — full scan triggered MEDIUM COST
CY004 Spark action inside loop CRITICAL RELIABILITY
CY005 .show() in production code LOW RELIABILITY
CY008 repartition(1) detected HIGH COST
CY009 Chained withColumn() (5+) MEDIUM COST
CY010 Implicit join — missing how= LOW CLARITY
CY011 select("*") — broad column selection LOW CLARITY
CY012 cache()/persist() — verify necessity LOW COST
CY013 Repeated scan of same data source HIGH COST
CY014 filter() after groupBy() — reorder MEDIUM COST
CY016 crossJoin() — explicit CartesianProduct HIGH COST
CY017 distinct() — full shuffle MEDIUM COST
CY018 orderBy() without limit() — global sort MEDIUM COST
CY020 Python UDF — use native SQL functions HIGH COST

Dynamic Rules — Execution Plan (5 rules)

Rule Title Severity Impact
CY050 CartesianProduct node detected CRITICAL COST
CY051 Excessive shuffle (Exchange) nodes HIGH COST
CY052 SortMergeJoin — broadcast opportunity MEDIUM COST
CY053 Window function without partitionBy HIGH COST
CY054 AQE (AdaptiveSparkPlan) not active MEDIUM COST

Ignore System

Suppress rules inline or by block, with full audit trail in reports.

# Inline — suppress on this line only
df.repartition(1)  # cy:ignore CY008

# Block — suppress across multiple lines
# cy:ignore-start CY008
df_single = df.repartition(1)
df_single.write.parquet("output/")
# cy:ignore-end CY008

Ignored findings still appear in reports as suppressed (use --show-ignored).


Configuration

Create .sparkguardian.toml at project root:

[sparkguardian]
fail_on_critical = true
fail_on_high = false
warning_threshold = 10
ignore_paths = ["tests/", "examples/"]

[rules.CY003]
enabled = true
severity = "MEDIUM"

[rules.CY008]
enabled = true
severity = "HIGH"

# Disable a rule project-wide
[rules.CY005]
enabled = false

YAML format also supported (.sparkguardian.yaml).


Local AI Setup

All inference runs locally via Ollama. No data leaves your machine.

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a supported model
ollama pull deepseek-coder:6.7b   # recommended
# or: ollama pull codellama:7b
# or: ollama pull llama3:8b
# or: ollama pull mistral:7b

# Start Ollama server
ollama serve

# Verify setup
sparkguardian ai-status

AI Commands

# AI explanations with each finding
sparkguardian analyze app.py --ai

# AI-powered prioritized remediation plan
sparkguardian remediation app.py
sparkguardian remediation app.py --json   # machine-readable

# Interactive chat about findings
sparkguardian chat app.py

AI Architecture

LocalLLM
├── LRU response cache (128 entries, hit-rate tracking)
├── System prompt injection (role anchor)
├── Structured JSON output parsing (AIAnalysis dataclass)
├── Streaming support (token-by-token terminal output)
└── Auto model selection (deepseek-coder → codellama → llama3 → mistral)

AISession (multi-turn chat)
├── Conversation history with automatic trimming
├── Context injection (RAG knowledge base)
└── Reset / turn counting

RemediationPlanner
├── LLM-driven: priority_order, effort_estimate, quick_wins, blockers
└── Heuristic fallback (deterministic, works without Ollama)

SparkKnowledgeBase (RAG)
├── 24 entries from official Apache Spark 4.1.1 documentation
│   Sources: tuning.html · sql-performance-tuning.html · rdd-programming-guide.html
├── query(question, rule_id) — lookup by rule_id or keyword scoring
├── query_with_config(rule_id) → (content, config_params[]) — includes spark.* params
├── get_config_params(rule_id) → list[str] — exact config keys with defaults
└── sources() → list[str] — traceable documentation URLs per entry

Cloud Cost Estimator

Translates detected anti-patterns into estimated monthly/yearly USD waste on major cloud providers. Cost models are derived from official pricing (AWS EMR, Databricks, GCP Dataproc, Azure Synapse) and published Spark performance research.

sparkguardian cost examples/bad_pipeline.py
# 💸 Estimated waste: $2,407/month ($28,884/year)

sparkguardian cost prod_pipeline/ --data-tb 50 --nodes 100 --cloud databricks
# 💸 Estimated waste: $156,420/month ($1,877,040/year)

sparkguardian cost app.py --json   # machine-readable for FinOps integration

Pricing methodology (2025-2026 baseline):

  • AWS EMR on-demand: ~$0.063/vCPU-hour
  • Databricks DBU: $0.40/DBU-hour (Jobs Compute) — ×1.3 vs EMR
  • GCP Dataproc: ×0.9 vs EMR
  • Azure Synapse: ×1.1 vs EMR
  • S3: $0.023/GB stored + $0.005/1k GET requests
  • Cross-AZ egress: $0.09/GB

Each rule has a calibrated cost model factoring in severity, occurrence count (sub-linear), and workload scale.


Audit History (Longitudinal Analysis)

Track code quality evolution across commits, releases, or daily runs. Supports CSV export for academic and research analysis.

# Record a snapshot after analysis
sparkguardian analyze . --history

# View evolution
sparkguardian history
# ↘️ Latest delta: -8.5
# ⚠️ REGRESSION DETECTED (delta > -5)

# Export to CSV for plotting / longitudinal study
sparkguardian history --csv evolution.csv

# Filter by file
sparkguardian history --target src/pipeline.py

Snapshots stored in .sparkguardian-history.json with git commit/branch metadata for traceability.


SARIF Output (GitHub Code Scanning)

Native integration with GitHub Code Scanning, GitLab SAST, Azure DevOps, and SonarQube:

sparkguardian analyze . --sarif report.sarif

In .github/workflows/sparkguardian.yml:

- name: SparkGuardian
  run: sparkguardian analyze . --sarif sparkguardian.sarif
- uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: sparkguardian.sarif

Findings appear directly in the Security tab of GitHub with severity-graded annotations on pull requests.


Architecture

sparkguardian/
├── cli.py                    # Typer CLI (9 commands)
├── config.py                 # TOML/YAML loader
├── constants.py              # Exit codes, defaults
│
├── parser/
│   ├── ast_analyzer.py       # libcst static analysis
│   ├── plan_analyzer.py      # Execution plan analysis
│   ├── execution_parser.py   # Plan parser
│   ├── ignore_handler.py     # cy:ignore engine
│   └── node_extractors.py    # NodeType classification
│
├── rules/
│   ├── base.py               # BaseRule ABC
│   ├── registry.py           # @register decorator system
│   ├── static_rules.py       # 16 AST rules
│   ├── dynamic_rules.py      # 5 plan rules
│   ├── categories.py
│   └── severity.py
│
├── models/
│   ├── issue.py              # Issue, Location, Severity, ImpactType
│   ├── report.py             # AnalysisReport
│   └── execution_node.py     # ExecutionNode, NodeType
│
├── rewriter/
│   ├── safe_rewriter.py      # SafeRewriter (backup + apply)
│   ├── transformer.py        # libcst CSTTransformers
│   └── diff_generator.py     # Unified diff output
│
├── ai/
│   ├── __init__.py           # Public exports: LocalLLM, AISession, SparkKnowledgeBase…
│   ├── local_llm.py          # LocalLLM (cache + streaming + structured output)
│   ├── session.py            # AISession (multi-turn history, context injection)
│   ├── remediation.py        # RemediationPlanner (LLM + heuristic fallback)
│   ├── structured.py         # AIAnalysis, RemediationPlan dataclasses + JSON parser
│   ├── cache.py              # LRU AIResponseCache (128 entries, hit-rate tracking)
│   ├── rag.py                # SparkKnowledgeBase (24 entries, official Spark 4.1.1 docs)
│   └── prompts.py            # Versioned prompt library (system, structured, chat, remediation)
│
├── reporting/
│   ├── terminal_reporter.py  # Rich colored output
│   ├── json_reporter.py
│   ├── markdown_reporter.py
│   ├── html_reporter.py      # Dark-theme HTML
│   └── sarif_reporter.py     # SARIF v2.1.0 (GitHub Code Scanning)
│
├── scoring/
│   └── score_engine.py       # 0–100 performance score
│
├── cost/                     # NEW
│   └── estimator.py          # Cloud cost models (EMR/Databricks/GCP/Azure)
│
├── history/                  # NEW
│   └── tracker.py            # Longitudinal audit snapshots + CSV export
│
├── ci/
│   └── exit_codes.py         # resolve_exit_code()
│
└── utils/
    ├── logger.py
    └── file_utils.py         # backup_file, collect_python_files

Design for Future Rust Acceleration

The parser/ layer is intentionally isolated for future rewrite in Rust via PyO3 / maturin. The rule engine and reporting layers have no parser dependency, enabling drop-in acceleration without breaking the public API.


CI/CD Integration

GitHub Actions

- name: SparkGuardian Audit
  run: sparkguardian analyze . --json > report.json
  # Exit 1 on CRITICAL → build fails automatically

See .github/workflows/sparkguardian.yml for full example.

GitLab CI

sparkguardian:
  script: sparkguardian analyze . --json > report.json

See ci/gitlab-ci.yml.

Jenkins

See ci/Jenkinsfile for a pipeline with HTML report archiving.


Development

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_ast.py -v

# With coverage
pytest tests/ --cov=sparkguardian --cov-report=term-missing

373 tests | 0 failures | 0 warnings | 0 ruff errors | ~0.5s

Test file Covers
test_ai.py LLM cache, session, structured output, remediation, prompts
test_ast.py All 16 static rules, analyzer edge cases, ignore system
test_ci.py Exit codes (0/1/2)
test_config.py TOML, YAML, defaults, rule overrides
test_file_utils.py backup_file, collect_python_files
test_ignore.py cy:ignore inline and block directives
test_models.py Issue, Location, AnalysisReport, ExecutionNode
test_node_extractors.py NodeType classification, plan parser
test_plan.py Dynamic rules, execution plan analysis
test_rag.py Knowledge base entries, keyword search, config params, sources
test_registry.py Rule registry, @register decorator
test_reporting.py JSON, Markdown, HTML reporters
test_rewriter.py CST transformer, diff generator
test_safe_rewriter.py SafeRewriter dry-run, apply, backup
test_scoring.py 0–100 score engine, penalty matrix
test_cost.py Cloud cost estimator (4 clouds, workload scaling, severity weights)
test_sarif.py SARIF v2.1.0 format compliance, severity mapping
test_history.py Audit history, CSV export, regression detection, persistence

Supported Models (Local AI)

Model Pull command Notes
DeepSeek-Coder 6.7B ollama pull deepseek-coder:6.7b Recommended
CodeLlama 7B ollama pull codellama:7b Good code quality
Llama 3 8B ollama pull llama3:8b General purpose
Mistral 7B ollama pull mistral:7b Fast inference

SparkGuardian auto-selects the first available model from this list.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sparkguardian-0.1.1.tar.gz (77.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sparkguardian-0.1.1-py3-none-any.whl (69.8 kB view details)

Uploaded Python 3

File details

Details for the file sparkguardian-0.1.1.tar.gz.

File metadata

  • Download URL: sparkguardian-0.1.1.tar.gz
  • Upload date:
  • Size: 77.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for sparkguardian-0.1.1.tar.gz
Algorithm Hash digest
SHA256 9d13baac592ec0b038ba7b10d611be775eb2b7faf1ee5de9354dc1fad73ee3fa
MD5 8c3c7e2fcbffb6bc3aef145828dac960
BLAKE2b-256 832f34f54b943e85c751a285a1dfb65e576ccd17b1e2f8f9d5b9f9d67ab1edbd

See more details on using hashes here.

File details

Details for the file sparkguardian-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: sparkguardian-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 69.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for sparkguardian-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 e012fd9cc98a6fbb4ab937a7b920b0053361f6f33bb50618f0d5141027abc8a4
MD5 a71c0af44446c528a1c055ab4a7607d9
BLAKE2b-256 d61c0400730ea79fe381835d35779dc285b6d5eee543fb9821ed8f6857fd3060

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page