Production-grade PySpark auditing, optimization, and governance framework — static + dynamic analysis with local AI assistance

These details have not been verified by PyPI

Project links

Project description

SparkGuardian

Production-grade PySpark auditing, optimization, and governance framework.

SparkGuardian detects PySpark anti-patterns, analyzes execution plans, generates prioritized remediation plans, and integrates into CI/CD pipelines — all 100% offline, no cloud APIs, no source code exfiltration.

Features

Feature	Description
Static AST Analysis	Deterministic analysis via `libcst` — never executes user code
Execution Plan Analysis	Parses `df.explain()` output, detects shuffles / CartesianProduct / AQE
Safe Auto-Refactoring	Rewrites code with backup, dry-run preview, interactive mode
Local AI Assistant	Ollama-powered explanations, structured analysis, remediation plans
AI Chat Session	Interactive multi-turn chat about findings with history management
RAG Knowledge Base	Keyword-based Spark knowledge retrieval — no external vector DB
Performance Scoring	0–100 score across reliability, cost, and clarity dimensions
CI/CD Integration	Exit codes 0/1/2 — GitHub Actions, GitLab CI, Jenkins
Rule Governance	Enable/disable rules, override severity, per-project config (TOML/YAML)
Ignore System	`# cy:ignore CY008` inline and block suppression with audit trail
Rich Reporting	Terminal (colored), JSON, Markdown, HTML (dark theme), SARIF v2.1.0
Cloud Cost Estimator	Translates anti-patterns into monthly/yearly $$$ waste on AWS EMR / Databricks / GCP / Azure
Audit History	Longitudinal tracking with regression detection + CSV export for academic analysis

Installation

# Core (PyPI)
pip install sparkguardian

# With local AI (Ollama integration)
pip install "sparkguardian[ai]"

# Development install (from source)
git clone https://github.com/sparkguardian/sparkguardian
cd sparkguardian
pip install -e ".[dev]"

Verify install:

sparkguardian --version          # SparkGuardian v0.1.0
sparkguardian --help             # list 9 commands
sparkguardian analyze app.py

Requirements: Python 3.11 or higher. Tested on Python 3.11, 3.12, 3.13.

Build from source

pip install build
python -m build                  # creates dist/*.whl + *.tar.gz
twine check dist/*               # validate before upload
pip install dist/sparkguardian-0.1.0-py3-none-any.whl

Quick Start

# Analyze a file
sparkguardian analyze app.py

# Analyze a directory
sparkguardian analyze src/

# Get JSON output (for CI integration)
sparkguardian analyze app.py --json

# Generate HTML report
sparkguardian analyze app.py --html report.html

# Preview auto-fixes (no file changes)
sparkguardian fix app.py --dry-run

# Apply safe auto-fixes (with backup)
sparkguardian fix app.py

# Analyze Spark execution plan
sparkguardian explain-plan plan.txt

# List all available rules
sparkguardian list-rules

# AI-powered remediation plan
sparkguardian remediation app.py

# Interactive AI chat about findings
sparkguardian chat app.py

# Check AI (Ollama) availability
sparkguardian ai-status

# Cloud cost estimation — translate issues to $/month
sparkguardian cost app.py
sparkguardian cost app.py --data-tb 5 --nodes 30 --cloud databricks

# SARIF v2.1.0 output for GitHub Code Scanning
sparkguardian analyze app.py --sarif report.sarif

# Audit history — track score evolution over time
sparkguardian analyze app.py --history          # record snapshot
sparkguardian history                            # view evolution
sparkguardian history --csv evolution.csv       # academic analysis export

Exit Codes

Code	Meaning	CI behavior
`0`	No issues	Build passes
`1`	CRITICAL issue found	Build fails
`2`	Warnings only	Build passes (configurable)

Rules

Static Rules — AST Analysis (16 rules)

Rule	Title	Severity	Impact	Autofix
CY001	`.collect()` — driver memory risk	HIGH	RELIABILITY
CY002	`.toPandas()` — large local materialization	HIGH	RELIABILITY
CY003	`.count()` abuse — full scan triggered	MEDIUM	COST
CY004	Spark action inside loop	CRITICAL	RELIABILITY
CY005	`.show()` in production code	LOW	RELIABILITY
CY008	`repartition(1)` detected	HIGH	COST	✓
CY009	Chained `withColumn()` (5+)	MEDIUM	COST
CY010	Implicit join — missing `how=`	LOW	CLARITY
CY011	`select("*")` — broad column selection	LOW	CLARITY
CY012	`cache()`/`persist()` — verify necessity	LOW	COST
CY013	Repeated scan of same data source	HIGH	COST
CY014	`filter()` after `groupBy()` — reorder	MEDIUM	COST
CY016	`crossJoin()` — explicit CartesianProduct	HIGH	COST
CY017	`distinct()` — full shuffle	MEDIUM	COST
CY018	`orderBy()` without `limit()` — global sort	MEDIUM	COST
CY020	Python UDF — use native SQL functions	HIGH	COST

Dynamic Rules — Execution Plan (5 rules)

Rule	Title	Severity	Impact
CY050	`CartesianProduct` node detected	CRITICAL	COST
CY051	Excessive shuffle (`Exchange`) nodes	HIGH	COST
CY052	`SortMergeJoin` — broadcast opportunity	MEDIUM	COST
CY053	Window function without `partitionBy`	HIGH	COST
CY054	AQE (`AdaptiveSparkPlan`) not active	MEDIUM	COST

Ignore System

Suppress rules inline or by block, with full audit trail in reports.

# Inline — suppress on this line only
df.repartition(1)  # cy:ignore CY008

# Block — suppress across multiple lines
# cy:ignore-start CY008
df_single = df.repartition(1)
df_single.write.parquet("output/")
# cy:ignore-end CY008

Ignored findings still appear in reports as suppressed (use --show-ignored).

Configuration

Create .sparkguardian.toml at project root:

[sparkguardian]
fail_on_critical = true
fail_on_high = false
warning_threshold = 10
ignore_paths = ["tests/", "examples/"]

[rules.CY003]
enabled = true
severity = "MEDIUM"

[rules.CY008]
enabled = true
severity = "HIGH"

# Disable a rule project-wide
[rules.CY005]
enabled = false

YAML format also supported (.sparkguardian.yaml).

Local AI Setup

All inference runs locally via Ollama. No data leaves your machine.

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a supported model
ollama pull deepseek-coder:6.7b   # recommended
# or: ollama pull codellama:7b
# or: ollama pull llama3:8b
# or: ollama pull mistral:7b

# Start Ollama server
ollama serve

# Verify setup
sparkguardian ai-status

AI Commands

# AI explanations with each finding
sparkguardian analyze app.py --ai

# AI-powered prioritized remediation plan
sparkguardian remediation app.py
sparkguardian remediation app.py --json   # machine-readable

# Interactive chat about findings
sparkguardian chat app.py

AI Architecture

LocalLLM
├── LRU response cache (128 entries, hit-rate tracking)
├── System prompt injection (role anchor)
├── Structured JSON output parsing (AIAnalysis dataclass)
├── Streaming support (token-by-token terminal output)
└── Auto model selection (deepseek-coder → codellama → llama3 → mistral)

AISession (multi-turn chat)
├── Conversation history with automatic trimming
├── Context injection (RAG knowledge base)
└── Reset / turn counting

RemediationPlanner
├── LLM-driven: priority_order, effort_estimate, quick_wins, blockers
└── Heuristic fallback (deterministic, works without Ollama)

SparkKnowledgeBase (RAG)
├── 24 entries from official Apache Spark 4.1.1 documentation
│   Sources: tuning.html · sql-performance-tuning.html · rdd-programming-guide.html
├── query(question, rule_id) — lookup by rule_id or keyword scoring
├── query_with_config(rule_id) → (content, config_params[]) — includes spark.* params
├── get_config_params(rule_id) → list[str] — exact config keys with defaults
└── sources() → list[str] — traceable documentation URLs per entry

Cloud Cost Estimator

Translates detected anti-patterns into estimated monthly/yearly USD waste on major cloud providers. Cost models are derived from official pricing (AWS EMR, Databricks, GCP Dataproc, Azure Synapse) and published Spark performance research.

sparkguardian cost examples/bad_pipeline.py
# 💸 Estimated waste: $2,407/month ($28,884/year)

sparkguardian cost prod_pipeline/ --data-tb 50 --nodes 100 --cloud databricks
# 💸 Estimated waste: $156,420/month ($1,877,040/year)

sparkguardian cost app.py --json   # machine-readable for FinOps integration

Pricing methodology (2025-2026 baseline):

AWS EMR on-demand: ~$0.063/vCPU-hour
Databricks DBU: $0.40/DBU-hour (Jobs Compute) — ×1.3 vs EMR
GCP Dataproc: ×0.9 vs EMR
Azure Synapse: ×1.1 vs EMR
S3: $0.023/GB stored + $0.005/1k GET requests
Cross-AZ egress: $0.09/GB

Each rule has a calibrated cost model factoring in severity, occurrence count (sub-linear), and workload scale.

Audit History (Longitudinal Analysis)

Track code quality evolution across commits, releases, or daily runs. Supports CSV export for academic and research analysis.

# Record a snapshot after analysis
sparkguardian analyze . --history

# View evolution
sparkguardian history
# ↘️ Latest delta: -8.5
# ⚠️ REGRESSION DETECTED (delta > -5)

# Export to CSV for plotting / longitudinal study
sparkguardian history --csv evolution.csv

# Filter by file
sparkguardian history --target src/pipeline.py

Snapshots stored in .sparkguardian-history.json with git commit/branch metadata for traceability.

SARIF Output (GitHub Code Scanning)

Native integration with GitHub Code Scanning, GitLab SAST, Azure DevOps, and SonarQube:

sparkguardian analyze . --sarif report.sarif

In .github/workflows/sparkguardian.yml:

- name: SparkGuardian
  run: sparkguardian analyze . --sarif sparkguardian.sarif
- uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: sparkguardian.sarif

Findings appear directly in the Security tab of GitHub with severity-graded annotations on pull requests.

Architecture

sparkguardian/
├── cli.py                    # Typer CLI (9 commands)
├── config.py                 # TOML/YAML loader
├── constants.py              # Exit codes, defaults
│
├── parser/
│   ├── ast_analyzer.py       # libcst static analysis
│   ├── plan_analyzer.py      # Execution plan analysis
│   ├── execution_parser.py   # Plan parser
│   ├── ignore_handler.py     # cy:ignore engine
│   └── node_extractors.py    # NodeType classification
│
├── rules/
│   ├── base.py               # BaseRule ABC
│   ├── registry.py           # @register decorator system
│   ├── static_rules.py       # 16 AST rules
│   ├── dynamic_rules.py      # 5 plan rules
│   ├── categories.py
│   └── severity.py
│
├── models/
│   ├── issue.py              # Issue, Location, Severity, ImpactType
│   ├── report.py             # AnalysisReport
│   └── execution_node.py     # ExecutionNode, NodeType
│
├── rewriter/
│   ├── safe_rewriter.py      # SafeRewriter (backup + apply)
│   ├── transformer.py        # libcst CSTTransformers
│   └── diff_generator.py     # Unified diff output
│
├── ai/
│   ├── __init__.py           # Public exports: LocalLLM, AISession, SparkKnowledgeBase…
│   ├── local_llm.py          # LocalLLM (cache + streaming + structured output)
│   ├── session.py            # AISession (multi-turn history, context injection)
│   ├── remediation.py        # RemediationPlanner (LLM + heuristic fallback)
│   ├── structured.py         # AIAnalysis, RemediationPlan dataclasses + JSON parser
│   ├── cache.py              # LRU AIResponseCache (128 entries, hit-rate tracking)
│   ├── rag.py                # SparkKnowledgeBase (24 entries, official Spark 4.1.1 docs)
│   └── prompts.py            # Versioned prompt library (system, structured, chat, remediation)
│
├── reporting/
│   ├── terminal_reporter.py  # Rich colored output
│   ├── json_reporter.py
│   ├── markdown_reporter.py
│   ├── html_reporter.py      # Dark-theme HTML
│   └── sarif_reporter.py     # SARIF v2.1.0 (GitHub Code Scanning)
│
├── scoring/
│   └── score_engine.py       # 0–100 performance score
│
├── cost/                     # NEW
│   └── estimator.py          # Cloud cost models (EMR/Databricks/GCP/Azure)
│
├── history/                  # NEW
│   └── tracker.py            # Longitudinal audit snapshots + CSV export
│
├── ci/
│   └── exit_codes.py         # resolve_exit_code()
│
└── utils/
    ├── logger.py
    └── file_utils.py         # backup_file, collect_python_files

Design for Future Rust Acceleration

The parser/ layer is intentionally isolated for future rewrite in Rust via PyO3 / maturin. The rule engine and reporting layers have no parser dependency, enabling drop-in acceleration without breaking the public API.

CI/CD Integration

GitHub Actions

- name: SparkGuardian Audit
  run: sparkguardian analyze . --json > report.json
  # Exit 1 on CRITICAL → build fails automatically

See .github/workflows/sparkguardian.yml for full example.

GitLab CI

sparkguardian:
  script: sparkguardian analyze . --json > report.json

See ci/gitlab-ci.yml.

Jenkins

See ci/Jenkinsfile for a pipeline with HTML report archiving.

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_ast.py -v

# With coverage
pytest tests/ --cov=sparkguardian --cov-report=term-missing

373 tests | 0 failures | 0 warnings | 0 ruff errors | ~0.5s

Test file	Covers
`test_ai.py`	LLM cache, session, structured output, remediation, prompts
`test_ast.py`	All 16 static rules, analyzer edge cases, ignore system
`test_ci.py`	Exit codes (0/1/2)
`test_config.py`	TOML, YAML, defaults, rule overrides
`test_file_utils.py`	backup_file, collect_python_files
`test_ignore.py`	cy:ignore inline and block directives
`test_models.py`	Issue, Location, AnalysisReport, ExecutionNode
`test_node_extractors.py`	NodeType classification, plan parser
`test_plan.py`	Dynamic rules, execution plan analysis
`test_rag.py`	Knowledge base entries, keyword search, config params, sources
`test_registry.py`	Rule registry, @register decorator
`test_reporting.py`	JSON, Markdown, HTML reporters
`test_rewriter.py`	CST transformer, diff generator
`test_safe_rewriter.py`	SafeRewriter dry-run, apply, backup
`test_scoring.py`	0–100 score engine, penalty matrix
`test_cost.py`	Cloud cost estimator (4 clouds, workload scaling, severity weights)
`test_sarif.py`	SARIF v2.1.0 format compliance, severity mapping
`test_history.py`	Audit history, CSV export, regression detection, persistence

Supported Models (Local AI)

Model	Pull command	Notes
DeepSeek-Coder 6.7B	`ollama pull deepseek-coder:6.7b`	Recommended
CodeLlama 7B	`ollama pull codellama:7b`	Good code quality
Llama 3 8B	`ollama pull llama3:8b`	General purpose
Mistral 7B	`ollama pull mistral:7b`	Fast inference

SparkGuardian auto-selects the first available model from this list.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.1

May 19, 2026

This version

0.1.0

May 19, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sparkguardian-0.1.0.tar.gz (77.4 kB view details)

Uploaded May 19, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sparkguardian-0.1.0-py3-none-any.whl (69.7 kB view details)

Uploaded May 19, 2026 Python 3

File details

Details for the file sparkguardian-0.1.0.tar.gz.

File metadata

Download URL: sparkguardian-0.1.0.tar.gz
Upload date: May 19, 2026
Size: 77.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for sparkguardian-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ffecd4968a184d3c6bafbfdde21f9a62fe68b9b735aadebf51382b654dc5121e`
MD5	`7da7ee6c6596c072cfde8fd6d7b99788`
BLAKE2b-256	`f4c2b2b04966590ec18aaabfb7213491684069bfa645abcaef5f4a6c272489c0`

See more details on using hashes here.

File details

Details for the file sparkguardian-0.1.0-py3-none-any.whl.

File metadata

Download URL: sparkguardian-0.1.0-py3-none-any.whl
Upload date: May 19, 2026
Size: 69.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for sparkguardian-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5209e1a201c3200cfe78b88d74b4225c44b8b5bfdbb3ad9d8bbaaea2e1085f2a`
MD5	`16f3c38c5a7a62f2eba087a01a54ac28`
BLAKE2b-256	`21e5c964eca629c851bc7a99c79c09c4e69ebb08d884fab7028f32319741a632`

See more details on using hashes here.

sparkguardian 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SparkGuardian

Features

Installation

Build from source

Quick Start

Exit Codes

Rules

Static Rules — AST Analysis (16 rules)

Dynamic Rules — Execution Plan (5 rules)

Ignore System

Configuration

Local AI Setup

AI Commands

AI Architecture

Cloud Cost Estimator

Audit History (Longitudinal Analysis)

SARIF Output (GitHub Code Scanning)

Architecture

Design for Future Rust Acceleration

CI/CD Integration

GitHub Actions

GitLab CI

Jenkins

Development

Supported Models (Local AI)

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes