A tool for analyzing Spark History Server data and identifying optimization opportunities

These details have not been verified by PyPI

Project description

Spark Analyzer

Analyze your Apache Spark applications for performance bottlenecks, resource inefficiencies, and optimization opportunities. Get actionable insights from your Spark History Server data with detailed stage-level analysis and recommendations.

Quick Start - No Registration: Run spark-analyzer analyze --save-local for immediate analysis. Enhanced Analysis (optional): Obtain your ID for cost estimation and deeper analysis of your data.

Features

Workflow Classification: Automatically categorizes Extract, Transform, and Load operations in Spark jobs
Stage-Level Analysis: Detailed performance metrics and resource utilization analysis
Storage Format Detection: Identifies Parquet, Delta, Hudi, and Iceberg usage patterns
Privacy Controls: Configurable data hashing for masking sensitive information
Local & Browser Modes: Support for direct access and browser-based Spark History Server interfaces
Offline Analysis: Run completely offline with --save-local flag for local data processing and reporting. No data leaves your environment.
Report Generation: Excel output with detailed optimization recommendations

What You'll Get

Sample Spark Analysis Report Real analysis showing 34.8% compute waste and 45-65% performance improvement opportunities

Ready to analyze your Spark applications?

Quick Start

# Install
pip install spark-analyzer

# Configure connection to your Spark History Server
spark-analyzer configure

# Local analysis (no registration required)
spark-analyzer analyze --save-local

# Generate Excel report
spark-analyzer report --input analysis_data.json --output my_report.xlsx

# Enhanced analysis with cost insights (if you have an ID)
spark-analyzer analyze

Installation

# Create and activate a virtual environment (recommended)
python3 -m venv myenv
source myenv/bin/activate  # On Windows: myenv\Scripts\activate

# Install the package
pip install spark-analyzer

Note: If you encounter "externally-managed-environment" errors, use a virtual environment as shown above.

Commands

The tool uses a subcommand structure for organization:

# Main commands
spark-analyzer                   # Default: runs analyze command
spark-analyzer analyze           # Explicit analyze command
spark-analyzer configure         # Configuration wizard
spark-analyzer report            # Generate acceleration reports
spark-analyzer readme            # Show the README documentation

# Get help
spark-analyzer --help            # Show general help
spark-analyzer analyze --help    # Show analyze command help
spark-analyzer configure --help  # Show configure command help
spark-analyzer report --help     # Show report command help

Usage

Basic Analysis

# Run with saved configuration
spark-analyzer

# Run with explicit analyze command
spark-analyzer analyze

# Run in local mode
spark-analyzer analyze --local

# Run in browser mode
spark-analyzer analyze --browser

# Run with local data saving (no upload)
spark-analyzer analyze --save-local

# Run with privacy protection
spark-analyzer analyze --opt-out name,description,details

Configuration

# Basic configuration
spark-analyzer configure

First-time users: The tool automatically runs the configuration wizard to set up connection mode and Spark History Server URL. Cost Estimator ID is optional for enhanced features.

Subsequent runs: Uses saved configuration and runs immediately.

Configuration file: Settings are saved to ~/.spark_analyzer/config.ini and can be manually edited if needed.

Runtime options (not saved in configuration):

--opt-out: Privacy protection settings
--save-local: Local data saving without upload

Documentation: Use spark-analyzer readme to view the full documentation in your terminal.

Report Generation

# Generate Excel report (no data leaves your environment)
spark-analyzer report --input your_data.json --output report.xlsx

# Batch processing multiple files
spark-analyzer report --input "*.json" --output-dir ./reports/

# Get help
spark-analyzer report --help

The Excel report provides comprehensive Spark performance analysis. Optional: Send your report to gtm@onehouse.ai for advanced optimization recommendations or technical consultation.

Connection Modes

The tool supports two connection modes:

Local Mode

Use when you have direct access to the Spark History Server (port forwarding or SSH tunnel).

Configure:
```
spark-analyzer configure
```

Manual Configuration (optional): Edit ~/.spark_analyzer/config.ini:

[server]
# Standard installation
base_url = http://localhost:18080

# Port forwarding
# base_url = http://localhost:8080/onehouse-spark-code/history-server

Run:

spark-analyzer analyze --save-local  # Local analysis
# or
spark-analyzer analyze               # Enhanced analysis (prompts for ID if needed)

Browser Mode

Use when accessing Spark History Server through a browser (e.g., EMR interface).

Configure:
```
spark-analyzer configure
```
Runtime Configuration: URLs and cookies are collected at runtime for security.
Get Browser Cookies:
- Open Spark History Server in browser
- Open developer tools (F12)
- Go to Network tab and click any request
- Find "Cookie" header in Request Headers
- Copy the entire cookie string

Run:

spark-analyzer analyze --save-local  # Local analysis
# or
spark-analyzer analyze               # Enhanced analysis (prompts for ID if needed)

Supported Platforms

The tool supports various Spark environments including EMR, Databricks, and other Spark History Server implementations. The configuration wizard will guide you through the specific setup steps for your environment.

Data Collection

Collected Information

Application-Level Data

Application ID, name, start/end times
Executor count and core allocation
CPU and memory usage metrics

Stage-Level Data

Stage ID, attempt information, names, descriptions
Task counts and executor assignments
Performance metrics (duration, CPU time)
Execution plans and stack traces

Derived Metrics

Workflow type classification (Extract, Transform, Load)
Storage format detection
Resource utilization analysis
Performance bottleneck identification

Privacy Controls

Use --opt-out to hash sensitive fields:

# Hash all potentially sensitive fields
spark-analyzer analyze --opt-out name,description,details

# Hash only stage names
spark-analyzer analyze --opt-out name

# Hash only stage descriptions
spark-analyzer analyze --opt-out description

Available Options:

name: Stage names → name_hash_[numeric_hash]
description: Stage descriptions → description_hash_[numeric_hash]
details: Execution plans → details_hash_[numeric_hash]

Example:

Original: "stage_name": "show at Console.scala:14"
Hashed: "stage_name": "name_hash_1002607082777652347"

Data Usage

Purpose: Performance analysis and optimization recommendations
Processing: Remote analysis data is processed by Onehouse's cloud infrastructure to generate personalized reports. With --save-local mode, all processing happens locally - no data is transmitted.
Security: Data transmitted over HTTPS by configuring cost estimator ID is stored in Onehouse's secure cloud environment
Retention: Analysis data is deleted once the report is emailed to you

Enterprise Considerations

Data Collection

Uses Python's built-in hash() function for consistent hashing
Tool does not collect any PII. Only stated data from Spark is processed/collected.
Stage content may contain file paths, table names, or query fragments. Turn on privacy options if they are sensitive.
Anonymous usage data collected via Scarf for tool improvement and analytics

Report Output

Excel Report Structure

Application Level: Overall metrics and acceleration summary
Workflow Level: Breakdown by workflow type (Extract, Transform, Load, Metadata)
Stage Level: Individual stage analysis with optimization recommendations

Troubleshooting

Common Issues

Externally-managed-environment: Use virtual environment
Connection Errors: Verify Spark History Server URL and port
Authentication Issues: Ensure fresh cookies for browser mode
Permission Errors: Check file write permissions for local mode

Debug Mode

# Enable debug logging
spark-analyzer configure --debug
spark-analyzer analyze --debug

Support

For questions, issues, or technical support, contact Onehouse support at gtm@onehouse.ai.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.1.31

Feb 16, 2026

0.1.30

Jan 6, 2026

0.1.29

Nov 19, 2025

0.1.28

Sep 5, 2025

0.1.27

Aug 29, 2025

0.1.26

Aug 28, 2025

0.1.25

Aug 28, 2025

0.1.24

Aug 28, 2025

This version

0.1.23

Aug 22, 2025

0.1.22

Aug 12, 2025

0.1.21

Aug 8, 2025

0.1.20

Aug 6, 2025

0.1.19

Aug 4, 2025

0.1.18

Jul 25, 2025

0.1.17

Jul 25, 2025

0.1.16

Jul 23, 2025

0.1.15

Jul 22, 2025

0.1.14

Jul 21, 2025

0.1.13

Jul 18, 2025

0.1.12

Jul 17, 2025

0.1.11

Jun 24, 2025

0.1.10

Jun 24, 2025

0.1.9

Jun 24, 2025

0.1.8

Jun 24, 2025

0.1.7

Jun 10, 2025

0.1.6

May 21, 2025

0.1.5

May 21, 2025

0.1.4

May 21, 2025

0.1.3

May 20, 2025

0.1.2

May 20, 2025

0.1.1

May 20, 2025

0.1.0

May 20, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spark_analyzer-0.1.23.tar.gz (447.0 kB view details)

Uploaded Aug 22, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

spark_analyzer-0.1.23-py3-none-any.whl (453.6 kB view details)

Uploaded Aug 22, 2025 Python 3

File details

Details for the file spark_analyzer-0.1.23.tar.gz.

File metadata

Download URL: spark_analyzer-0.1.23.tar.gz
Upload date: Aug 22, 2025
Size: 447.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for spark_analyzer-0.1.23.tar.gz
Algorithm	Hash digest
SHA256	`b89f1a362b4866ac99d1f20c749b1181105c0013e8a258dddb07c7c9851ec68a`
MD5	`7345ff685ddbea3385fcc95f9ebca933`
BLAKE2b-256	`3d959de3e1aea4042d38a227551fea226d5231f4f53535f2077864af8f4170ed`

See more details on using hashes here.

File details

Details for the file spark_analyzer-0.1.23-py3-none-any.whl.

File metadata

Download URL: spark_analyzer-0.1.23-py3-none-any.whl
Upload date: Aug 22, 2025
Size: 453.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for spark_analyzer-0.1.23-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cdd736a260a6899c2e173aeead7cd7b466ad76d8e74a1217a74eaf821d403a81`
MD5	`16dfa8087add1b856c3cd359e4a18b4e`
BLAKE2b-256	`fec10a94f6d89d53c5d73cc98c7d9efee62534d5f512ed12c1f36e785f784889`

See more details on using hashes here.

spark-analyzer 0.1.23

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

Spark Analyzer

Features

What You'll Get

Quick Start

Installation

Commands

Usage

Basic Analysis

Configuration

Report Generation

Connection Modes

Local Mode

Browser Mode

Supported Platforms

Data Collection

Collected Information

Application-Level Data

Stage-Level Data

Derived Metrics

Privacy Controls

Data Usage

Enterprise Considerations

Data Collection

Report Output

Excel Report Structure

Troubleshooting

Common Issues

Debug Mode

Support

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes