Audit/evaluate SLURM HPC jobs by parsing static artifacts (stdout/stderr/log/config files) and producing human- and machine-readable reports.
jps-slurm-utils
Overview
jps-slurm-job-audit is an offline SLURM job audit tool that analyzes job artifacts without requiring cluster access. It provides:
- Automated failure detection: Flags out-of-memory (OOM) errors, timeouts, segfaults, Python/Java/R exceptions, filesystem errors, and more
- Metadata extraction: Parses SBATCH directives and job information from scripts and filenames
- Resource utilization tracking: Extracts metrics from seff/sacct outputs when available
- Structured reporting: Generates JSON reports with evidence snippets and remediation guidance
- Batch processing: Analyze hundreds of jobs and generate aggregate summaries
- Exit codes: 0=OK, 1=WARN, 2=FAIL, 3+=tool error
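The exit codes above make the tool easy to drive from a script. The following is a minimal sketch, assuming the `jps-slurm-job-audit` command is on `PATH`; the helper names `status_from_exit_code` and `audit`, and the `TOOL_ERROR` label, are illustrative and not part of the package.

```python
import subprocess


def status_from_exit_code(code: int) -> str:
    """Map a process exit code to the status names listed above."""
    mapping = {0: "OK", 1: "WARN", 2: "FAIL"}
    # Anything else (3 or more, or a negative code from a signal) is treated
    # as a problem with the tool run itself. "TOOL_ERROR" is our own label.
    return mapping.get(code, "TOOL_ERROR")


def audit(job_dir: str) -> str:
    """Run the CLI on one job directory and return the mapped status."""
    result = subprocess.run(
        ["jps-slurm-job-audit", "single", "--job-dir", job_dir],
        check=False,  # non-zero codes are expected outcomes, not exceptions
    )
    return status_from_exit_code(result.returncode)
```

Calling `audit("/path/to/job/artifacts")` would then return one of the four labels, which a wrapper script can use to decide whether to alert or retry.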
Features
- Offline analysis: no cluster access needed; works with copied artifacts
- Pattern-based detection: built-in rules for common HPC failure modes
- Streaming scanner: handles large log files without loading them into memory
- Evidence capture: stores relevant log excerpts with line numbers and context
- Configurable discovery: flexible glob/regex patterns for file matching
- Rich terminal output: formatted tables and color-coded summaries
- Machine-readable reports: JSON/CSV outputs for downstream analytics
- Extensible: plugin architecture for custom detectors (future milestone)
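To illustrate what streaming scanning with evidence capture can look like, here is a rough sketch. It is not the tool's actual implementation: the function name `scan_lines` and the record layout are assumptions modeled on the sample report, and the regex is the one shown in that report's `match_pattern` field.

```python
import re
from collections import deque
from typing import Iterable, Iterator

# Pattern taken from the sample report's "match_pattern" field.
ERROR_PATTERN = re.compile(r"(?i)^\w+Error:")


def scan_lines(lines: Iterable[str], context: int = 2) -> Iterator[dict]:
    """Yield one evidence record per matching line, reading lazily."""
    before: deque = deque(maxlen=context)  # rolling window of prior lines
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if ERROR_PATTERN.search(line):
            yield {
                "line_start": line_number,
                "excerpt": line,
                "context_before": list(before),
            }
        before.append(line)
```

Because an open file object is itself an iterable of lines, passing it straight to `scan_lines` keeps memory use bounded by the context window rather than by the size of the log.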
Example Usage
Audit a single job directory:
jps-slurm-job-audit single --job-dir /path/to/job/artifacts
Output:
INFO: Starting audit of job directory: /path/to/job/artifacts
INFO: Phase 1: Discovering artifacts...
INFO: Discovered 5 files in /path/to/job/artifacts
INFO: Phase 2: Extracting metadata...
INFO: Phase 3: Detecting failure patterns...
INFO: Found 2 issues across 2 files
INFO: Phase 4: Extracting metrics...
INFO: Phase 5: Computing final status...
INFO: Audit complete. Status: FAIL, Score: 40
Audit complete!
Report saved to: /tmp/user/jps-slurm-job-audit/20240115_142330/report.json
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Field         ┃ Value       ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Job ID        │ 123456      │
│ Job Name      │ example_job │
│ Status        │ FAIL        │
│ Findings      │ 2           │
│ Files Scanned │ 5           │
└───────────────┴─────────────┘
Findings:
• Python Exception: Detected python exception in example-123456.out (1 occurrences)
• Out of Memory: Detected out of memory in slurm-654321.out (3 occurrences)
Audit multiple jobs in batch:
# Create a file with job directory paths
cat > job_dirs.txt <<EOF
/path/to/job1
/path/to/job2
/path/to/job3
EOF
jps-slurm-job-audit batch --path-list job_dirs.txt --outdir ./results
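When job directories all sit under one parent, the path list can be generated instead of typed by hand. This is a small sketch, assuming the list file holds one directory path per line as in the example above; `write_path_list` is a hypothetical helper, not something the package provides.

```python
from pathlib import Path


def write_path_list(parent: Path, list_file: Path) -> int:
    """Write one absolute job-directory path per line; return the count."""
    # Only immediate subdirectories are listed; plain files are skipped.
    job_dirs = sorted(p.resolve() for p in parent.iterdir() if p.is_dir())
    list_file.write_text("".join(f"{d}\n" for d in job_dirs))
    return len(job_dirs)
```

The resulting file can be passed to `--path-list` exactly like the hand-written `job_dirs.txt`.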
Advanced filtering:
# Only scan specific file types
jps-slurm-job-audit single --job-dir ./job --glob "*.out"
# Include/exclude patterns
jps-slurm-job-audit single --job-dir ./job \
  --include "slurm-.*\.(out|err)" \
  --exclude "backup"
# Custom output location
jps-slurm-job-audit single --job-dir ./job \
  --outdir ./my-reports \
  --logfile ./my-reports/audit.log
# Verbose logging
jps-slurm-job-audit single --job-dir ./job --verbose
# Quiet mode (no console output)
jps-slurm-job-audit single --job-dir ./job --quiet
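The include/exclude options above take regular expressions. How the tool applies them internally is not documented here, so the following only sketches one plausible reading: a file is kept when it matches the include pattern and does not match the exclude pattern, with both patterns searched anywhere in the name. The function `filter_names` is hypothetical.

```python
import re
from typing import Iterable, List, Optional


def filter_names(
    names: Iterable[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Keep names matching `include` (if given) and not matching `exclude`."""
    inc = re.compile(include) if include else None
    exc = re.compile(exclude) if exclude else None
    kept = []
    for name in names:
        if inc and not inc.search(name):
            continue  # include pattern given but not found in this name
        if exc and exc.search(name):
            continue  # exclude pattern found, so drop this name
        kept.append(name)
    return kept
```

Trying patterns against a list of file names this way can be a quick check before running a full audit, though the tool's own matching rules may differ.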
Batch with filtering:
# Only show failed jobs
jps-slurm-job-audit batch --path-list jobs.txt --only FAIL
Report Structure
The JSON report includes:
{
  "tool_version": "0.1.0",
  "run_timestamp": "2024-01-15T14:23:30",
  "job_metadata": {
    "job_id": "123456",
    "job_name": "example_job",
    "partition": "compute",
    "nodes": 2,
    "ntasks": 16,
    "cpus_per_task": 2,
    "mem": "64G",
    "time_limit": "12:00:00"
  },
  "discovered_files": [...],
  "findings": [
    {
      "id": "python_exception_example-123456.out",
      "category": "Python Exception",
      "severity": "ERROR",
      "message": "Detected python exception in example-123456.out (1 occurrences)",
      "confidence": 0.9,
      "remediation": "Review Python traceback and fix the reported error in your code.",
      "evidence": [
        {
          "file": "/path/to/example-123456.out",
          "line_start": 12,
          "excerpt": "ValueError: invalid literal for int() with base 10: 'NaN'",
          "match_pattern": "(?i)^\\w+Error:",
          "context_before": [
            " File \"/path/to/application.py\", line 156, in process_data",
            " result = transform(data)"
          ]
        }
      ]
    }
  ],
  "metrics": {
    "walltime_used": "00:45:23",
    "memory_utilized": "58.2 GB",
    "cpu_efficiency": 87.5
  },
  "final_status": "FAIL",
  "score": 40,
  "rules_used": ["built-in"]
}
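A report in this shape is straightforward to consume downstream. The sketch below pulls out the headline fields and counts findings by severity; the field names come from the sample above, while the `summarize` helper and its output layout are assumptions made for illustration.

```python
import json
from collections import Counter


def summarize(report: dict) -> dict:
    """Collect the headline fields and count findings per severity."""
    findings = report.get("findings", [])
    return {
        "job_id": report.get("job_metadata", {}).get("job_id"),
        "final_status": report.get("final_status"),
        "score": report.get("score"),
        "by_severity": dict(Counter(f.get("severity") for f in findings)),
        "evidence_lines": [
            (e.get("file"), e.get("line_start"))
            for f in findings
            for e in f.get("evidence", [])
        ],
    }


# A trimmed copy of the sample report, inlined so the sketch runs on its own.
sample = json.loads("""
{
  "job_metadata": {"job_id": "123456"},
  "findings": [
    {
      "severity": "ERROR",
      "evidence": [{"file": "/path/to/example-123456.out", "line_start": 12}]
    }
  ],
  "final_status": "FAIL",
  "score": 40
}
""")
print(summarize(sample))
```

For a real run, replace the inlined sample with the contents of the `report.json` file whose path the tool prints, loaded via `json.load`.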
📦 Installation
From source:
git clone https://github.com/jai-python3/jps-slurm-utils
cd jps-slurm-utils
make install
Using pip (when published):
pip install jps-slurm-utils
For development:
make install-dev
🧪 Development
# Format and lint code
make fix && make format && make lint
# Run tests
make test
# Run tests with coverage
make test-cov
# Run all checks
make all
🗺️ Roadmap
The current release implements Milestones 0-3 from the software requirements specification (SRS):
- Project skeleton with Typer CLI
- Artifact discovery and metadata normalization
- Error/failure classification with evidence capture
- Resource utilization inference with anomaly detection
Future milestones:
- Milestone 4: External YAML rule packs
- Milestone 5: Batch aggregation analytics
- Milestone 6: Job comparison/diff command
- Milestone 7: Plugin architecture
🤝 Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass (make test)
- Submit a pull request
License
MIT License © Jaideep Sundaram