
Audit/evaluate SLURM HPC jobs by parsing static artifacts (stdout/stderr/log/config files) and producing human- and machine-readable reports.


jps-slurm-utils



🚀 Overview

jps-slurm-job-audit is an offline SLURM job audit tool that analyzes job artifacts without requiring cluster access. It provides:

  • Automated failure detection: Detects OOM errors, timeouts, segfaults, Python/Java/R exceptions, filesystem errors, and more
  • Metadata extraction: Parses SBATCH directives and job information from scripts and filenames
  • Resource utilization tracking: Extracts metrics from seff/sacct outputs when available
  • Structured reporting: Generates JSON reports with evidence snippets and remediation guidance
  • Batch processing: Analyze hundreds of jobs and generate aggregate summaries
  • Exit codes: 0=OK, 1=WARN, 2=FAIL, 3+=tool error
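To make the metadata extraction item above concrete, here is a minimal sketch of how SBATCH directives could be pulled out of a job script. The regular expression and the function name `parse_sbatch_directives` are illustrative assumptions, not the tool's actual implementation.

```python
import re

# Matches lines such as "#SBATCH --mem=64G" or "#SBATCH -p compute".
# This pattern is an illustrative assumption, not the tool's own rule.
SBATCH_RE = re.compile(r"^#SBATCH\s+(-{1,2}[\w-]+)(?:[=\s]+(\S+))?")


def parse_sbatch_directives(script_text: str) -> dict:
    """Collect SBATCH options from a job script into a dict.

    Options without a value (bare flags) map to None.
    """
    directives = {}
    for line in script_text.splitlines():
        match = SBATCH_RE.match(line.strip())
        if match:
            option, value = match.groups()
            directives[option.lstrip("-")] = value
    return directives


example_script = """#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --mem=64G
#SBATCH --time=12:00:00
echo "running"
"""

print(parse_sbatch_directives(example_script))
```

All values come back as strings; converting fields such as `nodes` to integers would be a separate normalization step.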

Features

  • ✅ Offline analysis - No cluster access needed, works with copied artifacts
  • ✅ Pattern-based detection - Built-in rules for common HPC failure modes
  • ✅ Streaming scanner - Efficiently handles large log files without loading into memory
  • ✅ Evidence capture - Stores relevant log excerpts with line numbers and context
  • ✅ Configurable discovery - Flexible glob/regex patterns for file matching
  • ✅ Rich terminal output - Pretty tables and color-coded summaries
  • ✅ Machine-readable reports - JSON/CSV outputs for downstream analytics
  • ✅ Extensible - Plugin architecture for custom detectors (future milestone)
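The streaming scanner and evidence capture features can be pictured with a short sketch. It reads a log lazily, keeps a small rolling window of preceding lines, and emits a record per match. The pattern is the one shown in the `match_pattern` field of the sample report; the function names and the record layout are assumptions made for illustration, not the tool's internal API.

```python
import re
from collections import deque
from typing import Iterable, Iterator

# Pattern copied from the "match_pattern" field in the sample report.
ERROR_RE = re.compile(r"(?i)^\w+Error:")


def scan_lines(lines: Iterable[str], context: int = 2) -> Iterator[dict]:
    """Yield one evidence record per matching line, reading lazily."""
    before = deque(maxlen=context)  # rolling window of preceding lines
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if ERROR_RE.search(line):
            yield {
                "line_start": line_no,
                "excerpt": line,
                "context_before": list(before),
            }
        before.append(line)


def scan_file(path: str, context: int = 2) -> Iterator[dict]:
    """Scan a log file without loading it fully into memory."""
    # Iterating over the file object reads one line at a time.
    with open(path, encoding="utf-8", errors="replace") as handle:
        yield from scan_lines(handle, context)


sample = [
    "Traceback (most recent call last):\n",
    '  File "/path/to/application.py", line 156, in process_data\n',
    "    result = transform(data)\n",
    "ValueError: invalid literal for int() with base 10: 'NaN'\n",
]
for record in scan_lines(sample):
    print(record["line_start"], record["excerpt"])
```

Because only `context` lines are buffered at any time, memory use stays flat regardless of log size.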

Example Usage

Audit a single job directory:

jps-slurm-job-audit single --job-dir /path/to/job/artifacts

Output:

INFO: Starting audit of job directory: /path/to/job/artifacts
INFO: Phase 1: Discovering artifacts...
INFO: Discovered 5 files in /path/to/job/artifacts
INFO: Phase 2: Extracting metadata...
INFO: Phase 3: Detecting failure patterns...
INFO: Found 2 issues across 2 files
INFO: Phase 4: Extracting metrics...
INFO: Phase 5: Computing final status...
INFO: Audit complete. Status: FAIL, Score: 40

✓ Audit complete!
Report saved to: /tmp/user/jps-slurm-job-audit/20240115_142330/report.json

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Field        โ”ƒ Value          โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ Job ID       โ”‚ 123456         โ”‚
โ”‚ Job Name     โ”‚ example_job    โ”‚
โ”‚ Status       โ”‚ FAIL           โ”‚
โ”‚ Findings     โ”‚ 2              โ”‚
โ”‚ Files Scannedโ”‚ 5              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Findings:
  • Python Exception: Detected python exception in example-123456.out (1 occurrences)
  • Out of Memory: Detected out of memory in slurm-654321.out (3 occurrences)

Audit multiple jobs in batch:

# Create a file with job directory paths
cat > job_dirs.txt <<EOF
/path/to/job1
/path/to/job2
/path/to/job3
EOF

jps-slurm-job-audit batch --path-list job_dirs.txt --outdir ./results
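For more than a handful of jobs, the path list can be generated rather than typed. The sketch below assumes that each job lives in its own immediate subdirectory of a common parent; that layout and the helper name `write_path_list` are hypothetical. The output format (one directory path per line) follows the heredoc example above.

```python
from pathlib import Path


def write_path_list(parent: Path, out_file: Path) -> int:
    """Write one absolute job-directory path per line; return the count.

    Assumes every immediate subdirectory of `parent` is a job directory.
    """
    job_dirs = sorted(p.resolve() for p in parent.iterdir() if p.is_dir())
    out_file.write_text("".join(f"{d}\n" for d in job_dirs), encoding="utf-8")
    return len(job_dirs)
```

Calling `write_path_list(Path("/path/to/jobs"), Path("job_dirs.txt"))` would produce a file that can be passed to `--path-list`.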

Advanced filtering:

# Only scan specific file types
jps-slurm-job-audit single --job-dir ./job --glob "*.out"

# Include/exclude patterns
jps-slurm-job-audit single --job-dir ./job \
  --include "slurm-.*\.(out|err)" \
  --exclude "backup"

# Custom output location
jps-slurm-job-audit single --job-dir ./job \
  --outdir ./my-reports \
  --logfile ./my-reports/audit.log

# Verbose logging
jps-slurm-job-audit single --job-dir ./job --verbose

# Quiet mode (no console output)
jps-slurm-job-audit single --job-dir ./job --quiet
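The `--include` and `--exclude` values above are regular expressions. As a rough mental model, the sketch below filters a list of file names with the same two patterns. It assumes "search anywhere in the name" semantics and that exclusion wins over inclusion; the tool's actual matching rules may differ, and `select_files` is a hypothetical helper.

```python
import re


def select_files(names, include=None, exclude=None):
    """Keep names matching `include` (if given) and not matching `exclude`."""
    inc = re.compile(include) if include else None
    exc = re.compile(exclude) if exclude else None
    return [
        name
        for name in names
        if (inc is None or inc.search(name))
        and (exc is None or not exc.search(name))
    ]


names = ["slurm-123456.out", "slurm-123456.err", "backup/slurm-1.out", "run.log"]
print(select_files(names, include=r"slurm-.*\.(out|err)", exclude="backup"))
```

Under these assumptions, only the two `slurm-123456` files survive: `run.log` fails the include pattern and the `backup/` entry is excluded.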

Batch with filtering:

# Only show failed jobs
jps-slurm-job-audit batch --path-list jobs.txt --only FAIL
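Since the tool reports its result through the exit code (0=OK, 1=WARN, 2=FAIL, 3+=tool error), a wrapper script can branch on it. The following sketch uses only the `single`, `--job-dir`, and `--quiet` options shown above; the `TOOL_ERROR` label and both function names are assumptions introduced here.

```python
import subprocess


def status_from_exit_code(code: int) -> str:
    """Map the documented exit codes to a status label.

    Anything other than 0, 1, or 2 (including a negative code from a
    signal) is treated as a tool error. "TOOL_ERROR" is a label chosen
    for this sketch.
    """
    return {0: "OK", 1: "WARN", 2: "FAIL"}.get(code, "TOOL_ERROR")


def audit_job(job_dir: str) -> str:
    """Run the single-job audit and return its status label."""
    result = subprocess.run(
        ["jps-slurm-job-audit", "single", "--job-dir", job_dir, "--quiet"],
        check=False,  # non-zero codes are expected, so do not raise
    )
    return status_from_exit_code(result.returncode)
```

Passing `check=False` matters here: with `check=True`, a WARN or FAIL result would raise `CalledProcessError` instead of returning a status.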

Report Structure

The JSON report includes:

{
  "tool_version": "0.1.0",
  "run_timestamp": "2024-01-15T14:23:30",
  "job_metadata": {
    "job_id": "123456",
    "job_name": "example_job",
    "partition": "compute",
    "nodes": 2,
    "ntasks": 16,
    "cpus_per_task": 2,
    "mem": "64G",
    "time_limit": "12:00:00"
  },
  "discovered_files": [...],
  "findings": [
    {
      "id": "python_exception_example-123456.out",
      "category": "Python Exception",
      "severity": "ERROR",
      "message": "Detected python exception in example-123456.out (1 occurrences)",
      "confidence": 0.9,
      "remediation": "Review Python traceback and fix the reported error in your code.",
      "evidence": [
        {
          "file": "/path/to/example-123456.out",
          "line_start": 12,
          "excerpt": "ValueError: invalid literal for int() with base 10: 'NaN'",
          "match_pattern": "(?i)^\\w+Error:",
          "context_before": [
            "  File \"/path/to/application.py\", line 156, in process_data",
            "    result = transform(data)"
          ]
        }
      ]
    }
  ],
  "metrics": {
    "walltime_used": "00:45:23",
    "memory_utilized": "58.2 GB",
    "cpu_efficiency": 87.5
  },
  "final_status": "FAIL",
  "score": 40,
  "rules_used": ["built-in"]
}
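Because the report is plain JSON, downstream scripts can consume it with the standard library alone. This sketch relies only on the field names in the sample report above; the helper names `load_report` and `summarize_report` are hypothetical.

```python
import json
from collections import Counter


def load_report(path: str) -> dict:
    """Read a report.json file produced by an audit run."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_report(report: dict) -> dict:
    """Reduce a report to its status, score, and finding counts."""
    findings = report.get("findings", [])
    return {
        "job_id": report.get("job_metadata", {}).get("job_id"),
        "final_status": report.get("final_status"),
        "score": report.get("score"),
        "by_severity": dict(Counter(f.get("severity") for f in findings)),
        "categories": sorted({f["category"] for f in findings if "category" in f}),
    }


# A trimmed copy of the sample report, used here in place of a real file.
report = {
    "job_metadata": {"job_id": "123456"},
    "findings": [{"category": "Python Exception", "severity": "ERROR"}],
    "final_status": "FAIL",
    "score": 40,
}
print(summarize_report(report))
```

For a real run, `summarize_report(load_report("report.json"))` would give the same shape of summary.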

📦 Installation

From source:

git clone https://github.com/jai-python3/jps-slurm-utils
cd jps-slurm-utils
make install

Using pip (from PyPI):

pip install jps-slurm-utils

For development:

make install-dev

🧪 Development

# Format and lint code
make fix && make format && make lint

# Run tests
make test

# Run tests with coverage
make test-cov

# Run all checks
make all

๐Ÿ—บ๏ธ Roadmap

The current release implements Milestones 0-3 from the SRS (software requirements specification):

  • ✅ Project skeleton with Typer CLI
  • ✅ Artifact discovery and metadata normalization
  • ✅ Error/failure classification with evidence capture
  • ✅ Resource utilization inference with anomaly detection

Future milestones:

  • Milestone 4: External YAML rule packs
  • Milestone 5: Batch aggregation analytics
  • Milestone 6: Job comparison/diff command
  • Milestone 7: Plugin architecture

๐Ÿค Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass (make test)
  5. Submit a pull request

📜 License

MIT License © Jaideep Sundaram
