LLM-based library for attributing AI training job failures and recommending an auto-resume policy

LogSage

LogSage is an LLM-powered toolkit that analyzes AI training job logs, attributes root causes, and recommends an auto-resume policy to reduce wasted GPU time. It provides:

  • Error extraction and de-duplication from SLURM job logs
  • Root-cause attribution (e.g., hardware, configuration, memory, communication)
  • Action recommendations (restart immediately vs stop) with justification
  • Optional node isolation hints for hardware-related failures

Description

  • Problem: Training jobs fail for many reasons (hardware, networking, configuration, data). Manual root-cause analysis is slow and wastes GPU hours.
  • Solution: LogSage combines log parsing with NVIDIA NIM to extract error patterns, attribute likely causes, and recommend whether to restart or stop the job, with an explanation.
  • How it works (high level):
    1. Receive job logs (client-provided)
    2. Extract and cluster error lines; remove noise
    3. Attribute errors with an LLM using structured prompts and heuristics
    4. Recommend restart/stop and, when applicable, suggest temporal isolation of suspect nodes
  • Benefits: Faster triage, increased data center availability, reduced GPU downtime, and improved error extraction coverage by leveraging LLMs.
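
Step 2 above (extract and cluster error lines, remove noise) can be illustrated with a small standard-library sketch. This is not LogSage's implementation: the keyword pattern, the `normalize` and `extract_errors` helpers, and the sample log are all hypothetical and exist only to show the idea of collapsing repeated errors into one template.

```python
import re
from collections import Counter

# Hypothetical keyword filter; LogSage's real extraction rules are not documented here.
ERROR_PATTERN = re.compile(r"error|exception|traceback|failed|fatal|timeout", re.IGNORECASE)


def normalize(line: str) -> str:
    """Mask volatile tokens (hex addresses, numbers) so repeated errors share one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<N>", line)
    return re.sub(r"\s+", " ", line).strip()


def extract_errors(log_text: str) -> Counter:
    """Keep only lines that look like errors and count them per normalized template."""
    counts = Counter()
    for line in log_text.splitlines():
        if ERROR_PATTERN.search(line):
            counts[normalize(line)] += 1
    return counts


# Made-up log lines for illustration only.
sample_log = """\
[rank 0] step 100 loss 2.31
[rank 3] RuntimeError: CUDA error: an illegal memory access was encountered
[rank 7] RuntimeError: CUDA error: an illegal memory access was encountered
[rank 3] NCCL watchdog timeout after 1800000 ms
"""

errors = extract_errors(sample_log)
for template, count in errors.most_common():
    print(count, template)
```

Masking rank numbers and durations before counting is what lets the same failure reported by many ranks appear once, which keeps the input handed to the attribution step short.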

Quickstart

Prerequisites:

  • Python >= 3.9
  • Poetry (recommended) or pip

Setup (using Poetry):

make install
# or
poetry install --with dev --all-extras

Install via pip:

python3 -m venv myenv
source myenv/bin/activate
pip install -U logsage --index-url=https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple --extra-index-url https://pypi.org/simple

Components

  • logsage/auto_resume_policy/
    • CLI utilities to analyze local log files
    • Recommendations engine that suggests job auto-resume policies: 'STOP', 'RESTART', 'TEMPORAL-ISOLATION+RESTART'
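
To make the three policy values concrete, here is a minimal sketch of how an attributed category might be mapped to a policy. The mapping table, the `Recommendation` class, and the `recommend` function are assumptions made for illustration; the actual engine relies on an LLM plus heuristics and may decide differently.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from the attribution categories named in this README
# to the three documented policy strings. The real engine may differ.
POLICY_BY_CATEGORY = {
    "hardware": "TEMPORAL-ISOLATION+RESTART",
    "communication": "RESTART",
    "memory": "STOP",
    "configuration": "STOP",
}


@dataclass
class Recommendation:
    policy: str
    justification: str
    isolate_nodes: list = field(default_factory=list)


def recommend(category: str, suspect_nodes=None) -> Recommendation:
    """Pick a policy for a category; unknown categories fall back to STOP (an assumption)."""
    nodes = list(suspect_nodes or [])
    policy = POLICY_BY_CATEGORY.get(category.lower(), "STOP")
    # Without suspect nodes there is nothing to isolate, so degrade to a plain restart.
    if policy == "TEMPORAL-ISOLATION+RESTART" and not nodes:
        policy = "RESTART"
    isolate = nodes if policy == "TEMPORAL-ISOLATION+RESTART" else []
    justification = f"category={category.lower()!r} mapped to {policy}"
    return Recommendation(policy, justification, isolate)


rec = recommend("hardware", ["node-017"])
print(rec.policy, rec.isolate_nodes)
```

Defaulting to STOP for unrecognized categories is a conservative choice in this sketch: it avoids burning GPU hours on a restart loop when the cause is unclear.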

Configuration

Configuration is managed via logsage/auto_resume_policy/config.py (Pydantic settings). Key variables:

  • NVIDIA_API_KEY (required in production): API key for NVIDIA AI Endpoints, used for all LLM calls
  • FAST_API_ROOT_PATH (optional): Root path when running behind a proxy
  • DEBUG (optional): true/false (defaults to true locally; set to false in production)
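
The sketch below shows one way the three variables could be read and validated. It deliberately uses only the standard library rather than Pydantic, so it is not the contents of config.py. The `Settings` dataclass, the `load_settings` helper, the empty default root path, and the rule that a missing key is an error whenever DEBUG is false are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    nvidia_api_key: Optional[str]
    fast_api_root_path: str
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables from an environment-like mapping."""
    # DEBUG defaults to true, matching the local default described above.
    debug = env.get("DEBUG", "true").strip().lower() in {"1", "true", "yes"}
    api_key = env.get("NVIDIA_API_KEY") or None
    # Assumption: DEBUG=false means production, where the API key is mandatory.
    if not debug and api_key is None:
        raise RuntimeError("NVIDIA_API_KEY must be set when DEBUG is false")
    return Settings(
        nvidia_api_key=api_key,
        fast_api_root_path=env.get("FAST_API_ROOT_PATH", ""),
        debug=debug,
    )


# Example with an explicit mapping instead of the real process environment.
settings = load_settings({"DEBUG": "false", "NVIDIA_API_KEY": "example-key"})
print(settings.debug, settings.fast_api_root_path == "")
```

Passing the mapping as a parameter keeps the loader easy to test without touching the real process environment.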

Testing

Run the test suite:

make test
# or
poetry run pytest

Coverage reports are configured in pyproject.toml.

Roadmap and Project Status

  • Streamed/async log ingestion paths
  • Integration with log collectors such as Loki
  • Expanded attribution categories and guardrails
  • Improved test coverage and end-to-end examples

Additional Resources

Internal dashboards (if applicable):

  • Grafana (LogSage): https://grafana.nvidia.com/d/aeutclepcu41sf/logsage?orgId=290
  • Kibana/ES example index (sandbox): https://gpuwa.nvidia.com/elasticsearch/df-sandbox-ohazai-logsage-test2-202508/_search

Contributing

For setup, development guidelines, and versioning information, see the Contributing Guide.
