LLM-based library for attributing AI training job failures and recommending an auto-resume policy
LogSage
LogSage is an LLM-powered toolkit that analyzes AI training job logs, attributes root causes, and recommends an auto-resume policy to reduce wasted GPU time. It provides:
- Error extraction and de-duplication from SLURM job logs
- Root-cause attribution (e.g., hardware, configuration, memory, communication)
- Action recommendations (restart immediately vs stop) with justification
- Optional node isolation hints for hardware-related failures
Table of Contents
- Description
- Quickstart
- Components
- Configuration
- Testing
- Roadmap and Project Status
- Additional Resources
- Contributing
Description
- Problem: Training jobs fail for many reasons (hardware, networking, configuration, data). Manual root-cause analysis is slow and wastes GPU hours.
- Solution: LogSage uses log parsing + NVIDIA NIM to extract error patterns, attribute likely causes, and recommend whether to restart or stop, with an explanation.
- How it works (high level):
  - Receive job logs (client-provided)
  - Extract and cluster error lines; remove noise
  - Attribute errors with an LLM using structured prompts and heuristics
  - Recommend restart/stop and, when applicable, suggest temporal isolation of suspect nodes
- Benefits: Faster triage, increased data center availability, reduced GPU downtime, and improved error extraction coverage by leveraging LLMs.
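The extract-and-cluster step above can be sketched roughly as follows. This is a minimal illustration, not LogSage's actual implementation: the keyword pattern, the `normalize` helper, and `extract_error_templates` are hypothetical names, and the masking rules are assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical keyword list; LogSage's real extraction rules are not documented here.
ERROR_PATTERN = re.compile(
    r"error|exception|traceback|fatal|failed|nccl|cuda|oom", re.IGNORECASE
)


def normalize(line: str) -> str:
    """Mask volatile tokens so repeated errors collapse into one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)  # addresses, handles
    line = re.sub(r"\d+", "<N>", line)               # ranks, steps, node ids
    return line.strip()


def extract_error_templates(log_text: str) -> list[tuple[str, int]]:
    """Return (template, count) pairs for error-like lines, most frequent first."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        if ERROR_PATTERN.search(line):   # drop noise such as progress lines
            counts[normalize(line)] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = (
        "step 100 loss 2.31\n"
        "rank 3: CUDA error: out of memory at 0x7f3a\n"
        "rank 7: CUDA error: out of memory at 0x7f9b\n"
        "NCCL timeout on node gpu-012\n"
    )
    for template, count in extract_error_templates(sample):
        print(count, template)
```

Masking numbers and addresses before counting is what lets the two per-rank out-of-memory lines collapse into a single template, so the de-duplicated templates (rather than every raw line) could then be passed to the LLM attribution step.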
Quickstart
Prerequisites:
- Python >= 3.9
- Poetry (recommended) or pip
Setup (using Poetry):
make install
# or
poetry install --with dev --all-extras
Install via pip:
python3 -m venv myenv
source myenv/bin/activate
pip install -U logsage --index-url=https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple --extra-index-url https://pypi.org/simple
Components
- logsage/auto_resume_policy/
  - CLI utilities to analyze local log files
  - Recommendations engine that suggests job auto-resume policies: 'STOP', 'RESTART', 'TEMPORAL-ISOLATION+RESTART'
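To make the three policy values concrete, here is a hedged sketch of how an attributed category might map to a recommendation. The policy strings and category names come from this README; the mapping itself, the `Recommendation` dataclass, and the `recommend` function are illustrative assumptions and do not describe the engine's real decision logic, which also involves an LLM.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed mapping for illustration only; not LogSage's actual rules.
POLICY_BY_CATEGORY = {
    "hardware": "TEMPORAL-ISOLATION+RESTART",  # isolate suspect nodes, then retry
    "communication": "RESTART",                # assumed transient, retry as-is
    "memory": "STOP",                          # assumed to recur without a change
    "configuration": "STOP",                   # needs a human fix before rerun
}


@dataclass
class Recommendation:
    policy: str
    justification: str
    isolate_nodes: list[str] = field(default_factory=list)


def recommend(category: str, suspect_nodes: Optional[list[str]] = None) -> Recommendation:
    """Pick a policy for an attributed category (hypothetical helper)."""
    # Unknown categories fall back to STOP to avoid burning GPU time on blind retries.
    policy = POLICY_BY_CATEGORY.get(category, "STOP")
    # Node isolation hints only make sense for the isolation policy.
    nodes = list(suspect_nodes or []) if policy == "TEMPORAL-ISOLATION+RESTART" else []
    return Recommendation(
        policy=policy,
        justification=f"Attributed cause: {category}",
        isolate_nodes=nodes,
    )


if __name__ == "__main__":
    print(recommend("hardware", ["gpu-012"]))
    print(recommend("configuration"))
```

Returning the justification and the node list alongside the policy mirrors the README's description of recommendations "with justification" and optional node isolation hints.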
Configuration
Configuration is managed via logsage/auto_resume_policy/config.py (Pydantic settings). Key variables:
- NVIDIA_API_KEY (required in production): API key for NVIDIA AI Endpoints (required for LLM calls)
- FAST_API_ROOT_PATH (optional): Root path when running behind a proxy
- DEBUG (optional): true/false (default true locally; set false in prod)
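The sketch below shows one way these three variables could be read and validated. It deliberately uses only the standard library as a stand-in for the Pydantic settings class in config.py, whose actual field names and validation are not shown in this README. The `Settings` dataclass, `load_settings`, the empty default for FAST_API_ROOT_PATH, and the rule that DEBUG=false implies production are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    nvidia_api_key: Optional[str]
    fast_api_root_path: str
    debug: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables from an environment mapping (hypothetical helper)."""
    # DEBUG defaults to true, matching the documented local default.
    debug = env.get("DEBUG", "true").strip().lower() == "true"
    api_key = env.get("NVIDIA_API_KEY")
    # Assumption: DEBUG=false means production, where the key is required.
    if not debug and not api_key:
        raise ValueError("NVIDIA_API_KEY is required when DEBUG is false")
    return Settings(
        nvidia_api_key=api_key,
        fast_api_root_path=env.get("FAST_API_ROOT_PATH", ""),  # assumed default
        debug=debug,
    )


if __name__ == "__main__":
    print(load_settings({"DEBUG": "false", "NVIDIA_API_KEY": "example-key"}))
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.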
Testing
Run the test suite:
make test
# or
poetry run pytest
Coverage reports are configured in pyproject.toml.
Roadmap and Project Status
- Streamed/async log ingestion paths
- Integration with log collectors such as Loki
- Expanded attribution categories and guardrails
- Improve test coverage and add end-to-end examples
Additional Resources
- Auto-Resume-Policy API & details: logsage/auto_resume_policy/README.md
- Fetcher & deployment details: logsage/fetcher/README.md
- Project configuration and developer tooling: pyproject.toml, Makefile
Internal dashboards (if applicable):
- Grafana (LogSage): https://grafana.nvidia.com/d/aeutclepcu41sf/logsage?orgId=290
- Kibana/ES example index (sandbox): https://gpuwa.nvidia.com/elasticsearch/df-sandbox-ohazai-logsage-test2-202508/_search
Contributing
For setup, development guidelines, and versioning information, see the Contributing Guide.