LLM-based library for AI training job failure attribution and auto-resume policy recommendation
Project description
LogSage
LogSage is an LLM-powered toolkit that analyzes AI training job logs, attributes root causes, and recommends an auto-resume policy to reduce wasted GPU time. It provides:
- Error extraction and de-duplication from SLURM job logs
- Root-cause attribution (e.g., hardware, configuration, memory, communication)
- Action recommendations (restart immediately vs stop) with justification
- Optional node isolation hints for hardware-related failures
Table of Contents
- Description
- Quickstart
- Components
- Running the API
- Configuration
- Testing
- Roadmap and Project Status
- Additional Resources
- Contributing
Description
- Problem: Training jobs fail for many reasons (hardware, networking, configuration, data). Manual root-cause analysis is slow and wastes GPU hours.
- Solution: LogSage uses log parsing + NVIDIA NIM to extract error patterns, attribute likely causes, and recommend whether to restart or stop, with an explanation.
- How it works (high level; see the sketch after this list):
- Receive job logs (client-provided)
- Extract and cluster error lines; remove noise
- Attribute errors with an LLM using structured prompts and heuristics
- Recommend restart/stop and, when applicable, suggest temporal isolation of suspect nodes
- Benefits: Faster triage, increased data center availability, reduced GPU downtime, and improved error extraction coverage by leveraging LLMs.
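The real pipeline lives inside the package; purely as an illustration of the extract → de-duplicate → attribute flow described above, the sketch below shows one plausible shape. Every name in it (ERROR_PATTERNS, extract_error_lines, build_attribution_prompt) is hypothetical and is not LogSage's API.
import re
from collections import OrderedDict

# Hypothetical patterns for illustration; LogSage's real extraction rules differ.
ERROR_PATTERNS = [
    re.compile(r"CUDA error|NCCL (error|timeout)|ECC error", re.IGNORECASE),
    re.compile(r"out of memory|OOM", re.IGNORECASE),
    re.compile(r"Traceback \(most recent call last\)"),
]

def extract_error_lines(log_text: str) -> list[str]:
    """Keep only lines that match a known error pattern (noise removal)."""
    return [line for line in log_text.splitlines()
            if any(p.search(line) for p in ERROR_PATTERNS)]

def deduplicate(lines: list[str]) -> list[str]:
    """Collapse repeats of the same error (e.g. one OOM per rank) into a single line."""
    seen: OrderedDict[str, str] = OrderedDict()
    for line in lines:
        key = re.sub(r"\d+", "<N>", line)  # mask ranks, PIDs, and addresses
        seen.setdefault(key, line)
    return list(seen.values())

def build_attribution_prompt(errors: list[str]) -> str:
    """Structured prompt asking the LLM for a category plus a restart/stop call."""
    return ("Classify the root cause of this training failure as one of "
            "[hardware, configuration, memory, communication, data] and recommend "
            "RESTART or STOP with a one-sentence justification.\n\n" + "\n".join(errors))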
Quickstart
Prerequisites:
- Python >= 3.11
- Poetry (recommended) or pip
Setup (using Poetry):
make install
# or
poetry install --with dev --all-extras
Install via pip:
python3 -m venv myenv
source myenv/bin/activate
pip install -U logsage --index-url=https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple --extra-index-url https://pypi.org/simple
Run the API locally:
python -m logsage.auto_resume_policy.run_server
# or
uvicorn logsage.auto_resume_policy.server:app --host 0.0.0.0 --port 8000 --reload
Open the docs:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Components
logsage/auto_resume_policy/
- FastAPI server exposing endpoints to create an attribution ID, ingest logs, and retrieve analysis
- CLI utilities to analyze local log files
- Recommendations engine that suggests job auto-resume policies: 'STOP', 'RESTART', 'TEMPORAL-ISOLATION+RESTART' (a sketch of the recommendation shape follows this list)
- Detailed API and usage docs: see logsage/auto_resume_policy/README.md
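For orientation only, a recommendation built around those policy strings could be modeled roughly as below; the class and field names are hypothetical and do not mirror LogSage's actual types (only the policy values come from the package).
from dataclasses import dataclass
from enum import Enum

class AutoResumePolicy(str, Enum):
    # The three policy strings listed above; the enum itself is illustrative.
    STOP = "STOP"
    RESTART = "RESTART"
    TEMPORAL_ISOLATION_RESTART = "TEMPORAL-ISOLATION+RESTART"

@dataclass
class Recommendation:
    policy: AutoResumePolicy                  # what the scheduler should do next
    category: str                             # e.g. hardware, configuration, memory, communication
    justification: str                        # short LLM-generated explanation
    suspect_nodes: list[str] | None = None    # populated only for hardware-related failures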
Running the API
The FastAPI service is defined in logsage/auto_resume_policy/server.py and can be run locally during development:
python -m logsage.auto_resume_policy.run_server
Key endpoints (full specs in module README and Swagger):
- GET /healthz: liveness check
- GET /version: version info
- POST /errors/attribution_id: create an attribution ID for a job
- POST /errors/logs: submit logs under the attribution ID
- POST /errors/attribution: run attribution and get recommendation
For request/response schemas and cURL examples, see logsage/auto_resume_policy/README.md.
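As an end-to-end illustration only, the snippet below walks the three /errors endpoints with the requests library. The request and response field names (job_id, log_text, attribution_id, and the shape of the final result) are assumptions; the authoritative schemas are in logsage/auto_resume_policy/README.md and the Swagger UI.
import requests

BASE = "http://localhost:8000"

# 1. Create an attribution ID for the failed job (payload fields are assumed).
resp = requests.post(f"{BASE}/errors/attribution_id", json={"job_id": "slurm-123456"})
attribution_id = resp.json()["attribution_id"]

# 2. Submit the job's logs under that attribution ID (payload fields are assumed).
with open("job-123456.out") as f:
    requests.post(f"{BASE}/errors/logs",
                  json={"attribution_id": attribution_id, "log_text": f.read()})

# 3. Run attribution and read back the recommended auto-resume policy.
result = requests.post(f"{BASE}/errors/attribution",
                       json={"attribution_id": attribution_id}).json()
print(result)  # expected to include a policy such as RESTART or STOP plus a justification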
Configuration
Configuration is managed via logsage/auto_resume_policy/config.py (Pydantic settings). Key variables:
- NVIDIA_API_KEY (required in production): API key for NVIDIA AI Endpoints, used for LLM calls
- FAST_API_ROOT_PATH (optional): root path when running behind a proxy
- DEBUG (optional): true/false (default true locally; set false in production)
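As a rough sketch only (the real settings class lives in logsage/auto_resume_policy/config.py and its field definitions may differ), Pydantic settings typically map these environment variables onto a class along these lines:
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Field names are assumptions; see logsage/auto_resume_policy/config.py for the real ones.
    nvidia_api_key: str | None = None   # NVIDIA_API_KEY: required for LLM calls in production
    fast_api_root_path: str = ""        # FAST_API_ROOT_PATH: set when running behind a proxy
    debug: bool = True                  # DEBUG: true/false; set false in production

settings = Settings()  # values are read from the environment by field name (case-insensitive)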
Testing
Run the test suite:
make test
# or
poetry run pytest
Coverage reports are configured in pyproject.toml.
Roadmap and Project Status
- Add API authentication and rate limiting
- Streamed/async log ingestion paths
- Integration with log collectors such as Loki
- Expanded attribution categories and guardrails
- Improve test coverage and add end-to-end examples
Additional Resources
- Auto-Resume-Policy API & details: logsage/auto_resume_policy/README.md
- Fetcher & deployment details: logsage/fetcher/README.md
- Project configuration and developer tooling: pyproject.toml, Makefile
Internal dashboards (if applicable):
- Grafana (LogSage): https://grafana.nvidia.com/d/aeutclepcu41sf/logsage?orgId=290
- Kibana/ES example index (sandbox): https://gpuwa.nvidia.com/elasticsearch/df-sandbox-ohazai-logsage-test2-202508/_search
Contributing
For setup, development guidelines, and versioning information, see the Contributing Guide.
File details
Details for the file logsage-0.1.1-py3-none-any.whl.
File metadata
- Download URL: logsage-0.1.1-py3-none-any.whl
- Upload date:
- Size: 56.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 872f2a530b9080561dd2a56e24f58b19ffa14e4e06a92b037481ecedabd8ba9a |
| MD5 | 2764084048b0e6aa045b1d96502c1e21 |
| BLAKE2b-256 | a091c158dc4c8fb7f17e5631c4b41be8603ea5f67881a4a716e1f82357b07312 |