LLM-based library for AI training job failure attribution and auto-resume policy recommendation
Project description
LogSage
LogSage is an LLM-powered toolkit that analyzes AI training job logs, attributes root causes, and recommends an auto-resume policy to reduce wasted GPU time. It provides:
- Error extraction and de-duplication from SLURM job logs
- Root-cause attribution (e.g., hardware, configuration, memory, communication)
- Action recommendations (restart immediately vs stop) with justification
- Optional node isolation hints for hardware-related failures
Table of Contents
- Description
- Quickstart
- Components
- Running the API
- Configuration
- Testing
- Roadmap and Project Status
- Additional Resources
- Contributing
Description
- Problem: Training jobs fail for many reasons (hardware, networking, configuration, data). Manual root-cause analysis is slow and wastes GPU hours.
- Solution: LogSage uses log parsing + NVIDIA NIM to extract error patterns, attribute likely causes, and recommend whether to restart or stop, with an explanation.
- How it works (high level; see the sketch after this list):
- Receive job logs (client-provided)
- Extract and cluster error lines; remove noise
- Attribute errors with an LLM using structured prompts and heuristics
- Recommend restart/stop and, when applicable, suggest temporal isolation of suspect nodes
- Benefits: Faster triage, increased data center availability, reduced GPU downtime, and improved error extraction coverage by leveraging LLMs.
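The real pipeline lives inside the package; purely as an illustration of the extract → de-duplicate → attribute flow described above, the sketch below shows one plausible shape. Every name in it (ERROR_PATTERNS, extract_error_lines, build_attribution_prompt) is hypothetical and is not LogSage's API.
import re
from collections import OrderedDict

# Hypothetical patterns for illustration; LogSage's real extraction rules differ.
ERROR_PATTERNS = [
    re.compile(r"CUDA error|NCCL (error|timeout)|ECC error", re.IGNORECASE),
    re.compile(r"out of memory|OOM", re.IGNORECASE),
    re.compile(r"Traceback \(most recent call last\)"),
]

def extract_error_lines(log_text: str) -> list[str]:
    """Keep only lines that match a known error pattern (noise removal)."""
    return [line for line in log_text.splitlines()
            if any(p.search(line) for p in ERROR_PATTERNS)]

def deduplicate(lines: list[str]) -> list[str]:
    """Collapse repeats of the same error (e.g. one OOM per rank) into a single line."""
    seen: OrderedDict[str, str] = OrderedDict()
    for line in lines:
        key = re.sub(r"\d+", "<N>", line)  # mask ranks, PIDs, and addresses
        seen.setdefault(key, line)
    return list(seen.values())

def build_attribution_prompt(errors: list[str]) -> str:
    """Structured prompt asking the LLM for a category plus a restart/stop call."""
    return ("Classify the root cause of this training failure as one of "
            "[hardware, configuration, memory, communication, data] and recommend "
            "RESTART or STOP with a one-sentence justification.\n\n" + "\n".join(errors))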
Quickstart
Prerequisites:
- Python >= 3.11
- Poetry (recommended) or pip
Setup (using Poetry):
make install
# or
poetry install --with dev --all-extras
Install via pip:
python3 -m venv myenv
source myenv/bin/activate
pip install -U logsage --index-url=https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple --extra-index-url https://pypi.org/simple
Run the API locally:
python -m logsage.auto_resume_policy.run_server
# or
uvicorn logsage.auto_resume_policy.server:app --host 0.0.0.0 --port 8000 --reload
Open the docs:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Components
logsage/auto_resume_policy/
- FastAPI server exposing endpoints to create an attribution ID, ingest logs, and retrieve analysis
- CLI utilities to analyze local log files
- Recommendations engine that suggests job auto-resume policies: 'STOP', 'RESTART', 'TEMPORAL-ISOLATION+RESTART' (a sketch of the recommendation shape follows this list)
- Detailed API and usage docs: see logsage/auto_resume_policy/README.md
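For orientation only, a recommendation built around those policy strings could be modeled roughly as below; the class and field names are hypothetical and do not mirror LogSage's actual types (only the policy values come from the package).
from dataclasses import dataclass
from enum import Enum

class AutoResumePolicy(str, Enum):
    # The three policy strings listed above; the enum itself is illustrative.
    STOP = "STOP"
    RESTART = "RESTART"
    TEMPORAL_ISOLATION_RESTART = "TEMPORAL-ISOLATION+RESTART"

@dataclass
class Recommendation:
    policy: AutoResumePolicy                  # what the scheduler should do next
    category: str                             # e.g. hardware, configuration, memory, communication
    justification: str                        # short LLM-generated explanation
    suspect_nodes: list[str] | None = None    # populated only for hardware-related failures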
Running the API
The FastAPI service is defined in logsage/auto_resume_policy/server.py and can be run locally during development:
python -m logsage.auto_resume_policy.run_server
Key endpoints (full specs in module README and Swagger):
- GET /healthz: liveness check
- GET /version: version info
- POST /errors/attribution_id: create an attribution ID for a job
- POST /errors/logs: submit logs under the attribution ID
- POST /errors/attribution: run attribution and get recommendation
For request/response schemas and cURL examples, see logsage/auto_resume_policy/README.md.
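As an end-to-end illustration only, the snippet below walks the three /errors endpoints with the requests library. The request and response field names (job_id, log_text, attribution_id, and the shape of the final result) are assumptions; the authoritative schemas are in logsage/auto_resume_policy/README.md and the Swagger UI.
import requests

BASE = "http://localhost:8000"

# 1. Create an attribution ID for the failed job (payload fields are assumed).
resp = requests.post(f"{BASE}/errors/attribution_id", json={"job_id": "slurm-123456"})
attribution_id = resp.json()["attribution_id"]

# 2. Submit the job's logs under that attribution ID (payload fields are assumed).
with open("job-123456.out") as f:
    requests.post(f"{BASE}/errors/logs",
                  json={"attribution_id": attribution_id, "log_text": f.read()})

# 3. Run attribution and read back the recommended auto-resume policy.
result = requests.post(f"{BASE}/errors/attribution",
                       json={"attribution_id": attribution_id}).json()
print(result)  # expected to include a policy such as RESTART or STOP plus a justification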
Configuration
Configuration is managed via logsage/auto_resume_policy/config.py (Pydantic settings). Key variables:
- NVIDIA_API_KEY (required in production): API key for NVIDIA AI Endpoints, used for LLM calls
- FAST_API_ROOT_PATH (optional): root path when running behind a proxy
- DEBUG (optional): true/false (default true locally; set false in production)
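As a rough sketch only (the real settings class lives in logsage/auto_resume_policy/config.py and its field definitions may differ), Pydantic settings typically map these environment variables onto a class along these lines:
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Field names are assumptions; see logsage/auto_resume_policy/config.py for the real ones.
    nvidia_api_key: str | None = None   # NVIDIA_API_KEY: required for LLM calls in production
    fast_api_root_path: str = ""        # FAST_API_ROOT_PATH: set when running behind a proxy
    debug: bool = True                  # DEBUG: true/false; set false in production

settings = Settings()  # values are read from the environment by field name (case-insensitive)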
Testing
Run the test suite:
make test
# or
poetry run pytest
Coverage reports are configured in pyproject.toml.
Roadmap and Project Status
- Add API authentication and rate limiting
- Streamed/async log ingestion paths
- Integration with log collectors such as Loki
- Expanded attribution categories and guardrails
- Improve test coverage and add end-to-end examples
Additional Resources
- Auto-Resume-Policy API & details: logsage/auto_resume_policy/README.md
- Fetcher & deployment details: logsage/fetcher/README.md
- Project configuration and developer tooling: pyproject.toml, Makefile
Internal dashboards (if applicable):
- Grafana (LogSage): https://grafana.nvidia.com/d/aeutclepcu41sf/logsage?orgId=290
- Kibana/ES example index (sandbox): https://gpuwa.nvidia.com/elasticsearch/df-sandbox-ohazai-logsage-test2-202508/_search
Contributing
For setup, development guidelines, and versioning information, see the Contributing Guide.
File details
Details for the file logsage-0.1.1-py3-none-any.whl.
File metadata
- Download URL: logsage-0.1.1-py3-none-any.whl
- Upload date:
- Size: 56.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 872f2a530b9080561dd2a56e24f58b19ffa14e4e06a92b037481ecedabd8ba9a |
| MD5 | 2764084048b0e6aa045b1d96502c1e21 |
| BLAKE2b-256 | a091c158dc4c8fb7f17e5631c4b41be8603ea5f67881a4a716e1f82357b07312 |