
🛡️ Selgis ML

Autonomous Self-Healing Training Framework for PyTorch & Transformers.


Selgis (Self-Guided Intelligent Stability) is a library that turns unstable neural network training into a reliable, predictable process. It automatically detects loss spikes, NaN/Inf values, and plateaus, then applies dynamic weight rollbacks and learning-rate surges to recover the run.

It is especially effective for LoRA/QLoRA finetuning of LLMs (Llama, Qwen, Mistral) on consumer hardware, where standard trainers often crash with out-of-memory (OOM) errors or degrade due to fp16 instability.


🔥 Why Selgis?

Have you ever woken up in the morning to find your overnight run crashed with Loss: NaN at 80%? Or that the model "forgot" everything it learned due to a bad batch? Selgis solves this.

  • 🛡️ Self-Healing Loop: Automatic rollback to the last stable state upon detecting anomalies (loss spikes / NaN); a minimal sketch follows this list.
  • 🧠 Memory-Safe Architecture: State preservation tracks only the trainable parameters, which allows training Qwen-4B / Llama-7B on cards with 8-12 GB VRAM without OOM during checkpoints.
  • ⚡ Final Surge: If the model gets stuck on a plateau, Selgis can automatically boost the LR by 5-10x to break through local minima (the "defibrillator effect").
  • 📉 Smart Defaults: Built-in LR Finder and adaptive scheduler presets.
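
Conceptually, the recovery loop is simple enough to sketch in plain PyTorch. The snippet below is an illustrative reconstruction, not Selgis's actual internals: the helper names, the CPU-side snapshot strategy, and the 3x spike rule (mirroring the spike_threshold=3.0 default from the Quick Start below) are assumptions made for the example.

import torch

def snapshot_trainable(model):
    # Copy only the parameters with requires_grad=True (e.g. LoRA adapters) to CPU
    return {name: p.detach().cpu().clone()
            for name, p in model.named_parameters() if p.requires_grad}

def restore_trainable(model, state):
    # Roll the trainable parameters back to the last stable snapshot
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in state:
                p.copy_(state[name].to(p.device))

def surge_lr(optimizer, factor=5.0):
    # "Defibrillator": boost the LR to punch through a plateau
    for group in optimizer.param_groups:
        group["lr"] *= factor

# Inside the training loop, roll back instead of crashing:
#   if not torch.isfinite(loss) or loss.item() > 3.0 * running_avg_loss:
#       restore_trainable(model, last_good_state)
#   else:
#       last_good_state = snapshot_trainable(model)

Because only the trainable (adapter) weights are copied, a snapshot of a QLoRA run weighs megabytes rather than gigabytes, which is what keeps checkpointing inside an 8-12 GB VRAM budget.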

📊 Benchmarks

We tested Selgis under extreme conditions on real hardware (Tesla T4 16GB). Here are the results:

| Task | Model | Problem | Selgis Solution | Result |
| --- | --- | --- | --- | --- |
| LLM Finetuning | Qwen-2.5-4B (QLoRA) | OOM on 12GB cards + Loss Spike | Trainable-only state + Rollback | Memory: 8.2 GB, Loss < 0.001 |
| Seq2Seq | LSTM (1.4M) | Catastrophic Spike (Acc 52% → 44%) | Rollback + Surge | +7% Accuracy (Recovered to 59.04%) |
| NLP | BERT-base | Instability on small batch (16) | Stable LR Finder | 100.0% Accuracy (in 3 epochs) |
| CV | CNN (MNIST) | Overfitting & micro-spikes | Micro-rollbacks | 99.09% (Held at generalization peak) |

"Selgis doesn't just prevent explosions. It returns training to a productive track."


🚀 Installation

# Base version (PyTorch only)
pip install selgis

# Full version (with Transformers, LoRA, quantization, and WandB support)
pip install "selgis[all]"

🛠️ Quick Start

1. Robust LLM Training (Llama / Qwen)

Selgis handles protection while you use the familiar Transformers API.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from selgis import TransformerTrainer, TransformerConfig

# Configuration with protection enabled
config = TransformerConfig(
    model_name_or_path="Qwen/Qwen2.5-3B",
    use_peft=True,
    peft_config={
        "r": 8,
        "target_modules": ["q_proj", "v_proj"]
    },

    # Enable Selgis protection
    nan_recovery=True,      # Auto-rollback on NaN/spike
    state_storage="disk",   # Save RAM (store rollback state on disk)
    patience=3              # Wait 3 epochs of stagnation before intervening
)

# Load the model in 4-bit for memory efficiency
model = AutoModelForCausalLM.from_pretrained(
    config.model_name_or_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)

# Start training (train_loader is your DataLoader of tokenized batches)
trainer = TransformerTrainer(model, config, train_loader)
trainer.train()
# You can go to sleep now. If the loss spikes, Selgis fixes it.

2. Standard PyTorch (Any Model)

from selgis import Trainer, SelgisConfig
import torch
from torch.utils.data import DataLoader, TensorDataset

# Your model
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)

# Toy data so the example runs end to end
X, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Config
config = SelgisConfig(
    max_epochs=10,
    lr_finder_enabled=True,  # Auto-find an optimal LR before training starts
    spike_threshold=3.0      # Roll back if the loss jumps 3x
)

trainer = Trainer(
    model=model,
    config=config,
    train_dataloader=loader,
    criterion=torch.nn.CrossEntropyLoss()
)
trainer.train()

💻 CLI (Command Line Interface)

Selgis ships with a handy CLI for diagnostics and quick execution.

| Command | Description |
| --- | --- |
| selgis device | Check GPU/CUDA/MPS availability and print device info. |
| selgis train | Run a minimal demo training on synthetic data (Smoke Test). |
| selgis train --config <path> | Run training using a config file (JSON supported, YAML coming soon; see the example below). |
| selgis version | Print the current library version. |
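
The config-file schema is not documented on this page, so as a rough illustration the JSON below simply mirrors the SelgisConfig fields from the Quick Start; the field names are an assumption, not a confirmed schema.

{
  "max_epochs": 10,
  "lr_finder_enabled": true,
  "spike_threshold": 3.0
}

$ selgis train --config config.json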

Example environment check:

$ selgis device
🚀 Device: cuda
   GPU: NVIDIA Tesla T4
   Memory: 14.75 GB

📚 API Reference

Full technical documentation for SelgisCore, Trainer, Callbacks, and configuration classes is available in API.md.

Key components:

  • SelgisCore: The brain of the system (protection, rollback, state management).
  • TransformerTrainer: Wrapper for the HuggingFace ecosystem.
  • LRFinder: Tool for finding the optimal learning rate (an illustrative range-test sketch follows).
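
For context on what a range test does (the classic procedure from Leslie Smith's "Cyclical Learning Rates" paper, not necessarily how Selgis's LRFinder is implemented): train for a short burst while growing the LR exponentially, record the loss at each step, and pick a value somewhat below the point where the loss starts to diverge. The lr_range_test helper below is hypothetical.

import math
from itertools import cycle
import torch

def lr_range_test(model, loader, criterion, start_lr=1e-7, end_lr=1.0, steps=100):
    # Sweep the LR exponentially across `steps` mini-batches, recording the loss
    optimizer = torch.optim.SGD(model.parameters(), lr=start_lr)
    gamma = (end_lr / start_lr) ** (1.0 / steps)
    history, batches = [], cycle(loader)
    for _ in range(steps):
        x, y = next(batches)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        history.append((optimizer.param_groups[0]["lr"], loss.item()))
        if not math.isfinite(loss.item()):
            break  # the loss has exploded; the sweep is over
        for group in optimizer.param_groups:
            group["lr"] *= gamma
    return history  # pick an LR a bit below the divergence point

Note that a production finder would restore the model's initial weights after the sweep; this sketch omits that for brevity.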

📄 License

Apache 2.0 License. Free for commercial and research use.

Selgis AI — Make training boring (in a good way).
