Universal Training Framework for PyTorch and HuggingFace Transformers

🛡️ Selgis ML

Autonomous Self-Healing Training Framework for PyTorch & Transformers.

Selgis (Self-Guided Intelligent Stability) is a library that turns unstable neural network training into a reliable, predictable process. It automatically detects loss spikes, NaN/Inf values, and plateaus, and recovers the run with dynamic weight rollbacks and learning-rate surges.

Especially effective for LoRA/QLoRA fine-tuning of LLMs (Llama, Qwen, Mistral) on consumer hardware, where standard trainers often crash with out-of-memory errors or degrade due to fp16 instability.


🔥 Why Selgis?

Have you ever woken up in the morning to find your overnight run crashed with Loss: NaN at 80%? Or that the model "forgot" everything it learned due to a bad batch? Selgis solves this.

  • 🛡️ Self-Healing Loop: Automatic rollback to the last stable state upon detecting anomalies (loss spikes / NaN).
  • 🧠 Memory-Safe Architecture: State-preservation logic snapshots only the trainable parameters ("trainable-only"), so Qwen-4B / Llama-7B can be trained on cards with 8-12 GB VRAM without OOM during checkpointing.
  • ⚡ Final Surge: If the model gets stuck on a plateau, Selgis can automatically boost the LR by 5-10x to break through local minima ("defibrillator effect").
  • 📉 Smart Defaults: Built-in LR Finder and adaptive scheduler presets.
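
The detect-then-restore idea behind the self-healing loop can be sketched in plain PyTorch. This is a conceptual illustration only, not Selgis's actual implementation: the names (`snapshot`, `training_step`, the spike rule) are ours, and the real library adds plateau surges, disk-backed state, and offloading on top.

```python
import torch

# Illustrative-only sketch: detect a loss spike or NaN, then roll the
# trainable weights back to the last stable snapshot instead of crashing.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

spike_threshold = 3.0  # treat a >3x loss jump as an anomaly
last_loss = None
# "Trainable-only" snapshot: clone just the parameters that receive gradients.
snapshot = {n: p.detach().clone()
            for n, p in model.named_parameters() if p.requires_grad}

def training_step(x, y):
    global last_loss, snapshot
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    spiked = (not torch.isfinite(loss).item()
              or (last_loss is not None
                  and loss.item() > spike_threshold * last_loss))
    if spiked:
        # Anomaly: restore the last stable weights and skip this batch.
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n in snapshot:
                    p.copy_(snapshot[n])
        return last_loss
    loss.backward()
    optimizer.step()
    last_loss = loss.item()
    snapshot = {n: p.detach().clone()
                for n, p in model.named_parameters() if p.requires_grad}
    return last_loss
```

Because only `requires_grad` parameters are cloned, a LoRA run snapshots megabytes of adapter weights rather than the gigabytes of frozen base weights.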

📊 Benchmarks

We tested Selgis under extreme conditions on real hardware (Tesla T4 16GB). Here are the results:

| Task | Model | Problem | Selgis Solution | Result |
| --- | --- | --- | --- | --- |
| LLM fine-tuning | Qwen-2.5-4B (QLoRA) | OOM on 12 GB cards + loss spike | Trainable-only state + rollback | Memory: 8.2 GB, loss < 0.001 |
| Seq2Seq | LSTM (1.4M) | Catastrophic spike (acc 52% → 44%) | Rollback + surge | +7% accuracy (recovered to 59.04%) |
| NLP | BERT-base | Instability on small batch (16) | Stable LR Finder | 100.0% accuracy (in 3 epochs) |
| CV | CNN (MNIST) | Overfitting & micro-spikes | Micro-rollbacks | 99.09% (held at generalization peak) |

"Selgis doesn't just prevent explosions. It returns training to a productive track."


🚀 Installation

# Base version (PyTorch only)
pip install selgis

# Full version (with Transformers, LoRA, quantization, and WandB support)
pip install "selgis[all]"

🛠️ Quick Start

1. Robust LLM Training (Llama / Qwen)

Selgis handles protection while you use the familiar Transformers API. Now with native BitsAndBytes quantization support.

from selgis import TransformerTrainer, TransformerConfig

# Configuration with native 4-bit quantization and protection
config = TransformerConfig(
    model_name_or_path="Qwen/Qwen-2.5-3B",
    
    # --- Native Quantization (New in v0.2.0) ---
    quantization_type="4bit", 
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
    
    # --- PEFT / LoRA ---
    use_peft=True,
    peft_config={
        "r": 16, 
        "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
    },
    
    # --- Selgis protection ---
    nan_recovery=True,      # Auto-rollback on NaN/Spike
    state_storage="disk",   # Save RAM (store state on disk)

    # --- CPU Offload (New) ---
    cpu_offload=True,       # Offload optimizer states/gradients to CPU
)

# Start training (Trainer handles model loading and quantization automatically)
trainer = TransformerTrainer(model_or_path=config.model_name_or_path, config=config)
trainer.train() 
# You can go to sleep now. If the loss spikes, Selgis fixes it.

2. Standard PyTorch (Any Model)

from selgis import Trainer, SelgisConfig
import torch
from torch.utils.data import DataLoader, TensorDataset

# Your model
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)

# Synthetic data for the example (any DataLoader works here)
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Config
config = SelgisConfig(
    max_epochs=10,
    lr_finder_enabled=True,  # Auto-find optimal LR before start
    spike_threshold=3.0,     # Rollback if loss jumps 3x

    # --- CPU Offload (New) ---
    cpu_offload=True,        # Offload optimizer states/gradients to CPU
)

trainer = Trainer(
    model=model,
    config=config,
    train_dataloader=loader,
    criterion=torch.nn.CrossEntropyLoss(),
)
trainer.train()

💻 CLI (Command Line Interface)

Selgis ships with a handy CLI for diagnostics and quick execution.

| Command | Description |
| --- | --- |
| `selgis device` | Check GPU/CUDA/MPS availability and print device info. |
| `selgis train` | Run a minimal demo training on synthetic data (smoke test). |
| `selgis train --config <path>` | Run training using a config file (YAML/JSON supported). |
| `selgis version` | Print the current library version. |

Example environment check:

$ selgis device
🚀 Device: cuda
   GPU: NVIDIA Tesla T4
   Memory: 14.75 GB
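
For `selgis train --config <path>`, a config file might look like the sketch below. The key names simply mirror the `SelgisConfig` fields shown in the Quick Start; the exact schema the CLI accepts is an assumption here, so consult API.md for the authoritative field list.

```yaml
# hypothetical selgis_config.yaml -- keys mirror SelgisConfig above
max_epochs: 10
lr_finder_enabled: true   # auto-find optimal LR before start
spike_threshold: 3.0      # rollback if loss jumps 3x
nan_recovery: true        # auto-rollback on NaN/spike
state_storage: disk       # store rollback state on disk to save RAM
cpu_offload: true         # offload optimizer states/gradients to CPU
```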

📚 API Reference

Full technical documentation for SelgisCore, Trainer, Callbacks, and configuration classes is available in API.md.

Key components:

  • SelgisCore: The brain of the system (protection, rollback, state management).
  • TransformerTrainer: Wrapper for the HuggingFace ecosystem with native BitsAndBytes support.
  • HistoryCallback: Automatically saves training history to JSON for later analysis.
  • LRFinder: Tool for finding the optimal learning rate.
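
As background for the LRFinder component, the classic LR range test it builds on can be done by hand in a few lines: sweep the learning rate exponentially upward, record the loss at each step, and pick a rate safely below the point where training destabilizes. Everything below is generic PyTorch, not Selgis API, and the one-order-of-magnitude heuristic is just a common rule of thumb.

```python
import torch

# Tiny model and synthetic data for the sweep.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(64, 4), torch.randn(64, 1)

lrs, losses = [], []
lr, factor = 1e-6, 1.3  # grow the LR by 30% each step
while lr < 1.0:
    for g in optimizer.param_groups:
        g["lr"] = lr
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
    lr *= factor

# Heuristic: take the minimum-loss LR, then back off by 10x for safety.
best = lrs[min(range(len(losses)), key=losses.__getitem__)]
suggested = best / 10
```

A real LR finder would also smooth the loss curve and reset the model weights after the sweep; this sketch only shows the core idea.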

📄 License

Apache 2.0 License. Free for commercial and research use.

Selgis AI — Make training boring (in a good way).
