Universal Training Framework for PyTorch and HuggingFace Transformers
🛡️ Selgis ML
Autonomous Self-Healing Training Framework for PyTorch & Transformers.
Selgis (Self-Guided Intelligent Stability) is a library that turns unstable neural network training into a reliable, predictable process. It automatically detects loss spikes, NaN/Inf values, and plateaus, and recovers the run with dynamic weight rollback and learning-rate surges.
Especially effective for LoRA/QLoRA finetuning of LLMs (Llama, Qwen, Mistral) on consumer hardware, where standard trainers often crash with OutOfMemory errors or degrade due to fp16 instability.
🔥 Why Selgis?
Have you ever woken up in the morning to find your overnight run crashed with Loss: NaN at 80%? Or that the model "forgot" everything it learned due to a bad batch? Selgis solves this.
- 🛡️ Self-Healing Loop: Automatic rollback to the last stable state upon detecting anomalies (loss spikes / NaN).
- 🧠 Memory-Safe Architecture: State-preservation logic tracks only trainable parameters (trainable-only). This allows training Qwen-4B / Llama-7B on cards with 8-12 GB VRAM without OOM during checkpoints.
- ⚡ Final Surge: If the model gets stuck on a plateau, Selgis can automatically boost the LR by 5-10x to break through local minima ("defibrillator effect").
- 📉 Smart Defaults: Built-in LR Finder and adaptive scheduler presets.
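Conceptually, the self-healing loop boils down to three steps: snapshot the trainable weights, watch the loss, and roll back when something breaks. Below is a minimal plain-PyTorch sketch of that idea. It illustrates the mechanism only and is not Selgis's actual implementation; the helper names and the `SPIKE_THRESHOLD` value are made up for this example.

```python
import torch

def trainable_state(model):
    # Snapshot only parameters that require grad (e.g. LoRA adapters),
    # which keeps in-memory checkpoints small.
    return {n: p.detach().clone()
            for n, p in model.named_parameters() if p.requires_grad}

def restore(model, snapshot):
    # Copy the saved tensors back into the live parameters.
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in snapshot:
                p.copy_(snapshot[n])

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

snapshot = trainable_state(model)
prev_loss = None
SPIKE_THRESHOLD = 3.0  # hypothetical: "loss jumped 3x" counts as a spike

for step in range(20):
    x = torch.randn(8, 4)
    y = torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y)

    spiked = (not torch.isfinite(loss)) or (
        prev_loss is not None and loss.item() > SPIKE_THRESHOLD * prev_loss
    )
    if spiked:
        restore(model, snapshot)   # rollback to the last stable weights
        for g in opt.param_groups:
            g["lr"] *= 0.5         # cool the LR down after the anomaly
        continue

    opt.zero_grad()
    loss.backward()
    opt.step()
    snapshot = trainable_state(model)  # this step is the new stable state
    prev_loss = loss.item()
```

The trainable-only snapshot is what makes this memory-safe for QLoRA-style runs: the frozen base model is never duplicated, only the adapter weights.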
📊 Benchmarks
We tested Selgis under extreme conditions on real hardware (Tesla T4 16GB). Here are the results:
| Task | Model | Problem | Selgis Solution | Result |
|---|---|---|---|---|
| LLM Finetuning | Qwen-2.5-4B (QLoRA) | OOM on 12GB cards + Loss Spike | Trainable-only state + Rollback | Memory: 8.2 GB, Loss < 0.001 |
| Seq2Seq | LSTM (1.4M) | Catastrophic Spike (Acc 52% → 44%) | Rollback + Surge | +7% Accuracy (Recovered to 59.04%) |
| NLP | BERT-base | Instability on small batch (16) | Stable LR Finder | 100.0% Accuracy (in 3 epochs) |
| CV | CNN (MNIST) | Overfitting & micro-spikes | Micro-rollbacks | 99.09% (Held at generalization peak) |
> "Selgis doesn't just prevent explosions. It returns training to a productive track."
🚀 Installation
```shell
# Base version (PyTorch only)
pip install selgis

# Full version (with Transformers, LoRA, quantization, and WandB support)
pip install "selgis[all]"
```
🛠️ Quick Start
1. Robust LLM Training (Llama / Qwen)
Selgis handles protection while you use the familiar Transformers API. Now with native BitsAndBytes quantization support.
```python
from selgis import TransformerTrainer, TransformerConfig

# Configuration with native 4-bit quantization and protection
config = TransformerConfig(
    model_name_or_path="Qwen/Qwen2.5-3B",

    # --- Native quantization (new in v0.2.0) ---
    quantization_type="4bit",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,

    # --- PEFT / LoRA ---
    use_peft=True,
    peft_config={
        "r": 16,
        "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
    },

    # --- Selgis protection ---
    nan_recovery=True,     # Auto-rollback on NaN/spike
    state_storage="disk",  # Save RAM (store state on disk)
)

# Start training (the trainer handles model loading and quantization automatically)
trainer = TransformerTrainer(model_or_path=config.model_name_or_path, config=config)
trainer.train()

# You can go to sleep now. If the loss spikes, Selgis fixes it.
```
2. Standard PyTorch (Any Model)
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from selgis import Trainer, SelgisConfig

# Your model
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)

# Synthetic data for the demo
loader = DataLoader(
    TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,))),
    batch_size=32,
)

# Config
config = SelgisConfig(
    max_epochs=10,
    lr_finder_enabled=True,  # Auto-find optimal LR before start
    spike_threshold=3.0,     # Rollback if loss jumps 3x
)

trainer = Trainer(
    model=model,
    config=config,
    train_dataloader=loader,
    criterion=torch.nn.CrossEntropyLoss(),
)
trainer.train()
```
💻 CLI (Command Line Interface)
Selgis ships with a handy CLI for diagnostics and quick execution.
| Command | Description |
|---|---|
| `selgis device` | Check GPU/CUDA/MPS availability and print device info. |
| `selgis train` | Run a minimal demo training on synthetic data (smoke test). |
| `selgis train --config <path>` | Run training using a config file (YAML/JSON supported). |
| `selgis version` | Print the current library version. |
Example environment check:
```shell
$ selgis device
🚀 Device: cuda
GPU: NVIDIA Tesla T4
Memory: 14.75 GB
```
📚 API Reference
Full technical documentation for SelgisCore, Trainer, Callbacks, and configuration classes is available in API.md.
Key components:
- SelgisCore: The brain of the system (protection, rollback, state management).
- TransformerTrainer: Wrapper for the HuggingFace ecosystem with native BitsAndBytes support.
- HistoryCallback: Automatically saves training history to JSON for later analysis.
- LRFinder: Tool for finding the optimal learning rate.
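For instance, the JSON history produced by HistoryCallback can be post-processed with the standard library alone. The record schema below (`step` / `loss` / `rolled_back` keys) is a hypothetical illustration, not the documented format; see API.md for the actual fields.

```python
import json
import os
import tempfile

# Hypothetical history format: one record per training step.
history = [
    {"step": i, "loss": 1.0 / (i + 1), "rolled_back": i == 2}
    for i in range(5)
]

# Write it out the way a callback would during training...
path = os.path.join(tempfile.gettempdir(), "selgis_history.json")
with open(path, "w") as f:
    json.dump(history, f)

# ...and later load it back to summarize the run.
with open(path) as f:
    records = json.load(f)

final_loss = records[-1]["loss"]
rollbacks = sum(r["rolled_back"] for r in records)
print(f"final loss: {final_loss:.3f}, rollbacks: {rollbacks}")
```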
📄 License
Apache 2.0 License. Free for commercial and research use.
Selgis AI — Make training boring (in a good way).
File details
Details for the file selgis-0.2.0.tar.gz.
File metadata
- Download URL: selgis-0.2.0.tar.gz
- Upload date:
- Size: 30.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `04a736858e9abd3d65f3ce884b3e31f494a2b6c4a9c8ea916fc46050e0aae79d` |
| MD5 | `6d3871bb47ba153db89cc0fdcbcf0811` |
| BLAKE2b-256 | `b0099e6e275b981449eb1ddd0ce492652b448778e11d6d6b24263f5af364655c` |
File details
Details for the file selgis-0.2.0-py3-none-any.whl.
File metadata
- Download URL: selgis-0.2.0-py3-none-any.whl
- Upload date:
- Size: 31.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e19101ec43306fd499a664d45465d74d72c8a8cc2af72f05a9f1550584c25168` |
| MD5 | `c05d10ed72dc5a42e5eed3c5fb3280df` |
| BLAKE2b-256 | `3ed4705d401640c232ebedcf47ff3823e0e41e0b0b11aa091359fc19451b7da0` |