Selgis ML
Universal Training Framework for PyTorch and HuggingFace Transformers.
Selgis (Self-Guided Intelligent Stability) is a training framework with automatic failure protection.
What is Selgis
03:47 — Training started.
07:00 — Loss: NaN. Training crashed.
07:01 — You realize: hours of work are gone.
Neural network training is fragile. Loss spikes, NaNs, out-of-memory errors, and plateaus can destroy hours of computation. Standard trainers log the error and stop; you debug and restart manually.
Selgis automatically:
- Detects anomalies (NaN, spikes)
- Rolls back to stable state
- Lowers learning rate
- Continues training without your intervention
Installation
# Base (PyTorch only)
pip install selgis
# Full (Transformers, LoRA, quantization)
pip install "selgis[all]"
# Unsloth support (recommended for LLM training)
pip install unsloth
Quick Start
Any PyTorch model
from selgis import Trainer, SelgisConfig
from torch.utils.data import DataLoader
config = SelgisConfig(max_epochs=10)
trainer = Trainer(model, config, train_dataloader)
trainer.train()
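The snippet assumes model and train_dataloader already exist. A minimal toy setup (plain PyTorch, nothing Selgis-specific) could be:
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
# Toy regression model and synthetic data, only to make the example runnable
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
X, y = torch.randn(256, 10), torch.randn(256, 1)
train_dataloader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)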
LLM with LoRA
from selgis import TransformerTrainer, TransformerConfig
config = TransformerConfig(
model_name_or_path="Qwen/Qwen2-0.5B",
use_peft=True,
peft_config={"r": 16, "target_modules": ["q_proj", "v_proj"]},
quantization_type="4bit",
)
trainer = TransformerTrainer("Qwen/Qwen2-0.5B", config=config)
trainer.train()
Features
1. Self-Healing
Automatic recovery from anomalies:
config = SelgisConfig(
nan_recovery=True, # Auto-rollback on NaN/Inf
spike_threshold=3.0, # Rollback on a 3x loss spike
min_history_len=10, # Detection window
)
What happens (sketched below):
- Loss becomes NaN — loads last stable state
- Loss spikes sharply — rollback + LR reduced 50%
- Optimizer momentum cleared
- Training continues
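Conceptually, the recovery loop reduces to something like the sketch below. The structure and names (train_with_recovery, loss_stream) are ours for illustration, not Selgis internals:
import copy, math

def train_with_recovery(model, optimizer, loss_stream, spike_threshold=3.0, window=10):
    history = []
    stable = copy.deepcopy(model.state_dict())
    for loss in loss_stream:
        bad = math.isnan(loss) or math.isinf(loss)
        spiked = (len(history) >= window
                  and loss > spike_threshold * sum(history[-window:]) / window)
        if bad or spiked:
            model.load_state_dict(stable)   # roll back to the last stable weights
            for group in optimizer.param_groups:
                group["lr"] *= 0.5          # halve the learning rate
            optimizer.state.clear()         # drop stale optimizer momentum
            continue                        # keep training, no human intervention
        history.append(loss)
        stable = copy.deepcopy(model.state_dict())  # current step is the new stable point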
2. Memory Optimization
Techniques for large models on small GPUs:
| Technique | Typical savings |
|---|---|
| 4-bit quantization | ~75% of weight memory |
| CPU offload | ~40% |
| Gradient checkpointing | ~40% |
| LoRA | 99.9% fewer trainable parameters |
| Unsloth | ~50% less VRAM, ~2x faster |
config = TransformerConfig(
quantization_type="4bit",
cpu_offload=True,
gradient_checkpointing=True,
use_peft=True,
peft_config={"r": 16},
)
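The 4-bit figure in the table is plain byte arithmetic. A rough illustration for a 7B-parameter model (illustrative numbers, not Selgis benchmarks):
params = 7e9                  # e.g. a 7B-parameter model
fp16_gb = params * 2 / 1e9    # 14.0 GB of weights at 16 bits per parameter
int4_gb = params * 0.5 / 1e9  # 3.5 GB at 4 bits: the ~75% saving in the table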
2.1 Unsloth (NEW)
~2x faster training with ~50% less VRAM:
config = TransformerConfig(
model_name_or_path="Qwen/Qwen2-0.5B",
use_unsloth=True,
use_peft=True,
peft_config={"r": 16},
)
Works with: Llama, Qwen, Mistral, Phi, Gemma, Gemma 4.
3. Final Surge
Automatic plateau escape:
config = SelgisConfig(
patience=5, # epochs without improvement
final_surge_factor=5.0, # LR boost multiplier
)
If there is no improvement for patience epochs, the LR is multiplied by final_surge_factor to escape local minima (sketched below).
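In pseudologic, the plateau rule is just a staleness counter. A sketch under our own naming, not the actual implementation:
def surge_lr(epoch_losses, lr, patience=5, factor=5.0):
    # Apply the plateau rule over a history of per-epoch losses; return the LR.
    best, stale = float("inf"), 0
    for loss in epoch_losses:
        if loss < best:
            best, stale = loss, 0     # improvement: reset the counter
        else:
            stale += 1                # no improvement this epoch
        if stale >= patience:
            lr *= factor              # boost the LR to escape the plateau
            stale = 0
    return lr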
4. LR Finder
Automatic learning rate search:
config = SelgisConfig(
lr_finder_enabled=True,
lr_finder_steps=100,
lr_finder_start=1e-7,
lr_finder_end=1.0,
)
A Leslie Smith-style range test finds a good LR in about 100 steps (sketched below).
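The test sweeps the LR exponentially while training one mini-batch per step. The selection heuristic in the comment is the common rule of thumb, not necessarily what Selgis uses:
start, end, steps = 1e-7, 1.0, 100
gamma = (end / start) ** (1 / steps)              # per-step LR multiplier (~1.175)
lrs = [start * gamma ** i for i in range(steps)]
# Train one mini-batch at each lr and record the loss; a common choice is an LR
# about an order of magnitude below the point where the loss starts to diverge.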
5. Schedulers
Built-in schedulers:
config = SelgisConfig(
scheduler_type="cosine_restart", # cosine, linear, polynomial, constant
warmup_ratio=0.1,
min_lr=1e-7,
t_0=10,
t_mult=2,
)
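Assuming t_0 and t_mult follow the usual SGDR (warm restarts) convention, cycle lengths grow geometrically:
t_0, t_mult = 10, 2
periods = [t_0 * t_mult ** i for i in range(4)]      # [10, 20, 40, 80] epochs per cycle
restarts = [sum(periods[:i + 1]) for i in range(4)]  # restarts at epochs 10, 30, 70, 150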
6. Mixed Precision
config = SelgisConfig(
fp16=True, # FP16 mixed precision
# bf16=True, # or BF16 for Ampere+
)
7. Gradient Management
config = SelgisConfig(
grad_clip_norm=1.0,
# grad_clip_value=0.5,
gradient_accumulation_steps=4,
)
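Accumulation sums gradients from several micro-batches before a single optimizer step, so the effective batch size is the product:
micro_batch, accumulation_steps = 32, 4
effective_batch = micro_batch * accumulation_steps  # 128 samples per optimizer step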
8. Checkpointing
config = SelgisConfig(
output_dir="./output",
save_best_only=True,
save_total_limit=3,
state_storage="disk", # or "memory"
)
9. Callbacks
Extend functionality:
from selgis import (
LoggingCallback,
EarlyStoppingCallback,
CheckpointCallback,
HistoryCallback,
WandBCallback,
SparsityCallback,
)
callbacks = [
LoggingCallback(log_every=10),
CheckpointCallback(output_dir="./checkpoints"),
EarlyStoppingCallback(patience=5, metric="accuracy", mode="max"),
WandBCallback(project="my-project"),
]
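A custom callback would follow the same pattern. The sketch below is hypothetical: the Callback base class and the on_epoch_end hook name are assumptions about the API, so verify against the actual source before copying:
# Hypothetical custom callback; `Callback` and `on_epoch_end(trainer, metrics)`
# are assumed names, not confirmed Selgis API.
from selgis import Callback

class LossPrinter(Callback):
    def on_epoch_end(self, trainer, metrics):
        print(f"epoch finished, loss={metrics.get('loss')}")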
10. Datasets
Unified data API:
from selgis import create_dataloaders, DatasetConfig
# Text (JSONL) - auto-detects format by extension
config = DatasetConfig(
data_type="text",
data_path="./data.jsonl", # .jsonl, .json, .csv, .txt
max_length=512,
)
# Chat datasets - auto-detects alpaca/sharegpt/messages
config = DatasetConfig(
data_type="text",
data_path="./alpaca_data.jsonl", # auto-detects: alpaca, sharegpt, messages
)
# or manually:
config = DatasetConfig(
data_type="text",
data_path="./chat.jsonl",
chat_format="messages",
user_role="user", # role names are configurable; "user" is the default
assistant_role="assistant",
)
# HuggingFace datasets
config = DatasetConfig(
data_type="text",
data_path="tatsu-lab/alpaca", # auto-downloads from HF
)
# Image
config = DatasetConfig(
data_type="image",
data_path="./images",
)
# Streaming (large files)
config = DatasetConfig(
data_type="streaming",
data_path="./large.jsonl",
buffer_size=1000,
)
train_loader, eval_loader = create_dataloaders(config)
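For reference, the three auto-detected chat formats conventionally look like this (shown as Python dicts; field names follow community conventions and are not Selgis-specific):
alpaca   = {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
sharegpt = {"conversations": [{"from": "human", "value": "Hi"},
                              {"from": "gpt", "value": "Hello!"}]}
messages = {"messages": [{"role": "user", "content": "Hi"},
                         {"role": "assistant", "content": "Hello!"}]}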
CLI
# Demo mode
selgis train
# From config
selgis train --config config.yaml
# Check device
selgis device
# Run tests
selgis test
Configuration
| Parameter | Default | Description |
|---|---|---|
| max_epochs | 100 | Max epochs |
| learning_rate | 1e-3 | Base LR |
| batch_size | 32 | Batch size |
| nan_recovery | True | Auto-rollback on NaN/Inf |
| spike_threshold | 3.0 | Loss-spike detection threshold |
| grad_clip_norm | 1.0 | Gradient clipping norm |
| save_best_only | True | Keep only the best checkpoint |
| cpu_offload | False | Offload optimizer state to CPU |
| final_surge_factor | 5.0 | LR boost on plateau |
Examples
Full examples: example_selgis.py
# Basic
from selgis import Trainer, SelgisConfig
config = SelgisConfig(max_epochs=10)
trainer = Trainer(model, config, loader)
trainer.train()
# LoRA
from selgis import TransformerTrainer, TransformerConfig
config = TransformerConfig(model_name_or_path="Qwen/Qwen2-0.5B", use_peft=True)
trainer = TransformerTrainer("Qwen/Qwen2-0.5B", config)
trainer.train()
# Callbacks
from selgis import LoggingCallback, CheckpointCallback
trainer = Trainer(model, config, loader, callbacks=[
LoggingCallback(log_every=10),
CheckpointCallback(output_dir="./ckpt"),
])
Dependencies
# Base
torch>=2.0, numpy>=1.20, tqdm
# Optional
transformers>=4.30, datasets, accelerate>=0.21.0
peft>=0.5.0
bitsandbytes>=0.41.0
wandb
pytest
Limitations
- DeepSpeed — partial support (v0.3.0)
- FSDP — in development
Future Plans
- Unsloth integration — DONE (v0.2.6)
  - 2x faster training, 50% less VRAM
  - Llama, Qwen, Mistral, Phi, Gemma, Gemma 4 support
  - Run locally or from HuggingFace
- DeepSpeed full — complete ZeRO, pipeline
- FSDP — Fully Sharded Data Parallel
- Distributed Training — DDP, multi-GPU
- More schedulers — OneCycle, ReduceLROnPlateau
- MLflow integration — W&B alternative
License: Apache 2.0