
Selgis ML

Universal Training Framework for PyTorch and HuggingFace Transformers.

Selgis (Self-Guided Intelligent Stability) is a training framework with automatic failure protection.


What is Selgis?

03:47 — Training started.
07:00 — Loss: NaN. Training crashed.
07:01 — You realize: hours of work are gone.

Neural network training is fragile. Loss spikes, NaN values, OOM errors, and plateaus can destroy hours of computation. Standard trainers log the error and stop, leaving you to debug and restart manually.

Selgis automatically:

  • Detects anomalies (NaN, spikes)
  • Rolls back to stable state
  • Lowers learning rate
  • Continues training without your intervention

Installation

# Base (PyTorch only)
pip install selgis

# Full (Transformers, LoRA, quantization)
pip install "selgis[all]"

# Unsloth support (recommended for LLM training)
pip install unsloth

Quick Start

Any PyTorch model

from selgis import Trainer, SelgisConfig
from torch.utils.data import DataLoader

config = SelgisConfig(max_epochs=10)
trainer = Trainer(model, config, train_dataloader)
trainer.train()

LLM with LoRA

from selgis import TransformerTrainer, TransformerConfig

config = TransformerConfig(
    model_name_or_path="Qwen/Qwen2-0.5B",
    use_peft=True,
    peft_config={"r": 16, "target_modules": ["q_proj", "v_proj"]},
    quantization_type="4bit",
)

trainer = TransformerTrainer("Qwen/Qwen2-0.5B", config=config)
trainer.train()

Features

1. Self-Healing

Automatic recovery from anomalies:

config = SelgisConfig(
    nan_recovery=True,       # auto-rollback on NaN/Inf
    spike_threshold=3.0,     # rollback when loss spikes to 3x the recent average
    min_history_len=10,      # detection window
)

What happens:

  1. Loss becomes NaN — loads last stable state
  2. Loss spikes sharply — rollback + LR reduced 50%
  3. Optimizer momentum cleared
  4. Training continues
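The recovery loop above can be sketched in plain Python. This is only an illustration of the idea, not Selgis's actual internals; the class and method names here are hypothetical:

```python
import math
from collections import deque

class SelfHealingLoop:
    """Toy model of the rollback logic: NaN restores the snapshot;
    a spike restores the snapshot and halves the LR; optimizer
    momentum is cleared either way."""

    def __init__(self, spike_threshold=3.0, min_history_len=10, lr=1e-3):
        self.spike_threshold = spike_threshold
        self.history = deque(maxlen=min_history_len)
        self.lr = lr
        self.stable_weights = None   # last known-good snapshot
        self.momentum = {}           # stand-in for optimizer state

    def observe(self, loss, weights):
        """Return (status, weights to continue training with)."""
        if not math.isfinite(loss):
            return "nan", self._rollback()
        if len(self.history) == self.history.maxlen:
            avg = sum(self.history) / len(self.history)
            if loss >= self.spike_threshold * avg:
                self.lr *= 0.5               # spike: also cut LR by 50%
                return "spike", self._rollback()
        self.history.append(loss)
        self.stable_weights = list(weights)  # checkpoint the good state
        return "ok", weights

    def _rollback(self):
        self.momentum.clear()                # clear optimizer momentum
        return list(self.stable_weights)     # restore last stable weights
```

The key design point is that rollback is cheap: only the last stable state is kept, so recovery is a memory copy rather than a restart from disk.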

2. Memory Optimization

Techniques for large models on small GPUs:

| Technique | Savings |
| --- | --- |
| 4-bit quantization | 75% |
| CPU offload | 40% |
| Gradient checkpointing | 40% |
| LoRA (trainable params only) | 99.9% |
| Unsloth | 50% less VRAM, 2x faster |

config = TransformerConfig(
    quantization_type="4bit",
    cpu_offload=True,
    gradient_checkpointing=True,
    use_peft=True,
    peft_config={"r": 16},
)
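A back-of-envelope calculation shows where the savings in the table come from. The numbers are illustrative (a 7B-parameter model is assumed) and cover weights only; activations and optimizer state add more:

```python
params = 7e9                      # e.g. a 7B-parameter model

fp16_gb = params * 2 / 2**30      # 2 bytes per weight
int4_gb = params * 0.5 / 2**30    # 0.5 bytes per weight -> 75% smaller

print(f"fp16 weights: {fp16_gb:.1f} GB, 4-bit weights: {int4_gb:.1f} GB")

# LoRA: only small adapter matrices are trained, so optimizer state
# (often 2x the trainable parameters, in fp32) shrinks by ~99.9% too.
trainable = params * (1 - 0.999)
print(f"LoRA trainable parameters: ~{trainable:,.0f}")
```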

2.1 Unsloth (NEW)

~2x faster training with ~50% less VRAM:

config = TransformerConfig(
    model_name_or_path="Qwen/Qwen2-0.5B",
    use_unsloth=True,
    use_peft=True,
    peft_config={"r": 16},
)

Works with: Llama, Qwen, Mistral, Phi, Gemma, Gemma 4.

3. Final Surge

Automatic plateau escape:

config = SelgisConfig(
    patience=5,               # epochs without improvement
    final_surge_factor=5.0,   # LR boost multiplier
)

If there is no improvement for `patience` epochs, the learning rate is multiplied by `final_surge_factor` to help escape local minima.
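The mechanism can be sketched as a patience counter in plain Python (a hypothetical helper, not the library's internals):

```python
def surged_lr(lr, losses, patience=5, surge_factor=5.0):
    """Multiply the LR whenever `patience` epochs pass without a new best loss."""
    best, stale = float("inf"), 0
    for loss in losses:
        if loss < best:
            best, stale = loss, 0   # improvement resets the counter
        else:
            stale += 1
        if stale >= patience:
            lr *= surge_factor      # boost to escape the plateau
            stale = 0
    return lr

# Five flat epochs after an initial improvement trigger one surge:
print(surged_lr(1e-3, [0.5, 0.6, 0.6, 0.6, 0.6, 0.6]))
```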

4. LR Finder

Automatic learning rate search:

config = SelgisConfig(
    lr_finder_enabled=True,
    lr_finder_steps=100,
    lr_finder_start=1e-7,
    lr_finder_end=1.0,
)

An LR range test in the style of Leslie Smith: the learning rate is swept over 100 steps and a good starting value is picked from the loss curve.
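The idea behind the range test, as a sketch (function names here are illustrative, not the Selgis API):

```python
def sweep_lr(step, steps=100, start=1e-7, end=1.0):
    """Exponentially grow the LR from `start` to `end` over `steps` steps."""
    return start * (end / start) ** (step / (steps - 1))

def suggest_lr(lrs, losses):
    """A common heuristic: pick the LR where the loss dropped fastest."""
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    return lrs[drops.index(max(drops))]
```

In practice the sweep is stopped early once the loss diverges, and the suggested LR is taken somewhat below the divergence point.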

5. Schedulers

Built-in schedulers:

config = SelgisConfig(
    scheduler_type="cosine_restart",  # cosine, linear, polynomial, constant
    warmup_ratio=0.1,
    min_lr=1e-7,
    t_0=10,
    t_mult=2,
)
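For reference, `cosine_restart` with `t_0=10, t_mult=2` behaves like SGDR-style cosine annealing with warm restarts. A sketch of the schedule (warmup, which would be applied first, is ignored here):

```python
import math

def cosine_restart_lr(step, base_lr=1e-3, min_lr=1e-7, t_0=10, t_mult=2):
    """Cosine-anneal from base_lr to min_lr; each restart cycle is t_mult longer."""
    t_i, t_cur = t_0, step
    while t_cur >= t_i:           # locate the current restart cycle
        t_cur -= t_i
        t_i *= t_mult
    cos = (1 + math.cos(math.pi * t_cur / t_i)) / 2
    return min_lr + (base_lr - min_lr) * cos
```

So the LR falls from `base_lr` to near `min_lr` over 10 epochs, snaps back to `base_lr`, then repeats over 20 epochs, 40 epochs, and so on.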

6. Mixed Precision

config = SelgisConfig(
    fp16=True,   # FP16 mixed precision
    # bf16=True, # or BF16 for Ampere+
)

7. Gradient Management

config = SelgisConfig(
    grad_clip_norm=1.0,
    # grad_clip_value=0.5,
    gradient_accumulation_steps=4,
)
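Both mechanisms are easy to picture in plain Python (gradients shown as simple lists for illustration):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients so the global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

# Gradient accumulation: 4 micro-batches act like one 4x-larger batch.
# Each micro-batch loss is divided by the step count before backward,
# so the accumulated gradient matches a single large-batch gradient.
micro_losses = [2.0, 2.4, 1.6, 2.0]
effective_loss = sum(loss / 4 for loss in micro_losses)
```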

8. Checkpointing

config = SelgisConfig(
    output_dir="./output",
    save_best_only=True,
    save_total_limit=3,
    state_storage="disk",     # or "memory"
)
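`save_total_limit` rotates old checkpoints. The semantics can be sketched like this (a hypothetical class, not the library's implementation):

```python
from collections import deque

class CheckpointRotation:
    """Keep only the newest `save_total_limit` checkpoints."""

    def __init__(self, save_total_limit=3):
        self.kept = deque()
        self.limit = save_total_limit
        self.deleted = []

    def save(self, path):
        self.kept.append(path)                        # write the new checkpoint
        while len(self.kept) > self.limit:
            self.deleted.append(self.kept.popleft())  # evict the oldest

rot = CheckpointRotation()
for epoch in range(5):
    rot.save(f"ckpt-epoch{epoch}")
```

With `save_best_only=True`, the best checkpoint is kept in addition to this rotation.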

9. Callbacks

Extend functionality:

from selgis import (
    LoggingCallback,
    EarlyStoppingCallback,
    CheckpointCallback,
    HistoryCallback,
    WandBCallback,
    SparsityCallback,
)

callbacks = [
    LoggingCallback(log_every=10),
    CheckpointCallback(output_dir="./checkpoints"),
    EarlyStoppingCallback(patience=5, metric="accuracy", mode="max"),
    WandBCallback(project="my-project"),
]
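Custom callbacks follow the usual hook pattern. A minimal sketch (the hook name and signature here are illustrative; check the Selgis API for the real base class):

```python
class Callback:
    """Minimal hook interface."""
    def on_epoch_end(self, epoch, logs):
        pass

class CollectLossCallback(Callback):
    """Record the loss every `log_every` epochs."""
    def __init__(self, log_every=1):
        self.log_every = log_every
        self.records = []

    def on_epoch_end(self, epoch, logs):
        if epoch % self.log_every == 0:
            self.records.append((epoch, logs["loss"]))

# The trainer fires each hook on every registered callback:
cb = CollectLossCallback(log_every=2)
for epoch in range(4):
    for callback in [cb]:
        callback.on_epoch_end(epoch, {"loss": 1.0 / (epoch + 1)})
```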

10. Datasets

Unified data API:

from selgis import create_dataloaders, DatasetConfig

# Text (JSONL) - auto-detects format by extension
config = DatasetConfig(
    data_type="text",
    data_path="./data.jsonl",  # .jsonl, .json, .csv, .txt
    max_length=512,
)

# Chat datasets - auto-detects alpaca/sharegpt/messages
config = DatasetConfig(
    data_type="text",
    data_path="./alpaca_data.jsonl",  # auto-detects: alpaca, sharegpt, messages
)
# or manually:
config = DatasetConfig(
    data_type="text",
    data_path="./chat.jsonl",
    chat_format="messages",
    user_role="user",            # role names ("user"/"assistant" are the defaults)
    assistant_role="assistant",
)

# HuggingFace datasets
config = DatasetConfig(
    data_type="text",
    data_path="tatsu-lab/alpaca",  # auto-downloads from HF
)

# Image
config = DatasetConfig(
    data_type="image",
    data_path="./images",
)

# Streaming (large files)
config = DatasetConfig(
    data_type="streaming",
    data_path="./large.jsonl",
    buffer_size=1000,
)

train_loader, eval_loader = create_dataloaders(config)
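The chat-format auto-detection can be pictured as key-based sniffing on a sample record (a sketch of the idea; Selgis's actual rules may differ):

```python
def detect_chat_format(record):
    """Guess the chat schema from the keys of one sample record."""
    if "messages" in record:
        return "messages"        # [{"role": "user", "content": ...}, ...]
    if "conversations" in record:
        return "sharegpt"        # [{"from": "human", "value": ...}, ...]
    if "instruction" in record:
        return "alpaca"          # instruction / input / output fields
    return "plain"
```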

CLI

# Demo mode
selgis train

# From config
selgis train --config config.yaml

# Check device
selgis device

# Run tests
selgis test

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| max_epochs | 100 | Maximum training epochs |
| learning_rate | 1e-3 | Base learning rate |
| batch_size | 32 | Batch size |
| nan_recovery | True | Auto-rollback on NaN/Inf |
| spike_threshold | 3.0 | Loss-spike detection factor |
| grad_clip_norm | 1.0 | Gradient clipping norm |
| save_best_only | True | Keep only the best checkpoint |
| cpu_offload | False | Offload optimizer state to CPU |
| final_surge_factor | 5.0 | LR boost on plateau |

Examples

Full examples: example_selgis.py

# Basic
from selgis import Trainer, SelgisConfig
config = SelgisConfig(max_epochs=10)
trainer = Trainer(model, config, loader)
trainer.train()

# LoRA
from selgis import TransformerTrainer, TransformerConfig
config = TransformerConfig(model_name_or_path="Qwen/Qwen2-0.5B", use_peft=True)
trainer = TransformerTrainer("Qwen/Qwen2-0.5B", config)
trainer.train()

# Callbacks
from selgis import LoggingCallback, CheckpointCallback
trainer = Trainer(model, config, loader, callbacks=[
    LoggingCallback(log_every=10),
    CheckpointCallback(output_dir="./ckpt"),
])

Dependencies

# Base
torch>=2.0, numpy>=1.20, tqdm

# Optional
transformers>=4.30, datasets, accelerate>=0.21.0
peft>=0.5.0
bitsandbytes>=0.41.0
wandb
pytest

Limitations

  • DeepSpeed — partial support (v0.3.0)
  • FSDP — in development

Future Plans

  • Unsloth integration — DONE (v0.2.6)

    • 2x faster training, 50% less VRAM
    • Llama, Qwen, Mistral, Phi, Gemma, Gemma 4 support
    • Run locally or from HuggingFace
  • Full DeepSpeed — complete ZeRO stages, pipeline parallelism

  • FSDP — Fully Sharded Data Parallel

  • Distributed Training — DDP, multi-GPU

  • More schedulers — OneCycle, ReduceLROnPlateau

  • MLflow integration — W&B alternative



License: Apache 2.0
