Selgis ML

Universal Training Framework for PyTorch and HuggingFace Transformers.

Selgis (Self-Guided Intelligent Stability) is a training framework with automatic failure protection.



What is Selgis

03:47 — Training started.
07:00 — Loss: NaN. Training crashed.
07:01 — You realize: hours of work are gone.

Neural network training is fragile. Loss spikes, NaN, OOM and plateaus can destroy hours of computation. Standard trainers log the error and stop — you debug and restart manually.

Selgis automatically:

  • Detects anomalies (NaN, spikes)
  • Rolls back to stable state
  • Lowers learning rate
  • Continues training without your intervention

Installation

# Base (PyTorch only)
pip install selgis

# Full (Transformers, LoRA, quantization)
pip install "selgis[all]"

# Unsloth support (recommended for LLM training)
pip install unsloth

Quick Start

Any PyTorch model

from selgis import Trainer, SelgisConfig
from torch.utils.data import DataLoader

config = SelgisConfig(max_epochs=10)
trainer = Trainer(model, config, train_dataloader)
trainer.train()

LLM with LoRA

from selgis import TransformerTrainer, TransformerConfig

config = TransformerConfig(
    model_name_or_path="Qwen/Qwen2-0.5B",
    use_peft=True,
    peft_config={"r": 16, "target_modules": ["q_proj", "v_proj"]},
    quantization_type="4bit",
)

trainer = TransformerTrainer("Qwen/Qwen2-0.5B", config=config)
trainer.train()

Features

1. Self-Healing

Automatic recovery from anomalies:

config = SelgisConfig(
    nan_recovery=True,       # Auto-rollback on NaN/Inf
    spike_threshold=3.0,     # Rollback when loss spikes 3x
    min_history_len=10,      # Detection window
)

What happens:

  1. Loss becomes NaN: the last stable state is loaded
  2. Loss spikes sharply: rollback, and the LR is reduced by 50%
  3. Optimizer momentum is cleared
  4. Training continues
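The recovery loop above can be sketched in plain Python. This is a simplified illustration of the idea, not Selgis internals; the loss stream stands in for real training steps, and the 3x spike rule and 50% LR cut follow the defaults described above:

```python
import math

def train_with_self_healing(losses, lr=1e-3, spike_threshold=3.0):
    """Replay a stream of per-step losses, rolling back on NaN/Inf or spikes.

    A real trainer would restore model/optimizer state from a snapshot;
    here "rollback" simply discards the bad step and halves the LR.
    Returns (accepted_losses, final_lr).
    """
    stable = []  # accepted loss history (the "stable state")
    for loss in losses:
        bad = math.isnan(loss) or math.isinf(loss)
        spiked = bool(stable) and loss > spike_threshold * stable[-1]
        if bad or spiked:
            lr *= 0.5          # rollback: drop the step, halve the LR
            continue
        stable.append(loss)
    return stable, lr

accepted, lr = train_with_self_healing([1.0, 0.8, float("nan"), 0.7, 9.0, 0.6])
# the NaN step and the 9.0 spike are discarded; the LR is halved twice
```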

2. Memory Optimization

Techniques for large models on small GPUs:

| Technique | Savings |
| --- | --- |
| 4-bit quantization | 75% |
| CPU offload | 40% |
| Gradient checkpointing | 40% |
| LoRA (trainable params only) | 99.9% |
| Unsloth | 50% less VRAM, 2x faster |

config = TransformerConfig(
    quantization_type="4bit",
    cpu_offload=True,
    gradient_checkpointing=True,
    use_peft=True,
    peft_config={"r": 16},
)

2.1 Unsloth (NEW)

~2x faster training with ~50% less VRAM:

config = TransformerConfig(
    model_name_or_path="Qwen/Qwen2-0.5B",
    use_unsloth=True,
    use_peft=True,
    peft_config={"r": 16},
)

Works with: Llama, Qwen, Mistral, Phi, Gemma, Gemma 4.

3. Final Surge

Automatic plateau escape:

config = SelgisConfig(
    patience=5,               # epochs without improvement
    final_surge_factor=5.0,   # LR boost multiplier
)

If there is no improvement for 5 epochs, the LR is multiplied by final_surge_factor to escape local minima.
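The mechanism can be illustrated with a small plain-Python simulation (assumed semantics, not the Selgis implementation: a patience counter resets on improvement, and the LR is boosted once the counter hits `patience`):

```python
def apply_final_surge(val_losses, lr=1e-3, patience=5, surge_factor=5.0):
    """Track the best validation loss; boost the LR when stuck on a plateau."""
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= surge_factor  # surge to escape the plateau
                stale = 0
    return lr

# one "improvement" epoch then 5 flat epochs -> a single surge: 1e-3 * 5
final_lr = apply_final_surge([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
```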

4. LR Finder

Automatic learning rate search:

config = SelgisConfig(
    lr_finder_enabled=True,
    lr_finder_steps=100,
    lr_finder_start=1e-7,
    lr_finder_end=1.0,
)

A Leslie Smith-style range test: finds a good LR in 100 steps.
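The core of such a range test is an exponential LR sweep from `lr_finder_start` to `lr_finder_end`. A minimal sketch of the schedule (a hypothetical helper, not the Selgis API):

```python
def lr_schedule(start=1e-7, end=1.0, steps=100):
    """Exponentially increase the LR from `start` to `end` over `steps` steps.

    A real LR finder runs one training step per LR, records the loss,
    and picks an LR just below the point where the loss starts diverging.
    """
    mult = (end / start) ** (1 / (steps - 1))  # constant per-step multiplier
    return [start * mult**i for i in range(steps)]

lrs = lr_schedule()  # 100 LRs spanning 1e-7 .. 1.0
```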

5. Schedulers

Built-in schedulers:

config = SelgisConfig(
    scheduler_type="cosine_restart",  # cosine, linear, polynomial, constant
    warmup_ratio=0.1,
    min_lr=1e-7,
    t_0=10,
    t_mult=2,
)
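Under the usual SGDR formulation (which `cosine_restart` with `t_0`/`t_mult` appears to follow; assumed here, not confirmed by the source), the LR anneals from base to `min_lr` over each cycle, and each restart cycle is `t_mult` times longer. A pure-Python sketch of the per-epoch LR:

```python
import math

def cosine_restart_lr(epoch, base_lr=1e-3, min_lr=1e-7, t_0=10, t_mult=2):
    """LR at `epoch` under cosine annealing with warm restarts (SGDR)."""
    t_cur, t_i = epoch, t_0
    while t_cur >= t_i:   # find the position within the current cycle
        t_cur -= t_i
        t_i *= t_mult     # each restart cycle is t_mult x longer
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t_cur / t_i))

# epoch 0 -> base LR; epoch t_0 -> restart, back to base LR
```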

6. Mixed Precision

config = SelgisConfig(
    fp16=True,   # FP16 mixed precision
    # bf16=True, # or BF16 for Ampere+
)

7. Gradient Management

config = SelgisConfig(
    grad_clip_norm=1.0,
    # grad_clip_value=0.5,
    gradient_accumulation_steps=4,
)
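Gradient accumulation trades steps for memory: with `gradient_accumulation_steps=4` and `batch_size=32`, gradients from 4 micro-batches are averaged before each optimizer step, giving an effective batch of 128. The arithmetic in plain Python (scalar "gradients" stand in for tensors):

```python
def accumulate(micro_batch_grads, accumulation_steps=4):
    """Average gradients over each accumulation window; one update per window."""
    running, updates = 0.0, []
    for i, g in enumerate(micro_batch_grads, start=1):
        running += g / accumulation_steps  # scale each micro-batch's gradient
        if i % accumulation_steps == 0:
            updates.append(running)        # one optimizer step per window
            running = 0.0
    return updates

# 4 micro-batches -> one update equal to their mean gradient
updates = accumulate([1.0, 2.0, 3.0, 4.0])
```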

8. Checkpointing

config = SelgisConfig(
    output_dir="./output",
    save_best_only=True,
    save_total_limit=3,
    state_storage="disk",     # or "memory"
)
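`save_total_limit` keeps only the newest N checkpoints. The pruning logic is simple; the sketch below uses hypothetical checkpoint names and an ordered list (newest last) purely for illustration:

```python
def prune_checkpoints(paths, save_total_limit=3):
    """Split an ordered checkpoint list (oldest first) into (kept, deleted)."""
    if len(paths) <= save_total_limit:
        return list(paths), []
    cut = len(paths) - save_total_limit
    return paths[cut:], paths[:cut]  # keep the newest, delete the oldest

kept, dropped = prune_checkpoints(["ckpt-1", "ckpt-2", "ckpt-3", "ckpt-4"])
# kept == ["ckpt-2", "ckpt-3", "ckpt-4"], dropped == ["ckpt-1"]
```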

9. Callbacks

Extend functionality:

from selgis import (
    LoggingCallback,
    EarlyStoppingCallback,
    CheckpointCallback,
    HistoryCallback,
    WandBCallback,
    SparsityCallback,
)

callbacks = [
    LoggingCallback(log_every=10),
    CheckpointCallback(output_dir="./checkpoints"),
    EarlyStoppingCallback(patience=5, metric="accuracy", mode="max"),
    WandBCallback(project="my-project"),
]
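Custom callbacks typically subclass a base class and override lifecycle hooks. The hook name `on_epoch_end` and the `logs` dict below are assumptions for illustration and may not match the actual Selgis callback API:

```python
class PrintBestCallback:
    """Hypothetical callback: track and report the best loss seen so far."""

    def __init__(self):
        self.best = float("inf")

    def on_epoch_end(self, epoch, logs):
        # `logs` is assumed to carry per-epoch metrics, e.g. {"loss": 0.7}
        loss = logs.get("loss", float("inf"))
        if loss < self.best:
            self.best = loss
            print(f"epoch {epoch}: new best loss {loss:.4f}")

cb = PrintBestCallback()
cb.on_epoch_end(0, {"loss": 0.9})
cb.on_epoch_end(1, {"loss": 0.7})
```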

10. Datasets

Unified data API:

from selgis import create_dataloaders, DatasetConfig

# Text (JSONL) - auto-detects format by extension
config = DatasetConfig(
    data_type="text",
    data_path="./data.jsonl",  # .jsonl, .json, .csv, .txt
    max_length=512,
)

# Chat datasets - auto-detects alpaca/sharegpt/messages
config = DatasetConfig(
    data_type="text",
    data_path="./alpaca_data.jsonl",  # auto-detects: alpaca, sharegpt, messages
)
# or manually:
config = DatasetConfig(
    data_type="text",
    data_path="./chat.jsonl",
    chat_format="messages",
    user_role="user",            # role names used in the data ("user"/"assistant" are the defaults)
    assistant_role="assistant",
)

# HuggingFace datasets
config = DatasetConfig(
    data_type="text",
    data_path="tatsu-lab/alpaca",  # auto-downloads from HF
)

# Image
config = DatasetConfig(
    data_type="image",
    data_path="./images",
)

# Streaming (large files)
config = DatasetConfig(
    data_type="streaming",
    data_path="./large.jsonl",
    buffer_size=1000,
)

train_loader, eval_loader = create_dataloaders(config)

CLI

# Demo mode
selgis train

# From config
selgis train --config config.yaml

# Check device
selgis device

# Run tests
selgis test

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| max_epochs | 100 | Max epochs |
| learning_rate | 1e-3 | Base LR |
| batch_size | 32 | Batch size |
| nan_recovery | True | Auto-rollback on NaN/Inf |
| spike_threshold | 3.0 | Spike detection threshold |
| grad_clip_norm | 1.0 | Gradient clipping norm |
| save_best_only | True | Save only the best checkpoint |
| cpu_offload | False | CPU optimizer offload |
| final_surge_factor | 5.0 | LR boost on plateau |

Examples

Full examples: example_selgis.py

# Basic
from selgis import Trainer, SelgisConfig
config = SelgisConfig(max_epochs=10)
trainer = Trainer(model, config, loader)
trainer.train()

# LoRA
from selgis import TransformerTrainer, TransformerConfig
config = TransformerConfig(model_name_or_path="Qwen/Qwen2-0.5B", use_peft=True)
trainer = TransformerTrainer("Qwen/Qwen2-0.5B", config)
trainer.train()

# Callbacks
from selgis import LoggingCallback, CheckpointCallback
trainer = Trainer(model, config, loader, callbacks=[
    LoggingCallback(log_every=10),
    CheckpointCallback(output_dir="./ckpt"),
])

Dependencies

# Base
torch>=2.0, numpy>=1.20, tqdm

# Optional
transformers>=4.30, datasets, accelerate>=0.21.0
peft>=0.5.0
bitsandbytes>=0.41.0
wandb
pytest

Limitations

  • DeepSpeed — partial support (v0.3.0)
  • FSDP — in development

Future Plans

  • Unsloth integration — DONE (v0.2.6)

    • 2x faster training, 50% less VRAM
    • Llama, Qwen, Mistral, Phi, Gemma, Gemma 4 support
    • Run locally or from HuggingFace
  • Full DeepSpeed support: complete ZeRO stages, pipeline parallelism

  • FSDP — Fully Sharded Data Parallel

  • Distributed Training — DDP, multi-GPU

  • More schedulers — OneCycle, ReduceLROnPlateau

  • MLflow integration — W&B alternative


License: Apache 2.0
