
Selgis ML

Universal Training Framework for PyTorch and HuggingFace Transformers.

Selgis (Self-Guided Intelligent Stability) is a training framework with automatic failure protection.



What is Selgis

03:47 — Training started.
07:00 — Loss: NaN. Training crashed.
07:01 — You realize: hours of work are gone.

Neural network training is fragile. Loss spikes, NaNs, OOM errors, and plateaus can destroy hours of computation. Standard trainers log the error and stop — you debug and restart manually.

Selgis automatically:

  • Detects anomalies (NaN, spikes)
  • Rolls back to stable state
  • Lowers learning rate
  • Continues training without your intervention

Installation

# Base (PyTorch only)
pip install selgis

# Full (Transformers, LoRA, quantization)
pip install "selgis[all]"

Quick Start

Any PyTorch model

from selgis import Trainer, SelgisConfig
from torch.utils.data import DataLoader

# model is any torch.nn.Module; train_dataloader is a standard DataLoader
config = SelgisConfig(max_epochs=10)
trainer = Trainer(model, config, train_dataloader)
trainer.train()
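
For a paste-and-run variant, here is a self-contained sketch with a toy model and dataset as stand-ins for your own. The Trainer call is unchanged; how Trainer pairs each batch with a loss is left to the library, so treat the toy pieces as placeholders.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from selgis import Trainer, SelgisConfig

# Toy regression data and model; replace with your own.
X, y = torch.randn(256, 10), torch.randn(256, 1)
train_dataloader = DataLoader(TensorDataset(X, y), batch_size=32)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

config = SelgisConfig(max_epochs=10)
trainer = Trainer(model, config, train_dataloader)
trainer.train()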

LLM with LoRA

from selgis import TransformerTrainer, TransformerConfig

config = TransformerConfig(
    model_name_or_path="Qwen/Qwen2-0.5B",
    use_peft=True,
    peft_config={"r": 16, "target_modules": ["q_proj", "v_proj"]},
    quantization_type="4bit",
)

trainer = TransformerTrainer("Qwen/Qwen2-0.5B", config=config)
trainer.train()

Features

1. Self-Healing

Automatic recovery from anomalies:

config = SelgisConfig(
    nan_recovery=True,       # auto-rollback on NaN/Inf loss
    spike_threshold=3.0,     # rollback when loss exceeds 3x the recent average
    min_history_len=10,      # window used for spike detection
)

What happens:

  1. Loss becomes NaN — the last stable state is loaded
  2. Loss spikes sharply — rollback, plus a 50% LR reduction
  3. Optimizer momentum is cleared
  4. Training continues
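
A minimal sketch of the idea (illustrative only, not the actual Selgis internals; the helper name and the per-step deepcopy snapshots are assumptions for brevity):

import collections, copy, math

def train_step_with_healing(model, optimizer, loss_fn, batch, state, history,
                            spike_threshold=3.0, min_history_len=10):
    # Illustrative self-healing step. `state` must be seeded with an initial
    # snapshot before training; a real implementation would snapshot less often.
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)

    window = history[-min_history_len:]
    recent = sum(window) / len(window) if window else 0.0
    is_nan = not math.isfinite(loss.item())
    is_spike = len(history) >= min_history_len and loss.item() > spike_threshold * recent

    if is_nan or is_spike:
        model.load_state_dict(state["model"])            # roll back to stable weights
        optimizer.load_state_dict(state["optimizer"])
        for g in optimizer.param_groups:                 # reduce LR by 50%
            g["lr"] *= 0.5
        optimizer.state = collections.defaultdict(dict)  # clear momentum
        return None                                      # skip the step, keep training

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    history.append(loss.item())
    state["model"] = copy.deepcopy(model.state_dict())
    state["optimizer"] = copy.deepcopy(optimizer.state_dict())
    return loss.item()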

2. Memory Optimization

Techniques for large models on small GPUs:

Technique                Savings
4-bit quantization       75%
CPU offload              40%
Gradient checkpointing   40%
LoRA (trainable only)    99.9%

config = TransformerConfig(
    quantization_type="4bit",
    cpu_offload=True,
    gradient_checkpointing=True,
    use_peft=True,
    peft_config={"r": 16},
)
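
For orientation, outside Selgis these options correspond roughly to the following stock transformers / peft / bitsandbytes calls (a sketch of the equivalent setup, not Selgis internals):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B",
                                             quantization_config=bnb)
model.gradient_checkpointing_enable()   # trade recompute for activation memory

lora = LoraConfig(r=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()      # only a tiny fraction of weights train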

3. Final Surge

Automatic plateau escape:

config = SelgisConfig(
    patience=5,               # epochs without improvement
    final_surge_factor=5.0,   # LR boost multiplier
)

If 5 epochs pass without improvement, the LR is multiplied by final_surge_factor to help escape a local minimum.
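
The mechanism fits in a few lines (illustrative logic with a hypothetical maybe_surge helper, not Selgis internals):

def maybe_surge(optimizer, val_loss, state, patience=5, surge_factor=5.0):
    # Track the best loss; after `patience` bad epochs, boost the LR.
    if val_loss < state["best"]:
        state["best"], state["bad_epochs"] = val_loss, 0
        return
    state["bad_epochs"] += 1
    if state["bad_epochs"] >= patience:
        for g in optimizer.param_groups:
            g["lr"] *= surge_factor      # boost the LR out of the plateau
        state["bad_epochs"] = 0

state = {"best": float("inf"), "bad_epochs": 0}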

4. LR Finder

Automatic learning rate search:

config = SelgisConfig(
    lr_finder_enabled=True,
    lr_finder_steps=100,
    lr_finder_start=1e-7,
    lr_finder_end=1.0,
)

A Leslie Smith-style range test — finds a good LR in 100 steps by sweeping from lr_finder_start to lr_finder_end.
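
The range test itself is easy to sketch (illustrative; lr_range_test is a hypothetical helper, and the divide-by-10 pick is a common heuristic, not necessarily what Selgis does):

import copy, itertools

def lr_range_test(model, optimizer, loss_fn, loader,
                  start=1e-7, end=1.0, steps=100):
    # Sweep the LR exponentially and record the loss at each step.
    snapshot = copy.deepcopy(model.state_dict())   # restore weights afterwards
    gamma = (end / start) ** (1.0 / steps)
    lr, losses = start, []
    for inputs, targets in itertools.islice(itertools.cycle(loader), steps):
        for g in optimizer.param_groups:
            g["lr"] = lr
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        losses.append((lr, loss.item()))
        lr *= gamma
    model.load_state_dict(snapshot)
    # Common heuristic: pick an LR roughly 10x below the loss minimum.
    return min(losses, key=lambda p: p[1])[0] / 10.0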

5. Schedulers

Built-in schedulers:

config = SelgisConfig(
    scheduler_type="cosine_restart",  # cosine, linear, polynomial, constant
    warmup_ratio=0.1,
    min_lr=1e-7,
    t_0=10,
    t_mult=2,
)
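
For reference, cosine_restart with these settings matches the behavior of PyTorch's built-in scheduler; presumably something like:

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# t_0, t_mult and min_lr line up with PyTorch's T_0, T_mult and eta_min.
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-7)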

6. Mixed Precision

config = SelgisConfig(
    fp16=True,   # FP16 mixed precision
    # bf16=True, # or BF16 for Ampere+
)
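
With fp16=True, each step follows the standard PyTorch autocast + GradScaler pattern (a sketch of that pattern, not Selgis internals):

import torch

def fp16_step(model, optimizer, loss_fn, batch, scaler):
    # One FP16 training step on CUDA tensors.
    inputs, targets = batch
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()   # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)          # unscales gradients, then steps
    scaler.update()                 # adjust the scale for the next step
    return loss.item()

scaler = torch.cuda.amp.GradScaler()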

7. Gradient Management

config = SelgisConfig(
    grad_clip_norm=1.0,
    # grad_clip_value=0.5,
    gradient_accumulation_steps=4,
)
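
Together these amount to the usual accumulate-then-clip loop (illustrative; train_epoch is a hypothetical helper):

import torch

def train_epoch(model, optimizer, loss_fn, loader,
                grad_clip_norm=1.0, accumulation_steps=4):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader, start=1):
        loss = loss_fn(model(inputs), targets) / accumulation_steps
        loss.backward()                       # gradients accumulate across steps
        if step % accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_norm)
            optimizer.step()
            optimizer.zero_grad()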

8. Checkpointing

config = SelgisConfig(
    output_dir="./output",
    save_best_only=True,
    save_total_limit=3,
    state_storage="disk",     # or "memory"
)
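
The save_best_only / save_total_limit combination boils down to logic like this (save_checkpoint is a hypothetical helper, not the Selgis API):

import os
import torch

def save_checkpoint(model, metric, state, output_dir="./output", save_total_limit=3):
    # Save only on improvement, pruning the oldest checkpoints beyond the limit.
    if metric >= state["best"]:              # save_best_only: skip non-improvements
        return
    state["best"] = metric
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"ckpt-{state['count']}.pt")
    torch.save(model.state_dict(), path)
    state["count"] += 1
    state["saved"].append(path)
    while len(state["saved"]) > save_total_limit:
        os.remove(state["saved"].pop(0))     # save_total_limit: drop the oldest

state = {"best": float("inf"), "count": 0, "saved": []}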

9. Callbacks

Extend functionality:

from selgis import (
    LoggingCallback,
    EarlyStoppingCallback,
    CheckpointCallback,
    HistoryCallback,
    WandBCallback,
    SparsityCallback,
)

callbacks = [
    LoggingCallback(log_every=10),
    CheckpointCallback(output_dir="./checkpoints"),
    EarlyStoppingCallback(patience=5, metric="accuracy", mode="max"),
    WandBCallback(project="my-project"),
]

10. Datasets

Unified data API:

from selgis import create_dataloaders, DatasetConfig

# Text (JSONL)
config = DatasetConfig(
    data_type="text",
    data_path="./data.jsonl",
    max_length=512,
)

# Image
config = DatasetConfig(
    data_type="image",
    data_path="./images",
)

# Streaming (large files)
config = DatasetConfig(
    data_type="streaming",
    data_path="./large.jsonl",
    buffer_size=1000,
)

train_loader, eval_loader = create_dataloaders(config)

CLI

# Demo mode
selgis train

# From config
selgis train --config config.yaml

# Check device
selgis device

# Run tests
selgis test

Configuration

Parameter           Default  Description
max_epochs          100      Maximum training epochs
learning_rate       1e-3     Base learning rate
batch_size          32       Batch size
nan_recovery        True     Auto-rollback on NaN/Inf loss
spike_threshold     3.0      Loss-spike detection threshold
grad_clip_norm      1.0      Gradient clipping norm
save_best_only      True     Save only the best checkpoint
cpu_offload         False    Offload optimizer state to CPU
final_surge_factor  5.0      LR boost on plateau

Examples

Full examples: example_selgis.py

# Basic
from selgis import Trainer, SelgisConfig
config = SelgisConfig(max_epochs=10)
trainer = Trainer(model, config, loader)
trainer.train()

# LoRA
from selgis import TransformerTrainer, TransformerConfig
config = TransformerConfig(model_name_or_path="Qwen/Qwen2-0.5B", use_peft=True)
trainer = TransformerTrainer("Qwen/Qwen2-0.5B", config)
trainer.train()

# Callbacks
from selgis import LoggingCallback, CheckpointCallback
trainer = Trainer(model, config, loader, callbacks=[
    LoggingCallback(log_every=10),
    CheckpointCallback(output_dir="./ckpt"),
])

Dependencies

# Base
torch>=2.0, numpy>=1.20, tqdm

# Optional
transformers>=4.30, datasets, accelerate>=0.21.0
peft>=0.5.0
bitsandbytes>=0.41.0
wandb
pytest

Limitations

  • DeepSpeed — partial support (v0.3.0)
  • FSDP — in development

Future Plans

  • Unsloth integration — priority
    • 2x faster training, 50% less VRAM
    • LoRA, QLoRA, RoPE scaling
    • Llama, Qwen, Mistral, Phi support
  • DeepSpeed full support — complete ZeRO, pipeline parallelism
  • FSDP — Fully Sharded Data Parallel
  • Distributed training — DDP, multi-GPU
  • More schedulers — OneCycle, ReduceLROnPlateau
  • Streaming datasets — petabyte-scale
  • MLflow integration — W&B alternative


License: Apache 2.0

