A clean, modular framework for training large language models with modern PyTorch features
Project description
Optimus-DL
Optimus-DL is a modular, high-performance research framework for training Large Language Models (LLMs) and other deep learning models. It leverages modern PyTorch features (AMP, DDP, Compile) and a flexible, composition-based architecture.
Key Features
- Modular "Recipe" Architecture: Clean separation between model definitions, data pipelines, and training logic.
- Hydra-based Configuration: Hierarchical, type-safe, and easily conveniently override-able configurations.
- Universal Metrics System: Lazy evaluation and automatic distributed aggregation of metrics.
- Unified Logging: First-class support for Weights & Biases, MLflow (with async and system metrics), and JSONL.
- Modern PyTorch: Built-in support for Mixed Precision (AMP), FSDP2, Tensor Parallelism, Sequence Parallelism, and
torch.compile. - Efficient Kernels: Integrated support for Liger-Kernel for memory-efficient and fast RMSNorm, SwiGLU, and CrossEntropy.
- Registry System: easy dependency injection and component swapping via a centralized registry.
The core idea of making everything modular and replacable is to make research experiments easy to implement cleanly.
Supported Models
Optimus-DL includes highly optimized implementations of:
- Llama 2 / 3: Full support for GQA, RoPE, and various sharding strategies.
- Qwen: Support for Qwen-style attention (Q/K Norm) and architectures.
- GPT-2: Classic architecture for baselining.
Quick Start
Installation
# Clone the repository
git clone <repository-url>
cd optimus-dl
# Install in editable mode with dependencies
pip install -e .
Training
Training is orchestrated via scripts/train.py using Hydra configs.
# Run with default configuration
python scripts/train.py
# Override specific parameters
python scripts/train.py model=gpt2 optimization.batch_size=64 common.use_gpu=true
# Your own config
python scripts/train.py --config-name=train_llama
Writing Train Configs
This project uses Hydra and OmegaConf for configuration management. Configurations are hierarchical and composable, allowing you to mix and match models, datasets, and training strategies.
Structure & Interpolation
Configs are located in configs/train/. A typical training config composes defaults (model, optimizer, scheduler) and then overrides specific parameters.
We use a special args section as a "scratch space" for high-level variables. These are referenced throughout the config using OmegaConf's interpolation syntax ${...}. This ensures consistency (e.g., setting seq_len in one place updates both the model and the data pipeline).
_name: base
args:
name: my-experiment
batch_size: 64
seq_len: 1024
# ... later in the config ...
optimization:
iterations: ${args.iterations}
data:
scratch:
base_transforms:
_name: compose
transforms:
# ...
- _name: flat_batcher
batch_size: ${args.batch_size} # Interpolated from args
seq_len: ${args.seq_len}
Data Pipelines & data.scratch
The data section typically defines train_datasets and eval_datasets. To avoid repeating complex transform chains, we define them in data.scratch and reference them via interpolation.
data:
scratch:
# Define the transform chain once
my_transform:
_name: compose
transforms:
- _name: tokenize
tokenizer_config: {_name: tiktoken, name: gpt2}
- _name: to_device
train_datasets:
source:
_name: loop
inner: {_name: preset_dataset, split: train}
# Reference the transform
transform: ${data.scratch.my_transform}
Hydra & Omegaconf Extra Quick Guide
Here are some power-user features you'll likely use:
- Overriding Defaults: You can swap out entire components from the command line.
# Switch the model to GPT-2 and optimizer to SGD
python scripts/train.py model=gpt2 optimization/optimizer=sgd
- Multirun (
-m): Run multiple experiments sequentially with a sweep.
# Run 3 experiments with different learning rates
python scripts/train.py -m optimization.optimizer.lr=1e-3,1e-4,1e-5
-
Interpolation: Reference other config values dynamically.
${layout.param}: Standard interpolation.${oc.env:VAR_NAME}: Read from environment variableVAR_NAME.${.relative_param}: Relative path interpolation.${eval:expression}: Evaluate a Python expression. For example,${eval:"'string' + '_suffix'"}or${eval:"int(100 * 0.5)"}. This is defined inoptimus_dl/core/omegaconf.py.
-
Debugging: See the resolved configuration without running the code.
# Print the full config structure
python scripts/train.py --config-name=train_llama -c job
Framework Internals
Understanding these core components is crucial for advanced usage and research extensions.
Registry System
The framework relies heavily on a registry pattern to decouple configuration from implementation. This allows you to swap components (models, optimizers, schedulers) just by changing the _name field in the config.
- Location:
optimus_dl/core/registry.py - Usage:
from optimus_dl.core.registry import make_registry
# Create a new registry
registry, register, build = make_registry("my_component")
@register("my_impl")
class MyImplementation:
def __init__(self, param): ...
# Build from config
obj = build(RegistryConfig(_name="my_impl", param=1))
Data Pipeline
Data loading is split into Sources and Transforms.
- Source: Yields raw items (e.g., text, examples).
- Transforms: A chain of operations (Tokenize -> Chunk -> Shuffle -> Batch -> ToDevice).
This design allows for highly reusable data processing pipelines. Complex transform chains are often defined in data.scratch and referenced in dataset configs.
Checkpointing
We use PyTorch's Distributed Checkpoint (DCP) API for efficient, sharded saving/loading of large models.
- Structure: Checkpoints are directories containing sharded tensor data and a metadata file.
- Manager:
CheckpointManagerhandles the complexity of saving model, optimizer, scheduler, and dataloader states. - Auto-Resume: The training loop automatically detects the latest checkpoint in the output directory and resumes from it.
LoadStrategy:
For fine-tuning or experiments, you might want to load only parts of a checkpoint. The LoadStrategy class (optimus_dl/modules/checkpoint/load_strategy.py) controls this.
load_model(bool): Load model weights.load_optimizer(bool): Load optimizer state.load_scheduler(bool): Load learning rate scheduler state.load_data_sources(bool): Load data source state (e.g. readers position).load_dataloaders(bool): Load full dataloader state.load_metrics(bool): Load accumulated metrics.load_iteration(bool): Resume iteration count.extra_ignore_keys(list): Specific keys to ignore in the checkpoint state dict.
Advanced Usage
Model Transforms
Optimus-DL applies transformations to the model after initialization but before training. This is where distributed wrappers and compilation happen.
- Config:
model_transformslist intrain.yaml. - Common Transforms:
ddp: Standard DistributedDataParallel.fully_shard: PyTorch FSDP2 (Fully Sharded Data Parallel). Supports mixed precision, CPU offloading, and mesh sharding.compile:torch.compilefor graph optimization.
model_transforms:
- _name: fully_shard
mixed_precision:
param_dtype: bfloat16
reduce_dtype: float32
- _name: compile
Evaluation with lm_eval
The framework integrates with the Language Model Evaluation Harness for standardized benchmarks.
- Script:
scripts/eval.py - Config:
configs/eval/default.yaml
# Evaluate a checkpoint on Hellaswag and MMLU
python scripts/eval.py \
common.checkpoint_path=outputs/my-run/checkpoint_00010000 \
lm_eval.tasks=[hellaswag,mmlu] \
lm_eval.batch_size=8
More advanced:
python scripts/eval.py --config-name quick_pretrained \
common.checkpoint_path=null ++common.model._name=preset_hfllama2 ++common.model.hf_model_name=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
lm_eval.tasks=[hellaswag,mmlu] \
lm_eval.batch_size=4
Serving Models
Optimus-DL provides a simple serving script for deploying trained models as an OpenAI-compatible API endpoint. This uses scripts/serve.py.
- Script:
scripts/serve.py - Config:
configs/serve/
# Serve a TinyLlama model
python scripts/serve.py --config-name=tinyllama
Make requests:
curl -X POST http://127.0.0.1:8000//v1/chat/completions \
-d '{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}], "max_tokens": 100, "temperature": 0.01}'
curl -X POST http://localhost:8000/v1/completions -d '{"prompt": "All:", "max_tokens": 50, "temperature": 0.01}'
Project Structure
optimus_dl/: Main package source code.core/: Fundamental utilities (logging, registry, device management).modules/: Pluggable components (models, optimizers, data loaders).recipe/: Orchestration logic (training loops, evaluation).
configs/: Hierarchical Hydra configuration files.scripts/: Entry points.
Development
The project enforces strict code quality standards.
# Run tests
pytest
# Format code
black .
isort .
ruff check --fix .
License
MIT License.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file optimus_dl-0.1.5.tar.gz.
File metadata
- Download URL: optimus_dl-0.1.5.tar.gz
- Upload date:
- Size: 323.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a6cd4e71090b6d510fd4c62bad0222210d8f17213df0f80e024f8e4dc748c0af
|
|
| MD5 |
fd11e42b243a16ae667d8c4d8d2bc6ac
|
|
| BLAKE2b-256 |
c344c4c36351ea7ef9a6865f05100993e00d20afe591e217ebda183c4a16e7a2
|
Provenance
The following attestation bundles were made for optimus_dl-0.1.5.tar.gz:
Publisher:
publish.yml on alexdremov/optimus-dl
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
optimus_dl-0.1.5.tar.gz -
Subject digest:
a6cd4e71090b6d510fd4c62bad0222210d8f17213df0f80e024f8e4dc748c0af - Sigstore transparency entry: 1180953949
- Sigstore integration time:
-
Permalink:
alexdremov/optimus-dl@aa55958ba674e9d0666bb3273c06eb25b5eebe0a -
Branch / Tag:
refs/tags/v0.1.5 - Owner: https://github.com/alexdremov
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@aa55958ba674e9d0666bb3273c06eb25b5eebe0a -
Trigger Event:
release
-
Statement type:
File details
Details for the file optimus_dl-0.1.5-py3-none-any.whl.
File metadata
- Download URL: optimus_dl-0.1.5-py3-none-any.whl
- Upload date:
- Size: 253.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
830e423c9a173b4ad6c1b813e468e8a15c004c4358aeee7018c2a8c6528eccd4
|
|
| MD5 |
b71571699f0eb092f6b674432d5020b7
|
|
| BLAKE2b-256 |
d36b3cde0eac193ecae60207fa3a95e6c5ad03dc54a1c39d1a0892fe0c064828
|
Provenance
The following attestation bundles were made for optimus_dl-0.1.5-py3-none-any.whl:
Publisher:
publish.yml on alexdremov/optimus-dl
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
optimus_dl-0.1.5-py3-none-any.whl -
Subject digest:
830e423c9a173b4ad6c1b813e468e8a15c004c4358aeee7018c2a8c6528eccd4 - Sigstore transparency entry: 1180953957
- Sigstore integration time:
-
Permalink:
alexdremov/optimus-dl@aa55958ba674e9d0666bb3273c06eb25b5eebe0a -
Branch / Tag:
refs/tags/v0.1.5 - Owner: https://github.com/alexdremov
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@aa55958ba674e9d0666bb3273c06eb25b5eebe0a -
Trigger Event:
release
-
Statement type: