A clean, modular framework for training large language models with modern PyTorch features

Optimus-DL

Optimus-DL is a modular, high-performance research framework for training Large Language Models (LLMs) and other deep learning models. It leverages modern PyTorch features (AMP, DDP, Compile) and a flexible, composition-based architecture.

Key Features

  • Modular "Recipe" Architecture: Clean separation between model definitions, data pipelines, and training logic.
  • Hydra-based Configuration: Hierarchical, type-safe, and easily overridable configurations.
  • Universal Metrics System: Lazy evaluation and automatic distributed aggregation of metrics.
  • Modern PyTorch: Built-in support for Mixed Precision (AMP), Distributed Data Parallel (DDP), and torch.compile.
  • Registry System: Easy dependency injection and component swapping via a centralized registry.

The core idea behind making everything modular and replaceable is to keep research experiments easy to implement cleanly.

Quick Start

Installation

# Clone the repository
git clone <repository-url>
cd optimus-dl

# Install in editable mode with dependencies
pip install -e .

Training

Training is orchestrated via scripts/train.py using Hydra configs.

# Run with default configuration
python scripts/train.py

# Override specific parameters
python scripts/train.py model=gpt2 optimization.batch_size=64 common.use_gpu=true

# Your own config
python scripts/train.py --config-name=train_llama

Writing Train Configs

This project uses Hydra and OmegaConf for configuration management. Configurations are hierarchical and composable, allowing you to mix and match models, datasets, and training strategies.

Structure & Interpolation

Configs are located in configs/train/. A typical training config composes defaults (model, optimizer, scheduler) and then overrides specific parameters.

We use a special args section as a "scratch space" for high-level variables. These are referenced throughout the config using OmegaConf's interpolation syntax ${...}. This ensures consistency (e.g., setting seq_len in one place updates both the model and the data pipeline).

_name: base
args:
  name: my-experiment
  batch_size: 64
  seq_len: 1024

# ... later in the config ...
optimization:
  iterations: ${args.iterations} # Requires an iterations entry in args (not shown above)

data:
  scratch:
    base_transforms:
      _name: compose
      transforms:
        # ...
        - _name: flat_batcher
          batch_size: ${args.batch_size} # Interpolated from args
          seq_len: ${args.seq_len}
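
To make the mechanics concrete, here is a toy resolver in plain Python that shows how a `${args.seq_len}`-style reference resolves against a nested config. It is only an illustration under simplifying assumptions: OmegaConf performs the real resolution (lazily, and with many more features), the helper names `lookup` and `resolve` are hypothetical, and the `model.max_seq_len` key is made up for the example.

```python
import re

# Matches a value that is exactly one reference, such as "${args.seq_len}".
_REF = re.compile(r"^\$\{([\w.]+)\}$")


def lookup(root: dict, dotted: str):
    """Follow a dotted path like 'args.seq_len' from the config root."""
    node = root
    for key in dotted.split("."):
        node = node[key]
    return node


def resolve(node, root=None):
    """Return a copy of `node` with every full-string ${a.b} reference replaced."""
    root = node if root is None else root
    if isinstance(node, dict):
        return {k: resolve(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(v, root) for v in node]
    if isinstance(node, str):
        match = _REF.match(node)
        if match:
            # Resolve recursively so a reference to another reference also works.
            return resolve(lookup(root, match.group(1)), root)
    return node


config = {
    "args": {"batch_size": 64, "seq_len": 1024},
    "model": {"max_seq_len": "${args.seq_len}"},  # hypothetical key
    "data": {
        "batcher": {
            "batch_size": "${args.batch_size}",
            "seq_len": "${args.seq_len}",
        }
    },
}

resolved = resolve(config)
print(resolved["model"]["max_seq_len"], resolved["data"]["batcher"])
```

Changing `seq_len` once under `args` changes both the model and the batcher values, which is the consistency property the `args` section is meant to provide.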

Data Pipelines & data.scratch

The data section typically defines train_datasets and eval_datasets. To avoid repeating complex transform chains, we define them in data.scratch and reference them via interpolation.

data:
  scratch:
    # Define the transform chain once
    my_transform:
      _name: compose
      transforms:
        - _name: tokenize
          tokenizer_config: {_name: tiktoken, name: gpt2}
        - _name: to_device

  train_datasets:
    source:
      _name: loop
      inner: {_name: preset_dataset, split: train}
    # Reference the transform
    transform: ${data.scratch.my_transform}

Hydra & OmegaConf Quick Guide

Here are some power-user features you'll likely use:

  • Overriding Defaults: You can swap out entire components from the command line.

    # Switch the model to GPT-2 and optimizer to SGD
    python scripts/train.py model=gpt2 optimization/optimizer=sgd
    
  • Multirun (-m): Run multiple experiments sequentially with a sweep.

    # Run 3 experiments with different learning rates
    python scripts/train.py -m optimization.optimizer.lr=1e-3,1e-4,1e-5
    
  • Interpolation: Reference other config values dynamically.

    • ${layout.param}: Standard interpolation.
    • ${oc.env:VAR_NAME}: Read from environment variable VAR_NAME.
    • ${.relative_param}: Relative path interpolation.
    • ${eval:expression}: Evaluate a Python expression. For example, ${eval:"'string' + '_suffix'"} or ${eval:"int(100 * 0.5)"}. This is defined in optimus_dl/core/omegaconf.py.
  • Debugging: Print the composed configuration without running the code.

    # Print the full config structure (add --resolve to also resolve ${...} interpolations)
    python scripts/train.py --config-name=train_llama -c job
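
When several overrides in a multirun carry comma-separated values, Hydra's default sweeper launches one job per combination. The sketch below mimics that expansion in plain Python; `expand_sweep` is a hypothetical helper (not part of Hydra or Optimus-DL), `model=llama` is an assumed config name, and values that themselves contain commas, such as lists, are not handled.

```python
from itertools import product


def expand_sweep(overrides: list[str]) -> list[list[str]]:
    """Expand 'key=a,b' overrides into one override list per job."""
    choices = []
    for override in overrides:
        key, _, values = override.partition("=")
        choices.append([f"{key}={value}" for value in values.split(",")])
    # One job per combination, in the order product() yields them.
    return [list(combo) for combo in product(*choices)]


jobs = expand_sweep(["optimization.optimizer.lr=1e-3,1e-4", "model=gpt2,llama"])
for job in jobs:
    print("python scripts/train.py " + " ".join(job))
```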
    

Framework Internals

Understanding these core components is crucial for advanced usage and research extensions.

Registry System

The framework relies heavily on a registry pattern to decouple configuration from implementation. This allows you to swap components (models, optimizers, schedulers) just by changing the _name field in the config.

  • Location: optimus_dl/core/registry.py
  • Usage:
    # RegistryConfig is assumed to live in the same module as make_registry
    from optimus_dl.core.registry import RegistryConfig, make_registry
    
    # Create a new registry
    registry, register, build = make_registry("my_component")
    
    @register("my_impl")
    class MyImplementation:
        def __init__(self, param): ...
    
    # Build from config
    obj = build(RegistryConfig(_name="my_impl", param=1))
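
For readers unfamiliar with the pattern, the following self-contained sketch shows one way a registry of this shape could work. It is not the code in optimus_dl/core/registry.py: `make_simple_registry`, `ComponentConfig`, and `ConstantSchedule` are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ComponentConfig:
    """Hypothetical stand-in for a config carrying a name plus constructor kwargs."""
    name: str
    kwargs: dict[str, Any]


def make_simple_registry(kind: str):
    """Return (registry, register, build) for one kind of component."""
    registry: dict[str, Callable[..., Any]] = {}

    def register(name: str):
        def decorator(cls):
            if name in registry:
                raise KeyError(f"{kind} '{name}' is already registered")
            registry[name] = cls
            return cls
        return decorator

    def build(config: ComponentConfig, **extra: Any):
        try:
            factory = registry[config.name]
        except KeyError:
            known = sorted(registry)
            raise KeyError(f"unknown {kind} '{config.name}'; known: {known}") from None
        # Config kwargs and call-site extras are both passed to the constructor.
        return factory(**config.kwargs, **extra)

    return registry, register, build


registry, register, build = make_simple_registry("scheduler")


@register("constant")
class ConstantSchedule:
    def __init__(self, lr: float):
        self.lr = lr


schedule = build(ComponentConfig(name="constant", kwargs={"lr": 3e-4}))
print(type(schedule).__name__, schedule.lr)
```

Because the config only carries a name, swapping the implementation means registering another class under a different name and changing one string in the config.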
    

Data Pipeline

Data loading is split into Sources and Transforms.

  • Source: Yields raw items (e.g., text, examples).
  • Transforms: A chain of operations (Tokenize -> Chunk -> Shuffle -> Batch -> ToDevice).

This design allows for highly reusable data processing pipelines. Complex transform chains are often defined in data.scratch and referenced in dataset configs.
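
The Source/Transform split can be pictured as a chain of Python generators. The sketch below is an assumption-laden toy rather than the framework's classes: the whitespace "tokenizer", the function names, and the choice to drop partial sequences and batches are all illustrative.

```python
from typing import Iterable, Iterator


def source() -> Iterator[str]:
    """Source: yields raw text items."""
    yield "the quick brown fox jumps"
    yield "over the lazy dog"


def tokenize(texts: Iterable[str]) -> Iterator[list[int]]:
    """Toy tokenizer: maps each whitespace-separated word to an integer id."""
    vocab: dict[str, int] = {}
    for text in texts:
        yield [vocab.setdefault(word, len(vocab)) for word in text.split()]


def chunk(token_lists: Iterable[list[int]], seq_len: int) -> Iterator[list[int]]:
    """Concatenate token streams and cut them into fixed-length sequences."""
    buffer: list[int] = []
    for tokens in token_lists:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # A trailing partial sequence is dropped in this sketch.


def batch(sequences: Iterable[list[int]], batch_size: int) -> Iterator[list[list[int]]]:
    """Group sequences into batches, dropping a partial last batch."""
    current: list[list[int]] = []
    for sequence in sequences:
        current.append(sequence)
        if len(current) == batch_size:
            yield current
            current = []


# Transforms compose by wrapping one iterator in the next.
pipeline = batch(chunk(tokenize(source()), seq_len=2), batch_size=2)
batches = list(pipeline)
print(batches)
```

Each stage only depends on the iterator it wraps, so a stage can be replaced or reordered without touching the source.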

Checkpointing

We use PyTorch's Distributed Checkpoint (DCP) API for efficient, sharded saving/loading of large models.

  • Structure: Checkpoints are directories containing sharded tensor data and a metadata file.
  • Manager: CheckpointManager handles the complexity of saving model, optimizer, scheduler, and dataloader states.
  • Auto-Resume: The training loop automatically detects the latest checkpoint in the output directory and resumes from it.
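
As a rough picture of what auto-resume involves, the sketch below scans an output directory for the checkpoint with the highest iteration number. The `checkpoint_<iteration>` naming is an assumption inferred from the `checkpoint_00010000` path used in the evaluation example, and `find_latest_checkpoint` is a hypothetical helper, not the CheckpointManager API.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Assumed directory naming scheme: checkpoint_ followed by the iteration number.
_CHECKPOINT = re.compile(r"^checkpoint_(\d+)$")


def find_latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Return the checkpoint directory with the highest iteration, or None."""
    best: Optional[tuple[int, Path]] = None
    for entry in output_dir.iterdir():
        match = _CHECKPOINT.match(entry.name)
        if entry.is_dir() and match:
            iteration = int(match.group(1))
            if best is None or iteration > best[0]:
                best = (iteration, entry)
    return None if best is None else best[1]


# Demonstrate on a throwaway directory with three fake checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for iteration in (1000, 10000, 2000):
        (run_dir / f"checkpoint_{iteration:08d}").mkdir()
    latest = find_latest_checkpoint(run_dir)
    latest_name = latest.name if latest else None
    print(latest_name)
```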

LoadStrategy: For fine-tuning or experiments, you might want to load only parts of a checkpoint. The LoadStrategy class (optimus_dl/modules/checkpoint/load_strategy.py) controls this.

  • load_model (bool): Load model weights.
  • load_optimizer (bool): Load optimizer state.
  • load_scheduler (bool): Load learning rate scheduler state.
  • load_data_sources (bool): Load data source state (e.g., reader position).
  • load_dataloaders (bool): Load full dataloader state.
  • load_metrics (bool): Load accumulated metrics.
  • load_iteration (bool): Resume iteration count.
  • extra_ignore_keys (list): Specific keys to ignore in the checkpoint state dict.
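
The flags above can be read as a filter over the sections of a checkpoint. The sketch below mirrors the documented field names in a dataclass and applies them to a toy state dict. The defaults, the section names, and the treatment of `extra_ignore_keys` as top-level keys are assumptions made for illustration; the actual LoadStrategy class may behave differently.

```python
from dataclasses import dataclass, field


@dataclass
class LoadStrategySketch:
    """Illustrative mirror of the documented fields; the defaults are assumptions."""
    load_model: bool = True
    load_optimizer: bool = True
    load_scheduler: bool = True
    load_data_sources: bool = True
    load_dataloaders: bool = True
    load_metrics: bool = True
    load_iteration: bool = True
    extra_ignore_keys: list[str] = field(default_factory=list)


# Hypothetical mapping from checkpoint sections to the flag that gates them.
_SECTION_FLAGS = {
    "model": "load_model",
    "optimizer": "load_optimizer",
    "scheduler": "load_scheduler",
    "data_sources": "load_data_sources",
    "dataloaders": "load_dataloaders",
    "metrics": "load_metrics",
    "iteration": "load_iteration",
}


def select_state(state: dict, strategy: LoadStrategySketch) -> dict:
    """Keep only the checkpoint sections the strategy asks for."""
    selected = {}
    for section, value in state.items():
        flag = _SECTION_FLAGS.get(section)
        if flag is not None and not getattr(strategy, flag):
            continue
        if section in strategy.extra_ignore_keys:
            continue
        selected[section] = value
    return selected


# Fine-tuning: reuse the weights, start everything else fresh.
finetune = LoadStrategySketch(
    load_optimizer=False,
    load_scheduler=False,
    load_data_sources=False,
    load_dataloaders=False,
    load_metrics=False,
    load_iteration=False,
)
checkpoint = {"model": {"w": 1.0}, "optimizer": {"step": 500}, "iteration": 500}
finetune_state = select_state(checkpoint, finetune)
print(finetune_state)
```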

Advanced Usage

Model Transforms

Optimus-DL applies transformations to the model after initialization but before training. This is where distributed wrappers and compilation happen.

  • Config: model_transforms list in train.yaml.
  • Common Transforms:
    • ddp: Standard DistributedDataParallel.
    • fully_shard: PyTorch FSDP2 (Fully Sharded Data Parallel). Supports mixed precision, CPU offloading, and mesh sharding.
    • compile: torch.compile for graph optimization.
model_transforms:
  - _name: fully_shard
    mixed_precision:
      param_dtype: bfloat16
      reduce_dtype: float32
  - _name: compile

Evaluation with lm_eval

The framework integrates with the Language Model Evaluation Harness for standardized benchmarks.

  • Script: scripts/eval.py
  • Config: configs/eval/default.yaml
# Evaluate a checkpoint on HellaSwag and MMLU
# (the list override is quoted so the shell does not treat [...] as a glob pattern)
python scripts/eval.py \
    common.checkpoint_path=outputs/my-run/checkpoint_00010000 \
    'lm_eval.tasks=[hellaswag,mmlu]' \
    lm_eval.batch_size=8

A more advanced example, evaluating a pretrained Hugging Face model instead of a local checkpoint:

python scripts/eval.py --config-name quick_pretrained \
    common.checkpoint_path=null \
    ++common.model._name=preset_hfllama2 \
    ++common.model.hf_model_name=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    'lm_eval.tasks=[hellaswag,mmlu]' \
    lm_eval.batch_size=4

Serving Models

Optimus-DL provides a simple serving script for deploying trained models behind an OpenAI-compatible API endpoint.

  • Script: scripts/serve.py
  • Config: configs/serve/
# Serve a TinyLlama model
python scripts/serve.py --config-name=tinyllama

Make requests:

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}], "max_tokens": 100, "temperature": 0.01}'
curl -X POST http://127.0.0.1:8000/v1/completions -H "Content-Type: application/json" -d '{"prompt": "All:", "max_tokens": 50, "temperature": 0.01}'
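
The same chat request can be issued from Python using only the standard library. This is a minimal sketch: `build_chat_request` is a hypothetical helper, the host and port are taken from the curl commands above, and the shape of the response is not assumed beyond it being JSON.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, user_message: str, max_tokens: int = 100
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.01,
    }
    return urllib.request.Request(
        # Strip a trailing slash so the URL never contains "//v1".
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://127.0.0.1:8000/", "What is a registry pattern?")
print(request.full_url)

# With the server running, send the request and decode the JSON reply:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
```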

Project Structure

  • optimus_dl/: Main package source code.
    • core/: Fundamental utilities (logging, registry, device management).
    • modules/: Pluggable components (models, optimizers, data loaders).
    • recipe/: Orchestration logic (training loops, evaluation).
  • configs/: Hierarchical Hydra configuration files.
  • scripts/: Entry points.

Development

The project enforces strict code quality standards.

# Run tests
pytest

# Format code
black .
isort .
ruff check --fix .

License

MIT License.
