A clean, efficient framework for pretraining language models from scratch

GPT-Simple

Python 3.10+ | License: MIT

GPT-Simple handles the full LLM pretraining workflow — tokenization, streaming data loading, multi-GPU training, checkpointing, and inference — through a single YAML config and a small CLI. It ships with a modern GPT architecture ready to train out of the box.

Features

  • Single YAML config + CLI: init / tokenize / train / status / stop / validate / generate / batch-generate.
  • Multi-GPU out of the box: --nproc_per_node N launches torchrun automatically (Accelerate, bf16, torch.compile, gradient checkpointing).
  • Pretokenized streaming: memory-mapped .bin/.idx shards with sequence packing; a raw-JSONL fallback for quick experiments.
  • Deterministic stop/resume: walltime- and signal-aware checkpoints with topology-agnostic data cursors, so N short jobs equal one long job (every document seen exactly once, even if world_size / num_workers change between restarts).
  • Orchestrator-friendly: runs under SLURM, Kubernetes, or a local loop; templates in examples/orchestrators/.
  • Curriculum learning: phase-based mixing across named data buckets.
  • Modern architecture: pre-norm decoder with RoPE, RMSNorm, and a gated (SwiGLU) MLP; also expresses GQA/MQA, vanilla MLPs, and untied heads via config.
  • Python API: import gpt_simple; gpt_simple.train(config="config.yaml").
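
The stop/resume feature relies on a data cursor that does not depend on how many processes read the data. One way such a cursor could work is sketched below, assuming a single global document counter and round-robin assignment to ranks. The function names (rank_indices, advance, run_job) are hypothetical and this is not taken from gpt-simple's source.

```python
def rank_indices(cursor, rank, world_size, total_docs, rounds):
    """Document indices that `rank` reads over `rounds` rounds, starting at a global cursor."""
    indices = []
    for r in range(rounds):
        idx = cursor + r * world_size + rank  # round-robin over ranks
        if idx < total_docs:
            indices.append(idx)
    return indices


def advance(cursor, world_size, total_docs, rounds):
    """New global cursor after every rank has finished `rounds` rounds."""
    return min(total_docs, cursor + rounds * world_size)


def run_job(cursor, world_size, total_docs, rounds):
    """Simulate one job: return the documents it read and the cursor to checkpoint."""
    seen = []
    for rank in range(world_size):
        seen.extend(rank_indices(cursor, rank, world_size, total_docs, rounds))
    return sorted(seen), advance(cursor, world_size, total_docs, rounds)


if __name__ == "__main__":
    cursor, seen = 0, []
    # Three short jobs with different world sizes, chained through the saved cursor.
    for world_size, rounds in [(2, 3), (4, 2), (3, 2)]:
        docs, cursor = run_job(cursor, world_size, total_docs=20, rounds=rounds)
        seen.extend(docs)
    print(seen == list(range(20)))  # each document read exactly once
```

Because only the global cursor is checkpointed in this sketch, a restart with a different world_size continues from the same document rather than from a per-rank position.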

Installation

pip install gpt-simple-lm              # from PyPI
pip install "gpt-simple-lm[wandb]"     # optional: Weights & Biases logging
pip install "gpt-simple-lm[rich]"      # optional: prettier (rich-formatted) CLI output

The distribution is named gpt-simple-lm; you import it as gpt_simple and run the gpt-simple CLI. All released versions are published on PyPI.

From source (development):

git clone https://github.com/lb-off/gpt-simple
cd gpt-simple
pip install -e ".[dev]"

Quick start

1. Generate a config

gpt-simple init -o config.yaml
gpt-simple init --preset small -o config.yaml    # ~125M  (small | medium | large)

2. Pretokenize your data

gpt-simple tokenize \
  --input_dir ./raw_data \
  --output_dir ./data/tokenized \
  --tokenizer_path gpt2 \
  --max_length 2048 \
  --num_workers 8

Converts .jsonl/.txt into memory-mapped .bin/.idx shards. See the data pipeline guide.
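
To make the idea of a .bin/.idx shard concrete, here is a minimal sketch of one possible layout: token IDs stored as 16-bit unsigned integers in the .bin file and cumulative document offsets stored as 64-bit integers in the .idx file. This layout and the helper names (write_shard, read_doc) are assumptions for illustration; the actual on-disk format used by gpt-simple is described in its data pipeline guide and may differ.

```python
import array
import mmap
import os
import struct
import tempfile


def write_shard(prefix, docs):
    """Write token lists to prefix.bin (uint16 tokens) and prefix.idx (uint64 offsets)."""
    offsets = [0]
    with open(prefix + ".bin", "wb") as f:
        for doc in docs:
            f.write(array.array("H", doc).tobytes())  # native-endian uint16
            offsets.append(offsets[-1] + len(doc))
    with open(prefix + ".idx", "wb") as f:
        f.write(struct.pack(f"<{len(offsets)}Q", *offsets))


def read_doc(prefix, i):
    """Read document i by memory-mapping the .bin file and slicing its token range."""
    with open(prefix + ".idx", "rb") as f:
        raw = f.read()
    offsets = struct.unpack(f"<{len(raw) // 8}Q", raw)
    start, end = offsets[i], offsets[i + 1]
    with open(prefix + ".bin", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            tokens = array.array("H")
            tokens.frombytes(mm[start * 2:end * 2])  # 2 bytes per token
    return tokens.tolist()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "shard0")
        write_shard(prefix, [[1, 2, 3], [50256], [7, 8]])
        print(read_doc(prefix, 2))
```

A 16-bit token type is enough for the GPT-2 vocabulary (50,257 entries); a larger vocabulary would need a wider integer type.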

3. Train

gpt-simple train --config config.yaml                     # single GPU
gpt-simple train --config config.yaml --nproc_per_node 4  # 4 GPUs

# override any config value; start fresh with --force
gpt-simple train --config config.yaml --training.max_steps 5000 --force

See the training guide.
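
The dotted override syntax (--training.max_steps 5000) maps naturally onto a nested config. The sketch below shows one way such an override could be applied to a dictionary loaded from YAML. The helpers parse_value and apply_override are hypothetical, and rejecting unknown keys is an assumption about desirable behavior, not a statement of what the CLI does.

```python
import ast
import copy


def parse_value(text):
    """Turn a CLI string into an int/float where possible, else keep it as a string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_override(config, dotted_key, raw_value):
    """Return a copy of `config` with section.field replaced; unknown keys raise KeyError."""
    updated = copy.deepcopy(config)
    *sections, field = dotted_key.split(".")
    node = updated
    for name in sections:
        node = node[name]  # KeyError if the section does not exist
    if field not in node:
        raise KeyError(dotted_key)  # catch typos instead of silently adding a field
    node[field] = parse_value(raw_value)
    return updated


if __name__ == "__main__":
    config = {"training": {"max_steps": 1000, "output_dir": "./outputs"}}
    print(apply_override(config, "training.max_steps", "5000"))
```

Note that this simple parser leaves YAML-style booleans such as true as plain strings; a real implementation would need to handle them.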

4. Monitor and control

gpt-simple status                 # training progress
gpt-simple stop                   # graceful shutdown (saves a checkpoint)
gpt-simple stop --force           # immediate SIGKILL

5. Generate

gpt-simple generate --output-dir ./outputs --prompt "Once upon a time" --max-new-tokens 200

--output-dir auto-picks the latest checkpoint. For multi-model / multi-sampling batches and a --dry-run submission gate, use batch-generate — see the inference guide.

Long runs with stop/resume

The trainer targets clusters with a hard per-job wall-clock cap. With resume: auto (the default), re-running the same command resumes the latest checkpoint, and the trainer saves and exits cleanly before a walltime deadline or on SIGTERM/SIGUSR1 — so an orchestrator just re-queues the job.

gpt-simple train --config config.yaml   # resume is automatic on every restart
gpt-simple status
gpt-simple stop                          # or let walltime/SIGUSR1 do it

Templates: slurm_resume_chain.sh, kubernetes_job.yaml, local_loop.sh. See the checkpointing & resume and orchestration guides.
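
The save-and-exit behavior described above can be illustrated with a small loop that checks for a stop request at each step boundary. This is a sketch under stated assumptions, not the trainer's code: StopRequest, train_loop, and the margin_seconds parameter are hypothetical names, and the example relies on POSIX signals (SIGUSR1 is not available on Windows).

```python
import signal
import time


class StopRequest:
    """Tracks whether the training loop should checkpoint and exit."""

    def __init__(self, walltime_seconds=None, margin_seconds=0.0, clock=time.monotonic):
        self.clock = clock
        # Stop `margin_seconds` before the hard cap so the checkpoint can be written.
        self.deadline = (
            None if walltime_seconds is None
            else clock() + walltime_seconds - margin_seconds
        )
        self.reason = None

    def install(self, signals=(signal.SIGTERM, signal.SIGUSR1)):
        """Register handlers; must be called from the main thread."""
        for sig in signals:
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame):
        # Only record the request; the loop saves at the next step boundary.
        self.reason = signal.Signals(signum).name

    def should_stop(self):
        if self.reason is None and self.deadline is not None:
            if self.clock() >= self.deadline:
                self.reason = "walltime"
        return self.reason is not None


def train_loop(max_steps, stop, save_checkpoint, start_step=0):
    """Run until max_steps or a stop request; returns (last_step, reason)."""
    step = start_step
    while step < max_steps:
        step += 1  # placeholder for one optimizer step
        if stop.should_stop():
            save_checkpoint(step)
            return step, stop.reason
    save_checkpoint(step)
    return step, "finished"
```

In a real job, stop.install() would be called once at startup, and the orchestrator would re-queue the command whenever the returned reason is not "finished".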

Configuration

All settings live in one YAML file with four sections — model, data, optimizer, training:

model:
  n_embd: 768
  n_layer: 12
  n_head: 12
  n_positions: 2048

data:
  path: ./data/tokenized
  tokenizer: gpt2
  format: pretokenized       # pretokenized | jsonl
  max_length: 2048

optimizer:
  learning_rate: 3.0e-4
  warmup_steps: 100

training:
  per_device_batch_size: 4
  gradient_accumulation_steps: 4
  max_steps: 1000
  output_dir: ./outputs
  # wandb_project: my-project   # uncomment to enable W&B

gpt-simple init writes a fully commented template. Every field is documented in the configuration reference; curriculum learning is covered in the data pipeline guide.
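
As a worked example of how the batch settings in the sample config combine, the arithmetic below estimates the number of tokens consumed per optimizer step. It assumes every sequence is packed to the full max_length, that max_steps counts optimizer steps, and that each GPU processes its own batch (data parallelism); the function name is hypothetical.

```python
def tokens_per_optimizer_step(per_device_batch_size, gradient_accumulation_steps,
                              max_length, num_gpus=1):
    """Tokens consumed by one optimizer step under the assumptions stated above."""
    sequences = per_device_batch_size * gradient_accumulation_steps * num_gpus
    return sequences * max_length


single_gpu = tokens_per_optimizer_step(4, 4, 2048)
four_gpus = tokens_per_optimizer_step(4, 4, 2048, num_gpus=4)

print(single_gpu)          # 32768
print(four_gpus)           # 131072
print(single_gpu * 1000)   # 32768000 tokens over max_steps=1000 on one GPU
```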

Python API

import gpt_simple

result = gpt_simple.train(
    model=gpt_simple.ModelConfig(n_embd=768, n_layer=12, n_head=12),
    data=gpt_simple.DataConfig(path="./data/tokenized", tokenizer="gpt2"),
    optimizer=gpt_simple.OptimizerConfig(learning_rate=3e-4),
    training=gpt_simple.TrainingConfig(max_steps=1000, output_dir="./outputs"),
)
print(result.final_loss, result.total_tokens, result.checkpoint_path)

Or gpt_simple.train(config="config.yaml"); sub-configs passed explicitly override the matching section from the file.

Documentation

Full guides live in docs/.

Development

pip install -e ".[dev]"
pytest tests/
ruff check src/ tests/

License

MIT — see LICENSE.
