GPT-Simple

Python 3.10+ | License: MIT

A clean, efficient framework for pretraining language models from scratch.

GPT-Simple handles the full LLM pretraining workflow — tokenization, streaming data loading, multi-GPU training, checkpointing, and inference — through a single YAML config and a small CLI. It ships with a modern GPT architecture ready to train out of the box.

Features

  • Single YAML config + CLI: init / tokenize / train / status / stop / validate / generate / batch-generate.
  • Multi-GPU out of the box: --nproc_per_node N launches torchrun automatically (Accelerate, bf16, torch.compile, gradient checkpointing).
  • Pretokenized streaming: memory-mapped .bin/.idx shards with sequence packing; a raw-JSONL fallback for quick experiments.
  • Deterministic stop/resume: walltime- and signal-aware checkpoints with topology-agnostic data cursors, so N short jobs equal one long job (every document seen exactly once, even if world_size / num_workers change between restarts).
  • Orchestrator-friendly: runs under SLURM, Kubernetes, or a local loop; templates in examples/orchestrators/.
  • Curriculum learning: phase-based mixing across named data buckets.
  • Modern architecture: pre-norm decoder with RoPE, RMSNorm, and a gated (SwiGLU) MLP; also expresses GQA/MQA, vanilla MLPs, and untied heads via config.
  • Python API: import gpt_simple; gpt_simple.train(config="config.yaml").
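
The "N short jobs equal one long job" property depends on the data cursor not being tied to the process layout. The following is a minimal sketch of one way to get that property, assuming a single global document counter that is dealt out round-robin to ranks. The names rank_indices and run_job and the round-robin scheme are illustrative assumptions, not GPT-Simple's actual implementation.

```python
def rank_indices(cursor: int, world_size: int, rank: int, steps: int) -> list[int]:
    """Document indices that `rank` reads over `steps` global steps.

    Hypothetical scheme: each global step consumes `world_size` consecutive
    documents starting at the global `cursor`; rank r takes the r-th of them.
    """
    return [cursor + s * world_size + rank for s in range(steps)]


def run_job(cursor: int, world_size: int, steps: int) -> tuple[set[int], int]:
    """Simulate one job; return the documents seen and the cursor to checkpoint."""
    seen: set[int] = set()
    for rank in range(world_size):
        seen.update(rank_indices(cursor, world_size, rank, steps))
    return seen, cursor + steps * world_size


# Job 1 runs on 4 ranks for 3 steps; job 2 resumes on 2 ranks for 6 steps.
seen_a, cursor = run_job(cursor=0, world_size=4, steps=3)
seen_b, cursor = run_job(cursor=cursor, world_size=2, steps=6)

# Together the two jobs cover documents 0..23 with no overlap and no gaps.
assert seen_a.isdisjoint(seen_b)
assert seen_a | seen_b == set(range(24))
```

Because only the global cursor is checkpointed, the second job can pick up with a different world_size without skipping or repeating documents.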

Installation

pip install -e ".[dev]"     # from source (development)
pip install .               # core only
pip install ".[wandb]"      # optional: Weights & Biases logging
pip install ".[cli]"        # optional: rich-formatted CLI output

Quick start

1. Generate a config

gpt-simple init -o config.yaml
gpt-simple init --preset small -o config.yaml    # ~125M  (small | medium | large)

2. Pretokenize your data

gpt-simple tokenize \
  --input_dir ./raw_data \
  --output_dir ./data/tokenized \
  --tokenizer_path gpt2 \
  --max_length 2048 \
  --num_workers 8

Converts .jsonl/.txt into memory-mapped .bin/.idx shards. See the data pipeline guide.
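
To make the idea of packed, memory-mapped shards concrete, here is a small standard-library sketch. It assumes a layout in which the .bin file is a flat run of unsigned 16-bit token IDs and the .idx file holds the start offset of each packed row; the real on-disk format of GPT-Simple may differ, and EOS, pack, write_shard, and read_row are hypothetical names.

```python
import mmap
import tempfile
from array import array
from pathlib import Path

EOS = 0  # hypothetical end-of-document token ID


def pack(docs: list[list[int]], max_length: int) -> list[list[int]]:
    """Concatenate documents (EOS-separated) and cut into fixed-length rows."""
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS)
    n_rows = len(stream) // max_length  # the incomplete tail is dropped
    return [stream[i * max_length:(i + 1) * max_length] for i in range(n_rows)]


def write_shard(prefix: Path, rows: list[list[int]]) -> None:
    """Write token IDs to <prefix>.bin and row start offsets to <prefix>.idx."""
    tokens = array("H")   # unsigned 16-bit, enough for GPT-2's 50,257-token vocab
    offsets = array("Q")  # unsigned 64-bit start offset of each row, in tokens
    for row in rows:
        offsets.append(len(tokens))
        tokens.extend(row)
    prefix.with_suffix(".bin").write_bytes(tokens.tobytes())
    prefix.with_suffix(".idx").write_bytes(offsets.tobytes())


def read_row(prefix: Path, i: int, max_length: int) -> list[int]:
    """Read row i through a memory map instead of loading the whole shard."""
    offsets = array("Q")
    offsets.frombytes(prefix.with_suffix(".idx").read_bytes())
    row = array("H")
    with open(prefix.with_suffix(".bin"), "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = offsets[i] * row.itemsize  # byte offset of the row
            row.frombytes(mm[start:start + max_length * row.itemsize])
    return list(row)


docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
rows = pack(docs, max_length=4)

with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp) / "shard0"
    write_shard(prefix, rows)
    second = read_row(prefix, 1, max_length=4)
```

Packing lets a short document share a row with the start of the next one, so no positions are spent on padding; in this sketch the cost is that a document can be split across two rows.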

3. Train

gpt-simple train --config config.yaml                     # single GPU
gpt-simple train --config config.yaml --nproc_per_node 4  # 4 GPUs

# override any config value; start fresh with --force
gpt-simple train --config config.yaml --training.max_steps 5000 --force

See the training guide.
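
The dotted override form (--training.max_steps 5000) maps naturally onto a nested config. The sketch below shows one plausible way such overrides could be applied to a dictionary loaded from YAML; apply_overrides is a hypothetical helper written for illustration, not the CLI's actual parser.

```python
import ast
from copy import deepcopy


def apply_overrides(config: dict, argv: list[str]) -> dict:
    """Apply `--section.key value` pairs to a copy of a nested config dict."""
    if len(argv) % 2:
        raise ValueError("overrides must come in flag/value pairs")
    result = deepcopy(config)
    for flag, raw in zip(argv[::2], argv[1::2]):
        if not flag.startswith("--") or "." not in flag:
            raise ValueError(f"expected --section.key, got {flag!r}")
        *parents, leaf = flag[2:].split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        try:
            node[leaf] = ast.literal_eval(raw)  # "5000" -> 5000, "3e-4" -> 0.0003
        except (ValueError, SyntaxError):
            node[leaf] = raw  # plain strings such as paths are kept as-is
    return result


config = {"training": {"max_steps": 1000, "output_dir": "./outputs"}}
updated = apply_overrides(
    config, ["--training.max_steps", "5000", "--training.output_dir", "./run2"]
)
```

Working on a deep copy keeps the values loaded from the file intact, which makes it easy to log both the original and the effective configuration.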

4. Monitor and control

gpt-simple status                 # training progress
gpt-simple stop                   # graceful shutdown (saves a checkpoint)
gpt-simple stop --force           # immediate SIGKILL

5. Generate

gpt-simple generate --output-dir ./outputs --prompt "Once upon a time" --max-new-tokens 200

When given --output-dir, generate picks the latest checkpoint in that directory automatically. For multi-model or multi-sampling batches, and for a --dry-run submission gate, use batch-generate; see the inference guide.

Long runs with stop/resume

The trainer targets clusters with a hard per-job wall-clock cap. With resume: auto (the default), re-running the same command resumes from the latest checkpoint, and the trainer saves and exits cleanly before a walltime deadline or on SIGTERM/SIGUSR1, so an orchestrator only needs to re-queue the job.

gpt-simple train --config config.yaml   # resume is automatic on every restart
gpt-simple status
gpt-simple stop                          # or let walltime/SIGUSR1 do it

Templates: slurm_resume_chain.sh, kubernetes_job.yaml, local_loop.sh. See the checkpointing & resume and orchestration guides.
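
As an illustration of walltime- and signal-aware stopping, the sketch below has a handler that only sets a flag, while the loop checks the flag and a deadline at each step boundary. StopController, margin_s, and the toy train loop are hypothetical names for this example and say nothing about the trainer's internals; SIGUSR1 exists on POSIX systems only.

```python
import signal
import time


class StopController:
    """Tracks whether the training loop should checkpoint and exit."""

    def __init__(self, walltime_s: float, margin_s: float = 60.0) -> None:
        # Stop `margin_s` seconds early to leave time for the checkpoint write.
        self.deadline = time.monotonic() + walltime_s - margin_s
        self.signalled = False
        for sig in (signal.SIGTERM, signal.SIGUSR1):
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        # Only set a flag; the loop saves at the next step boundary.
        self.signalled = True

    def should_stop(self) -> bool:
        return self.signalled or time.monotonic() >= self.deadline


def train(max_steps: int, stop: StopController) -> int:
    """Toy loop that returns the step at which it stopped."""
    step = 0
    while step < max_steps:
        step += 1  # stand-in for one optimizer step
        if step == 3:
            # Simulate the scheduler's warning signal arriving mid-run.
            signal.raise_signal(signal.SIGUSR1)
        if stop.should_stop():
            # A real trainer would save a checkpoint here before exiting.
            break
    return step


stopped_at = train(max_steps=10, stop=StopController(walltime_s=3600))
```

Deferring the save to a step boundary avoids writing a checkpoint from inside a signal handler, where the model and optimizer state may be mid-update.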

Configuration

All settings live in one YAML file with four sections — model, data, optimizer, training:

model:
  n_embd: 768
  n_layer: 12
  n_head: 12
  n_positions: 2048

data:
  path: ./data/tokenized
  tokenizer: gpt2
  format: pretokenized       # pretokenized | jsonl
  max_length: 2048

optimizer:
  learning_rate: 3.0e-4
  warmup_steps: 100

training:
  per_device_batch_size: 4
  gradient_accumulation_steps: 4
  max_steps: 1000
  output_dir: ./outputs
  # wandb_project: my-project   # uncomment to enable W&B

gpt-simple init writes a fully commented template. Every field is documented in the configuration reference, and curriculum learning in the data pipeline guide.
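
A quick arithmetic check helps when choosing the batch settings above. The sketch assumes that per_device_batch_size counts sequences per device per micro-batch and that every sequence is packed to max_length tokens; tokens_per_step is a hypothetical helper, not part of the package.

```python
def tokens_per_step(
    per_device_batch_size: int,
    gradient_accumulation_steps: int,
    max_length: int,
    world_size: int = 1,
) -> int:
    """Tokens consumed by one optimizer step under the stated assumptions."""
    sequences = per_device_batch_size * gradient_accumulation_steps * world_size
    return sequences * max_length


# Values from the example config: batch 4, accumulation 4, length 2048.
single_gpu = tokens_per_step(4, 4, 2048)               # 32_768 tokens
four_gpus = tokens_per_step(4, 4, 2048, world_size=4)  # 131_072 tokens

# With max_steps: 1000 on one GPU, the run sees about 32.8M tokens.
total_tokens = single_gpu * 1000
```

Under these assumptions, going from 1 to 4 GPUs quadruples the tokens per optimizer step unless gradient_accumulation_steps is reduced to compensate.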

Python API

import gpt_simple

result = gpt_simple.train(
    model=gpt_simple.ModelConfig(n_embd=768, n_layer=12, n_head=12),
    data=gpt_simple.DataConfig(path="./data/tokenized", tokenizer="gpt2"),
    optimizer=gpt_simple.OptimizerConfig(learning_rate=3e-4),
    training=gpt_simple.TrainingConfig(max_steps=1000, output_dir="./outputs"),
)
print(result.final_loss, result.total_tokens, result.checkpoint_path)

Or gpt_simple.train(config="config.yaml"); sub-configs passed explicitly override the matching section from the file.

Documentation

Full guides live in docs/.

Development

pip install -e ".[dev]"
pytest tests/
ruff check src/ tests/

License

MIT — see LICENSE.
