GPT-Simple
A clean, efficient framework for pretraining language models from scratch.
GPT-Simple handles the full LLM pretraining workflow — tokenization, streaming data loading, multi-GPU training, checkpointing, and inference — through a single YAML config and a small CLI. It ships with a modern GPT architecture ready to train out of the box.
Features
- Single YAML config + CLI: init/tokenize/train/status/stop/validate/generate/batch-generate.
- Multi-GPU out of the box: --nproc_per_node N launches torchrun automatically (Accelerate, bf16, torch.compile, gradient checkpointing).
- Pretokenized streaming: memory-mapped .bin/.idx shards with sequence packing; a raw-JSONL fallback for quick experiments.
- Deterministic stop/resume: walltime- and signal-aware checkpoints with topology-agnostic data cursors, so N short jobs equal one long job (every document seen exactly once, even if world_size/num_workers change between restarts).
- Orchestrator-friendly: runs under SLURM, Kubernetes, or a local loop; templates in examples/orchestrators/.
- Curriculum learning: phase-based mixing across named data buckets.
- Modern architecture: pre-norm decoder with RoPE, RMSNorm, and a gated (SwiGLU) MLP; also expresses GQA/MQA, vanilla MLPs, and untied heads via config.
- Python API: import gpt_simple; gpt_simple.train(config="config.yaml").
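The "N short jobs equal one long job" property in the list above can be illustrated with a minimal sketch. It assumes the checkpoint stores one global position in a fixed document order and that each rank reads a strided slice from that position. The helpers `rank_indices` and `run_job` and the striding scheme are illustrative assumptions, not GPT-Simple's actual implementation.

```python
def rank_indices(cursor: int, steps: int, world_size: int, rank: int) -> list[int]:
    """Document indices one rank reads over `steps` steps, starting at a global cursor.

    Hypothetical scheme: each step consumes `world_size` documents in global
    order, and rank r takes the r-th document of every step.
    """
    return [cursor + step * world_size + rank for step in range(steps)]


def run_job(cursor: int, steps: int, world_size: int) -> tuple[list[int], int]:
    """Simulate one job; return the documents it consumed and the cursor to checkpoint."""
    seen = []
    for rank in range(world_size):
        seen.extend(rank_indices(cursor, steps, world_size, rank))
    return sorted(seen), cursor + steps * world_size


# One long job on 4 ranks: 6 steps consume documents 0..23.
long_run, _ = run_job(cursor=0, steps=6, world_size=4)

# Two short jobs, with a different world size after the restart.
first, cursor = run_job(cursor=0, steps=3, world_size=4)        # documents 0..11
second, cursor = run_job(cursor=cursor, steps=6, world_size=2)  # documents 12..23

assert first + second == long_run  # every document seen exactly once
```

Because the checkpointed cursor is a single global integer, nothing in it depends on how many ranks wrote it, which is what lets the second job restart on a different topology.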
Installation
pip install gpt-simple-lm # from PyPI
pip install "gpt-simple-lm[wandb]" # optional: Weights & Biases logging
pip install "gpt-simple-lm[rich]" # optional: prettier (rich-formatted) CLI output
The distribution is named gpt-simple-lm; you import it as gpt_simple and
run the gpt-simple CLI. All released versions are published on
PyPI.
From source (development):
git clone https://github.com/lb-off/gpt-simple
cd gpt-simple
pip install -e ".[dev]"
Quick start
1. Generate a config
gpt-simple init -o config.yaml
gpt-simple init --preset small -o config.yaml # ~125M (small | medium | large)
2. Pretokenize your data
gpt-simple tokenize \
  --input_dir ./raw_data \
  --output_dir ./data/tokenized \
  --tokenizer_path gpt2 \
  --max_length 2048 \
  --num_workers 8
Converts .jsonl/.txt into memory-mapped .bin/.idx shards. See the
data pipeline guide.
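To make the idea of memory-mapped shards concrete, here is a minimal sketch using only the standard library. The layout is an assumption chosen for illustration (little-endian uint16 token ids in the .bin file, uint64 document offsets in the .idx file); the real on-disk format is described in the data pipeline guide and may differ. `write_shard` and `read_doc` are hypothetical helpers.

```python
import mmap
import struct
import tempfile
from pathlib import Path


def write_shard(docs: list[list[int]], prefix: Path) -> None:
    """Write token ids as uint16 to <prefix>.bin and document offsets to <prefix>.idx."""
    offsets = [0]
    with open(prefix.with_suffix(".bin"), "wb") as bin_file:
        for doc in docs:
            bin_file.write(struct.pack(f"<{len(doc)}H", *doc))
            offsets.append(offsets[-1] + len(doc))
    with open(prefix.with_suffix(".idx"), "wb") as idx_file:
        idx_file.write(struct.pack(f"<{len(offsets)}Q", *offsets))


def read_doc(prefix: Path, i: int) -> list[int]:
    """Read document i through a memory map, decoding only the bytes it needs."""
    idx_bytes = prefix.with_suffix(".idx").read_bytes()
    start, end = struct.unpack_from("<2Q", idx_bytes, 8 * i)
    with open(prefix.with_suffix(".bin"), "rb") as bin_file:
        with mmap.mmap(bin_file.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(struct.unpack_from(f"<{end - start}H", mm, 2 * start))


with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp) / "shard_000"
    # Made-up token ids; uint16 is enough for a GPT-2-sized vocabulary (50257 entries).
    write_shard([[15496, 995], [40, 588, 50256]], prefix)
    second_doc = read_doc(prefix, 1)
```

The point of the index file is random access: a loader can jump to any document without scanning the token file, and the operating system pages in only the regions that are actually read.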
3. Train
gpt-simple train --config config.yaml # single GPU
gpt-simple train --config config.yaml --nproc_per_node 4 # 4 GPUs
# override any config value; start fresh with --force
gpt-simple train --config config.yaml --training.max_steps 5000 --force
See the training guide.
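The dotted override shown above (--training.max_steps 5000) maps naturally onto a nested config. The sketch below shows one way such an override could be applied to a dictionary loaded from YAML; `apply_override` is a hypothetical helper written for illustration and is not taken from the GPT-Simple source.

```python
import ast
from copy import deepcopy


def apply_override(config: dict, dotted_key: str, raw_value: str) -> dict:
    """Return a copy of `config` with `dotted_key` (e.g. "training.max_steps") set."""
    updated = deepcopy(config)
    *parents, leaf = dotted_key.split(".")
    node = updated
    for key in parents:
        node = node.setdefault(key, {})
    try:
        # Parse numeric and other Python literals: "5000" -> 5000, "3e-4" -> 0.0003.
        node[leaf] = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        # Anything that is not a literal stays a plain string, e.g. "./outputs".
        node[leaf] = raw_value
    return updated


config = {"training": {"max_steps": 1000, "output_dir": "./outputs"}}
overridden = apply_override(config, "training.max_steps", "5000")
```

Returning a copy keeps the file-derived config intact, so the original values remain available for logging or comparison.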
4. Monitor and control
gpt-simple status # training progress
gpt-simple stop # graceful shutdown (saves a checkpoint)
gpt-simple stop --force # immediate SIGKILL
5. Generate
gpt-simple generate --output-dir ./outputs --prompt "Once upon a time" --max-new-tokens 200
--output-dir auto-picks the latest checkpoint. For multi-model /
multi-sampling batches and a --dry-run submission gate, use
batch-generate — see the inference guide.
Long runs with stop/resume
The trainer targets clusters with a hard per-job wall-clock cap. With
resume: auto (the default), re-running the same command resumes the
latest checkpoint, and the trainer saves and exits cleanly before a
walltime deadline or on SIGTERM/SIGUSR1 — so an orchestrator just
re-queues the job.
gpt-simple train --config config.yaml # resume is automatic on every restart
gpt-simple status
gpt-simple stop # or let walltime/SIGUSR1 do it
Templates: slurm_resume_chain.sh,
kubernetes_job.yaml,
local_loop.sh. See the
checkpointing & resume and
orchestration guides.
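The save-and-exit behaviour described in this section can be sketched as a training loop that checks a stop flag at every step boundary. This is a minimal, Unix-only illustration under stated assumptions: the `StopRequest` class, its `safety_margin` parameter, and the placeholder `train` loop are hypothetical and do not reflect the trainer's real internals.

```python
import signal
import time


class StopRequest:
    """Records that the job should checkpoint and exit at the next step boundary."""

    def __init__(self, walltime_seconds: float, safety_margin: float = 60.0):
        # Stop early enough to leave time for writing the checkpoint.
        self.deadline = time.monotonic() + walltime_seconds - safety_margin
        self.signalled = False

    def install_handlers(self) -> None:
        """Register for SIGTERM/SIGUSR1 (must be called from the main thread)."""
        for sig in (signal.SIGTERM, signal.SIGUSR1):
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        self.signalled = True  # only set a flag; the loop does the real work

    def should_stop(self) -> bool:
        return self.signalled or time.monotonic() >= self.deadline


def train(max_steps: int, stop: StopRequest) -> int:
    """Run placeholder steps; return the step at which the loop exited."""
    step = 0
    while step < max_steps and not stop.should_stop():
        step += 1  # stand-in for one optimizer step
    # A real trainer would save a checkpoint here before returning.
    return step


stop = StopRequest(walltime_seconds=3600)
completed = train(max_steps=5, stop=stop)
```

In a real job, `stop.install_handlers()` would be called once at startup. Handling the signal by setting a flag, rather than saving inside the handler, keeps the checkpoint write at a step boundary where model and optimizer state are consistent.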
Configuration
All settings live in one YAML file with four sections — model, data,
optimizer, training:
model:
  n_embd: 768
  n_layer: 12
  n_head: 12
  n_positions: 2048
data:
  path: ./data/tokenized
  tokenizer: gpt2
  format: pretokenized # pretokenized | jsonl
  max_length: 2048
optimizer:
  learning_rate: 3.0e-4
  warmup_steps: 100
training:
  per_device_batch_size: 4
  gradient_accumulation_steps: 4
  max_steps: 1000
  output_dir: ./outputs
  # wandb_project: my-project # uncomment to enable W&B
gpt-simple init writes a fully commented template. Every field is
documented in the configuration reference; curriculum learning is
covered in the data pipeline guide.
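As a worked example of how the batch settings combine, the sketch below computes the number of tokens consumed per optimizer step for the values in the config above. It assumes every sequence is packed to the full max_length and that max_steps counts optimizer steps; neither assumption is confirmed here, so treat the numbers as an estimate. `tokens_per_step` is a hypothetical helper.

```python
def tokens_per_step(per_device_batch_size: int, gradient_accumulation_steps: int,
                    max_length: int, world_size: int = 1) -> int:
    """Tokens consumed by one optimizer step, assuming full-length packed sequences."""
    sequences = per_device_batch_size * gradient_accumulation_steps * world_size
    return sequences * max_length


single_gpu = tokens_per_step(4, 4, 2048)               # 4 * 4 * 1 * 2048 = 32768
four_gpus = tokens_per_step(4, 4, 2048, world_size=4)  # 4 * 4 * 4 * 2048 = 131072
total_tokens = single_gpu * 1000                       # 32768000 for max_steps: 1000
```

Under these assumptions, moving from one GPU to four quadruples the tokens per step, so the learning rate schedule and max_steps may need revisiting when the world size changes.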
Python API
import gpt_simple
result = gpt_simple.train(
    model=gpt_simple.ModelConfig(n_embd=768, n_layer=12, n_head=12),
    data=gpt_simple.DataConfig(path="./data/tokenized", tokenizer="gpt2"),
    optimizer=gpt_simple.OptimizerConfig(learning_rate=3e-4),
    training=gpt_simple.TrainingConfig(max_steps=1000, output_dir="./outputs"),
)
print(result.final_loss, result.total_tokens, result.checkpoint_path)
Alternatively, call gpt_simple.train(config="config.yaml"). Any sub-config
passed explicitly overrides the matching section of the file.
Documentation
Full guides live in docs/:
- Architecture — the built-in model.
- Configuration — every config field.
- Data pipeline — tokenization, packing, curriculum.
- Training — multi-GPU, precision, compile.
- Checkpointing & resume — the stop/resume model.
- Orchestration — running under any scheduler.
- Inference: generate/batch-generate.
- Hardware tuning: peak GPU throughput.
- Performance — measured 2.8B throughput and MFU.
Development
pip install -e ".[dev]"
pytest tests/
ruff check src/ tests/
License
MIT — see LICENSE.
File details
Details for the file gpt_simple_lm-0.1.1.tar.gz.
File metadata
- Download URL: gpt_simple_lm-0.1.1.tar.gz
- Upload date:
- Size: 171.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 16ff335b45c03c802bd1f24d03866e41b22aa40933cdef0f3fa935f77a629983 |
| MD5 | 7a1d9b8df59c59d042f892823823bc90 |
| BLAKE2b-256 | b2d45be2c5d8ef09ca0c7400f2f684f5666a13ed1889e23d6136a6b366325162 |
Provenance
The following attestation bundles were made for gpt_simple_lm-0.1.1.tar.gz:
Publisher: publish.yml on lb-off/gpt-simple
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gpt_simple_lm-0.1.1.tar.gz
- Subject digest: 16ff335b45c03c802bd1f24d03866e41b22aa40933cdef0f3fa935f77a629983
- Sigstore transparency entry: 1924995232
- Sigstore integration time:
- Permalink: lb-off/gpt-simple@9b0094a2a57e132b7f44d3fd139ad02fc6a58561
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/lb-off
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9b0094a2a57e132b7f44d3fd139ad02fc6a58561
- Trigger Event: release
File details
Details for the file gpt_simple_lm-0.1.1-py3-none-any.whl.
File metadata
- Download URL: gpt_simple_lm-0.1.1-py3-none-any.whl
- Upload date:
- Size: 133.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 19e59eda6141b7a8922f4d4ba313f091bdcce536371e066cceb90d39ce47335e |
| MD5 | 8ff06b82626e691c5e7cfe43eb3e8182 |
| BLAKE2b-256 | fd3fc5d51d696593100b40135ac8448555067d0c3e63b72975331800fa7ad2d2 |
Provenance
The following attestation bundles were made for gpt_simple_lm-0.1.1-py3-none-any.whl:
Publisher: publish.yml on lb-off/gpt-simple
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gpt_simple_lm-0.1.1-py3-none-any.whl
- Subject digest: 19e59eda6141b7a8922f4d4ba313f091bdcce536371e066cceb90d39ce47335e
- Sigstore transparency entry: 1924995353
- Sigstore integration time:
- Permalink: lb-off/gpt-simple@9b0094a2a57e132b7f44d3fd139ad02fc6a58561
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/lb-off
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9b0094a2a57e132b7f44d3fd139ad02fc6a58561
- Trigger Event: release