A clean, efficient framework for pretraining language models from scratch

Maintainers

lb-off

GPT-Simple

A clean, efficient framework for pretraining language models from scratch.

GPT-Simple handles the full LLM pretraining workflow — tokenization, streaming data loading, multi-GPU training, checkpointing, and inference — through a single YAML config and a small CLI. It ships with a modern GPT architecture ready to train out of the box.

Features

Single YAML config + CLI — init / tokenize / train / status / stop / validate / generate / batch-generate.
Multi-GPU out of the box — --nproc_per_node N launches torchrun automatically (Accelerate, bf16, torch.compile, gradient checkpointing).
Pretokenized streaming — memory-mapped .bin/.idx shards with sequence packing; a raw-JSONL fallback for quick experiments.
Deterministic stop/resume — walltime- and signal-aware checkpoints with topology-agnostic data cursors, so N short jobs equal one long job (every document seen exactly once, even if world_size / num_workers change between restarts).
Orchestrator-friendly — runs under SLURM, Kubernetes, or a local loop; templates in examples/orchestrators/.
Curriculum learning — phase-based mixing across named data buckets.
Modern architecture — pre-norm decoder with RoPE, RMSNorm, and a gated (SwiGLU) MLP; also expresses GQA/MQA, vanilla MLPs, and untied heads via config.
Python API — import gpt_simple; gpt_simple.train(config="config.yaml").
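
The stop/resume feature above depends on a data cursor that stays valid when world_size changes. As a rough illustration only (this is not GPT-Simple's implementation; the names rank_indices and run_job and the round-robin layout are assumptions), one way to get that property is to checkpoint a single global sample index and let every rank derive its own samples from it:

```python
from typing import List, Tuple


def rank_indices(cursor: int, rank: int, world_size: int, steps: int) -> List[int]:
    """Global sample indices that `rank` reads during `steps` steps.

    Assumed layout: step s consumes the contiguous block
    [cursor + s * world_size, cursor + (s + 1) * world_size), one sample per rank.
    """
    return [cursor + s * world_size + rank for s in range(steps)]


def run_job(cursor: int, world_size: int, steps: int) -> Tuple[int, List[int]]:
    """Simulate one job; return the cursor to checkpoint and the samples it read."""
    seen: List[int] = []
    for rank in range(world_size):
        seen.extend(rank_indices(cursor, rank, world_size, steps))
    return cursor + steps * world_size, seen


if __name__ == "__main__":
    cursor, first = run_job(cursor=0, world_size=4, steps=3)  # reads samples 0..11
    cursor, second = run_job(cursor, world_size=3, steps=4)   # reads samples 12..23
    # No sample is skipped or repeated even though world_size changed.
    assert sorted(first + second) == list(range(24))
    print("next cursor:", cursor)
```

Because only the global index is saved, the checkpoint carries no per-rank state. A real loader would also have to fold num_workers and shuffling into the same index, which this sketch leaves out.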

Installation

pip install -e ".[dev]"     # from source (development)
pip install .               # core only
pip install ".[wandb]"      # optional: Weights & Biases logging
pip install ".[cli]"        # optional: rich-formatted CLI output

Quick start

1. Generate a config

gpt-simple init -o config.yaml
gpt-simple init --preset small -o config.yaml    # ~125M  (small | medium | large)

2. Pretokenize your data

gpt-simple tokenize \
  --input_dir ./raw_data \
  --output_dir ./data/tokenized \
  --tokenizer_path gpt2 \
  --max_length 2048 \
  --num_workers 8

Converts .jsonl/.txt into memory-mapped .bin/.idx shards. See the data pipeline guide.
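
To make the .bin/.idx idea concrete, here is a small stdlib-only sketch of a memory-mapped token shard plus sequence packing. The on-disk layout is an assumption made for illustration (native-endian uint16 token ids in .bin, uint64 document offsets in .idx); GPT-Simple's real format may differ, and write_shard, read_doc, and pack are hypothetical helper names.

```python
import mmap
import os
import tempfile
from array import array
from typing import List, Sequence


def write_shard(prefix: str, docs: Sequence[Sequence[int]]) -> None:
    """Write docs to <prefix>.bin (uint16 token ids) and <prefix>.idx (uint64 offsets)."""
    offsets = array("Q", [0])
    with open(prefix + ".bin", "wb") as f:
        for doc in docs:
            array("H", doc).tofile(f)
            offsets.append(offsets[-1] + len(doc))
    with open(prefix + ".idx", "wb") as f:
        offsets.tofile(f)


def read_doc(prefix: str, i: int) -> List[int]:
    """Return document i by slicing the memory-mapped .bin file."""
    offsets = array("Q")
    with open(prefix + ".idx", "rb") as f:
        offsets.frombytes(f.read())
    start, end = offsets[i], offsets[i + 1]
    with open(prefix + ".bin", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Views are released in reverse order before the mapping is closed.
            with memoryview(mm) as raw, raw.cast("H") as tokens, tokens[start:end] as doc:
                return doc.tolist()


def pack(docs: Sequence[Sequence[int]], seq_len: int, eos: int) -> List[List[int]]:
    """Concatenate docs (each followed by eos) and cut the stream into full rows.

    A trailing partial row is dropped in this sketch.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos)
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]


if __name__ == "__main__":
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "shard0")
        write_shard(prefix, docs)
        print(read_doc(prefix, 1))
    print(pack(docs, seq_len=4, eos=0))
```

uint16 is enough for the gpt2 vocabulary (50,257 ids); a larger vocabulary would need a wider token type.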

3. Train

gpt-simple train --config config.yaml                     # single GPU
gpt-simple train --config config.yaml --nproc_per_node 4  # 4 GPUs

# override any config value; start fresh with --force
gpt-simple train --config config.yaml --training.max_steps 5000 --force

See the training guide.
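
The dotted override syntax (--training.max_steps 5000) maps naturally onto the nested YAML sections. The following is only a sketch of how such overrides could be applied to a loaded config; apply_overrides and parse_value are hypothetical names, and the real CLI may parse values differently (for example with YAML rules rather than Python literals).

```python
import ast
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Interpret '5000' as int and '3e-4' as float; leave other text as a string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_overrides(config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Apply '--section.key value' pairs from argv to a nested dict, in place."""
    args = iter(argv)
    for flag in args:
        if not flag.startswith("--") or "." not in flag:
            continue  # plain flags such as --force are handled elsewhere
        raw = next(args, None)
        if raw is None:
            raise ValueError(f"missing value for {flag}")
        keys = flag[2:].split(".")
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = parse_value(raw)
    return config


if __name__ == "__main__":
    cfg = {"training": {"max_steps": 1000, "output_dir": "./outputs"}}
    apply_overrides(cfg, ["--training.max_steps", "5000", "--force"])
    print(cfg["training"]["max_steps"])
```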

4. Monitor and control

gpt-simple status                 # training progress
gpt-simple stop                   # graceful shutdown (saves a checkpoint)
gpt-simple stop --force           # immediate SIGKILL

5. Generate

gpt-simple generate --output-dir ./outputs --prompt "Once upon a time" --max-new-tokens 200

Given --output-dir, generate picks the latest checkpoint in that directory automatically. For multi-model or multi-sampling batches, and for a --dry-run submission gate, use batch-generate; see the inference guide.

Long runs with stop/resume

The trainer targets clusters with a hard per-job wall-clock cap. With resume: auto (the default), re-running the same command resumes from the latest checkpoint. The trainer also saves a checkpoint and exits cleanly before a walltime deadline or on SIGTERM/SIGUSR1, so an orchestrator only has to re-queue the job.

gpt-simple train --config config.yaml   # resume is automatic on every restart
gpt-simple status
gpt-simple stop                          # or let walltime/SIGUSR1 do it

Templates: slurm_resume_chain.sh, kubernetes_job.yaml, local_loop.sh. See the checkpointing & resume and orchestration guides.
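
The save-and-exit behaviour described in this section can be pictured as a training loop that checks a stop condition after every step. The sketch below assumes a hypothetical StopController class and a safety margin before the deadline; it illustrates the general pattern with Python's signal module and is not GPT-Simple's trainer.

```python
import signal
import time
from typing import Callable


class StopController:
    """Tracks a walltime deadline and stop requests delivered by signal."""

    def __init__(self, walltime_s: float, margin_s: float = 60.0) -> None:
        # Stop `margin_s` seconds early so the checkpoint has time to be written.
        self.deadline = time.monotonic() + walltime_s - margin_s
        self.requested = False

    def install_handlers(self) -> None:
        """Register signal handlers; must be called from the main thread."""
        for name in ("SIGTERM", "SIGUSR1"):
            sig = getattr(signal, name, None)  # SIGUSR1 does not exist on Windows
            if sig is not None:
                signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame) -> None:
        # Only set a flag here; the loop saves at the next step boundary.
        self.requested = True

    def should_stop(self) -> bool:
        return self.requested or time.monotonic() >= self.deadline


def train(max_steps: int, stop: StopController,
          step_fn: Callable[[int], None], save_fn: Callable[[int], None]) -> int:
    """Run up to max_steps; checkpoint and return early when asked to stop."""
    step = 0
    while step < max_steps:
        step_fn(step)
        step += 1
        if stop.should_stop():
            break
    save_fn(step)  # saved on early exit and on normal completion alike
    return step


if __name__ == "__main__":
    controller = StopController(walltime_s=3600.0)
    controller.install_handlers()
    done = train(5, controller, lambda s: None, lambda s: print("checkpoint at step", s))
    print("finished at step", done)
```

Setting a flag in the handler and saving at a step boundary keeps the checkpoint consistent: the optimizer state is never written halfway through an update.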

Configuration

All settings live in one YAML file with four sections — model, data, optimizer, training:

model:
  n_embd: 768
  n_layer: 12
  n_head: 12
  n_positions: 2048

data:
  path: ./data/tokenized
  tokenizer: gpt2
  format: pretokenized       # pretokenized | jsonl
  max_length: 2048

optimizer:
  learning_rate: 3.0e-4
  warmup_steps: 100

training:
  per_device_batch_size: 4
  gradient_accumulation_steps: 4
  max_steps: 1000
  output_dir: ./outputs
  # wandb_project: my-project   # uncomment to enable W&B

gpt-simple init writes a fully commented template. Every field is documented in the configuration reference; curriculum learning is covered in the data pipeline guide.
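
As a quick sanity check on the example config above, the token budget of a run can be estimated from four numbers. This is standard data-parallel arithmetic, not a GPT-Simple API: it assumes every sequence is packed to max_length and that the global batch grows with the number of GPUs. tokens_per_step is a hypothetical helper.

```python
def tokens_per_step(per_device_batch_size: int, gradient_accumulation_steps: int,
                    world_size: int, max_length: int) -> int:
    """Tokens consumed by one optimizer step under the assumptions stated above."""
    return per_device_batch_size * gradient_accumulation_steps * world_size * max_length


if __name__ == "__main__":
    max_steps = 1000
    for gpus in (1, 4):
        per_step = tokens_per_step(4, 4, gpus, 2048)
        print(gpus, "GPU(s):", per_step, "tokens/step,", per_step * max_steps, "tokens total")
    # 1 GPU(s): 32768 tokens/step, 32768000 tokens total
    # 4 GPU(s): 131072 tokens/step, 131072000 tokens total
```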

Python API

import gpt_simple

result = gpt_simple.train(
    model=gpt_simple.ModelConfig(n_embd=768, n_layer=12, n_head=12),
    data=gpt_simple.DataConfig(path="./data/tokenized", tokenizer="gpt2"),
    optimizer=gpt_simple.OptimizerConfig(learning_rate=3e-4),
    training=gpt_simple.TrainingConfig(max_steps=1000, output_dir="./outputs"),
)
print(result.final_loss, result.total_tokens, result.checkpoint_path)

Alternatively, call gpt_simple.train(config="config.yaml"); any sub-config passed explicitly overrides the matching section of the file.
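
The precedence between a config file and explicit sub-configs can be sketched with plain dictionaries. The text does not say whether an explicit sub-config replaces the whole section or is merged field by field; the sketch below assumes whole-section replacement, and resolve_config is a hypothetical name rather than part of the gpt_simple API.

```python
from typing import Any, Dict, Optional

SECTIONS = ("model", "data", "optimizer", "training")


def resolve_config(file_cfg: Dict[str, Any],
                   **explicit: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a new config where each explicit section replaces the file's section."""
    unknown = set(explicit) - set(SECTIONS)
    if unknown:
        raise ValueError(f"unknown sections: {sorted(unknown)}")
    resolved = dict(file_cfg)  # shallow copy; the file config is left untouched
    for name, section in explicit.items():
        if section is not None:
            resolved[name] = section
    return resolved


if __name__ == "__main__":
    from_file = {
        "model": {"n_embd": 768, "n_layer": 12, "n_head": 12},
        "training": {"max_steps": 1000, "output_dir": "./outputs"},
    }
    merged = resolve_config(from_file, training={"max_steps": 5000, "output_dir": "./outputs"})
    print(merged["training"]["max_steps"], merged["model"]["n_layer"])
```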

Documentation

Full guides live in docs/:

Architecture — the built-in model.
Configuration — every config field.
Data pipeline — tokenization, packing, curriculum.
Training — multi-GPU, precision, compile.
Checkpointing & resume — the stop/resume model.
Orchestration — running under any scheduler.
Inference — generate / batch-generate.
Hardware tuning — peak GPU throughput.
Performance — measured 2.8B throughput and MFU.

Development

pip install -e ".[dev]"
pytest tests/
ruff check src/ tests/

License

MIT — see LICENSE.

Release history

0.1.2: Jun 23, 2026
0.1.1: Jun 23, 2026
0.1.0 (this version): Jun 23, 2026

Download files

Source Distribution

gpt_simple_lm-0.1.0.tar.gz (171.0 kB), uploaded Jun 23, 2026, Source

Built Distribution

gpt_simple_lm-0.1.0-py3-none-any.whl (133.3 kB), uploaded Jun 23, 2026, Python 3

File details

Details for the file gpt_simple_lm-0.1.0.tar.gz.

File metadata

Download URL: gpt_simple_lm-0.1.0.tar.gz
Upload date: Jun 23, 2026
Size: 171.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gpt_simple_lm-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8c5d0ee0e0d172871435ed5a7fce940f7acefbadd158635480735d9f904921f4`
MD5	`7e9bf7099cacf486dddb3fbcd9ab392e`
BLAKE2b-256	`89dc16af9beae32b32d96f46e42d8f4ba6434e36be69f0cb0ee8f4683f17b710`

Provenance

The following attestation bundles were made for gpt_simple_lm-0.1.0.tar.gz:

Publisher: publish.yml on lb-off/gpt-simple

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gpt_simple_lm-0.1.0.tar.gz
- Subject digest: 8c5d0ee0e0d172871435ed5a7fce940f7acefbadd158635480735d9f904921f4
- Sigstore transparency entry: 1924510042
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: lb-off/gpt-simple@dd2d4dfc8a7e7aed6b4ed60a27983e4b79de4f38
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/lb-off
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@dd2d4dfc8a7e7aed6b4ed60a27983e4b79de4f38
- Trigger Event: release

File details

Details for the file gpt_simple_lm-0.1.0-py3-none-any.whl.

File metadata

Download URL: gpt_simple_lm-0.1.0-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 133.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gpt_simple_lm-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0772124cf0a9c3822c941dc0a258f0150c4712933ece88770cbd0f4c6e862a19`
MD5	`446d1bdfee49bb57e3c6d5b6f3956edb`
BLAKE2b-256	`3aface7d776cf5cf4e169d714941bafdb3b50e2b143560e1e0948715e2d033b4`

Provenance

The following attestation bundles were made for gpt_simple_lm-0.1.0-py3-none-any.whl:

Publisher: publish.yml on lb-off/gpt-simple

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gpt_simple_lm-0.1.0-py3-none-any.whl
- Subject digest: 0772124cf0a9c3822c941dc0a258f0150c4712933ece88770cbd0f4c6e862a19
- Sigstore transparency entry: 1924510192
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: lb-off/gpt-simple@dd2d4dfc8a7e7aed6b4ed60a27983e4b79de4f38
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/lb-off
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@dd2d4dfc8a7e7aed6b4ed60a27983e4b79de4f38
- Trigger Event: release

gpt-simple-lm 0.1.0
