Toy GPT next-token prediction using a unigram model.
Toy-GPT: train-100-unigram
This repository demonstrates, at very small scale, how a language model is trained.
This repository is part of a series of toy training repositories plus a companion client repository:
- Training repositories produce pretrained artifacts (vocabulary, weights, metadata).
- A web app loads the artifacts and provides an interactive prompt.
Contents
- a small, declared text corpus
- a tokenizer and vocabulary builder
- a simple next-token prediction model
- a repeatable training loop
- committed, inspectable artifacts for downstream use
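The pieces listed above can be sketched in a few lines. The following is a minimal illustration, assuming a whitespace tokenizer and a counting-based unigram model; the function names (`tokenize`, `build_vocab`, `train_unigram`, `predict_next`) and the sample corpus are hypothetical and are not taken from this repository's modules, whose training loop may work differently.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct token to an integer id, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


def train_unigram(tokens: list[str], vocab: dict[str, int]) -> list[float]:
    """Estimate P(token) from relative frequency; index i matches vocab id i."""
    counts = Counter(tokens)
    total = len(tokens)
    probs = [0.0] * len(vocab)
    for tok, idx in vocab.items():
        probs[idx] = counts[tok] / total
    return probs


def predict_next(probs: list[float], vocab: dict[str, int]) -> str:
    """A unigram model ignores context, so it always returns the most frequent token."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    best_id = max(range(len(probs)), key=lambda i: probs[i])
    return id_to_token[best_id]


# Hypothetical toy corpus, not the corpus declared in SE_MANIFEST.toml.
corpus = "the cat sat on the mat the end"
tokens = tokenize(corpus)
vocab = build_vocab(tokens)
probs = train_unigram(tokens, vocab)
print(predict_next(probs, vocab))
```

Because a unigram model has no notion of context, the prediction is the same regardless of the prompt, which makes it a useful baseline before moving to context-aware models.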
Scope
This is:
- an intentionally inspectable training pipeline
- a next-token predictor trained on an explicit corpus
This is not:
- a production system
- a full Transformer implementation
- a chat interface
- a claim of semantic understanding
Outputs
This repository produces and commits pretrained artifacts under artifacts/.
Training logs and evidence are written under outputs/ (for example, outputs/train_log.csv).
Quick start
See SETUP.md for full setup and workflow instructions.
Run the full training script:
uv run python src/toy_gpt_train/d_train.py
Run the modules individually:
- a_tokenizer, b_vocab, and c_model are demos (each can be run on its own)
- d_train produces the artifacts
- e_infer consumes the artifacts
uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py
Provenance and Purpose
The primary corpus used for training is declared in SE_MANIFEST.toml.
This repository commits pretrained artifacts so the client can run without retraining.
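As a rough illustration of how committed artifacts let a client skip retraining, the sketch below writes a vocabulary and weights to JSON and reads them back. The file names (`vocab.json`, `weights.json`), the JSON layout, and the helper names are assumptions made for this example; the actual files under artifacts/ may be named and structured differently.

```python
import json
import tempfile
from pathlib import Path


def save_artifacts(directory: Path, vocab: dict[str, int], weights: list[float]) -> None:
    """Write the vocabulary and weights as human-readable JSON files."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "vocab.json").write_text(json.dumps(vocab, indent=2), encoding="utf-8")
    (directory / "weights.json").write_text(json.dumps(weights), encoding="utf-8")


def load_artifacts(directory: Path) -> tuple[dict[str, int], list[float]]:
    """Read back what save_artifacts wrote; no training is needed."""
    vocab = json.loads((directory / "vocab.json").read_text(encoding="utf-8"))
    weights = json.loads((directory / "weights.json").read_text(encoding="utf-8"))
    return vocab, weights


# Round trip in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    artifacts_dir = Path(tmp) / "artifacts"
    save_artifacts(artifacts_dir, {"cat": 0, "the": 1}, [0.25, 0.75])
    loaded_vocab, loaded_weights = load_artifacts(artifacts_dir)

print(loaded_vocab, loaded_weights)
```

Plain JSON keeps the artifacts inspectable in a text editor or a diff, which fits the stated goal of committed, inspectable outputs.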
Annotations
ANNOTATIONS.md - REQ/WHY/OBS annotations used
Citation
License
SE Manifest
SE_MANIFEST.toml - project intent, scope, and role
File details
Details for the file toy_gpt_train_100_unigram-0.9.8.tar.gz.
File metadata
- Download URL: toy_gpt_train_100_unigram-0.9.8.tar.gz
- Upload date:
- Size: 84.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 49700c5caaa04f03efa16c90d32a962ecc0bbc1b808583ef5ea6abb31dbebb16 |
| MD5 | b5380c68d971e9da3fd242075d34c40b |
| BLAKE2b-256 | ea7b8a5972b2899493b6fafb66fd1f136953072f5f3c1f30235af2245f2b97bc |
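A downloaded file can be checked against a published SHA256 digest with Python's standard library. The sketch below is one possible way to do it; the helper names (`sha256_of_file`, `matches_expected`) are hypothetical, and the demo hashes a throwaway file rather than the real distribution.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demo on a temporary file; the expected value is computed from the same bytes.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"toy-gpt")
    sample_digest = sha256_of_file(sample)
    sample_ok = matches_expected(sample, hashlib.sha256(b"toy-gpt").hexdigest())

print(sample_ok)
```

To check the source distribution, one would call `matches_expected` with the path to the downloaded toy_gpt_train_100_unigram-0.9.8.tar.gz and the SHA256 value from the table above.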
Provenance
The following attestation bundles were made for toy_gpt_train_100_unigram-0.9.8.tar.gz:
Publisher: release-pypi-mkdocs-shared.yml on toy-gpt/train-100-unigram
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: toy_gpt_train_100_unigram-0.9.8.tar.gz
- Subject digest: 49700c5caaa04f03efa16c90d32a962ecc0bbc1b808583ef5ea6abb31dbebb16
- Sigstore transparency entry: 1221983439
- Sigstore integration time:
- Permalink: toy-gpt/train-100-unigram@407c3cd61d601ac84f500db6fe8649d80deab183
- Branch / Tag: refs/tags/v0.9.8
- Owner: https://github.com/toy-gpt
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-pypi-mkdocs-shared.yml@407c3cd61d601ac84f500db6fe8649d80deab183
- Trigger Event: push
File details
Details for the file toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl.
File metadata
- Download URL: toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl
- Upload date:
- Size: 24.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 09890c4c9bd3f8d4718417e30f5e429cada8513168fc911e808076401f7ffaf7 |
| MD5 | 62d6af61e1af98634dc8bd62241fc31c |
| BLAKE2b-256 | 4bb951cfdfaa9668213bdf7aedf02ca1a4804f4fae75e74cc5b6b173968afbe6 |
Provenance
The following attestation bundles were made for toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl:
Publisher: release-pypi-mkdocs-shared.yml on toy-gpt/train-100-unigram
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl
- Subject digest: 09890c4c9bd3f8d4718417e30f5e429cada8513168fc911e808076401f7ffaf7
- Sigstore transparency entry: 1221983441
- Sigstore integration time:
- Permalink: toy-gpt/train-100-unigram@407c3cd61d601ac84f500db6fe8649d80deab183
- Branch / Tag: refs/tags/v0.9.8
- Owner: https://github.com/toy-gpt
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-pypi-mkdocs-shared.yml@407c3cd61d601ac84f500db6fe8649d80deab183
- Trigger Event: push