Toy GPT next-token prediction using a unigram model.

Maintainers

denisecase

Project description

Toy-GPT: train-100-unigram

Demonstrates, at very small scale, how a language model is trained.

This repository is part of a series of toy training repositories plus a companion client repository:

Training repositories produce pretrained artifacts (vocabulary, weights, metadata).
A web app loads the artifacts and provides an interactive prompt.

Contents

a small, declared text corpus
a tokenizer and vocabulary builder
a simple next-token prediction model
a repeatable training loop
committed, inspectable artifacts for downstream use
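
The components listed above can be illustrated with a minimal sketch. It is not the repository's code: the function names, the whitespace tokenizer, and the sample corpus are all hypothetical, and it assumes the standard definition of a unigram model, in which the next token is predicted from overall token frequency with no context. It also estimates probabilities by counting rather than by running a training loop.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct token to an integer id; sorting keeps ids repeatable."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


def fit_unigram(tokens: list[str], vocab: dict[str, int]) -> list[float]:
    """Estimate P(token) as relative frequency; the context is ignored."""
    counts = Counter(tokens)
    total = len(tokens)
    # Dict iteration order matches insertion order, which is id order here.
    return [counts[tok] / total for tok in vocab]


def predict_next(probs: list[float], vocab: dict[str, int]) -> str:
    """Return the most probable token, regardless of what came before."""
    id_to_token = {i: tok for tok, i in vocab.items()}
    best_id = max(range(len(probs)), key=lambda i: probs[i])
    return id_to_token[best_id]


# Hypothetical corpus, for illustration only.
corpus = "the cat sat on the mat the dog sat"
tokens = tokenize(corpus)
vocab = build_vocab(tokens)
probs = fit_unigram(tokens, vocab)
print(predict_next(probs, vocab))  # -> the
```

Because a unigram model has no context, it returns the same prediction for every prompt. That makes it a useful baseline for comparing models that do use context.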

Scope

This is:

an intentionally inspectable training pipeline
a next-token predictor trained on an explicit corpus

This is not:

a production system
a full Transformer implementation
a chat interface
a claim of semantic understanding

Outputs

This repository produces and commits pretrained artifacts under artifacts/.

Training logs and evidence are written under outputs/ (for example, outputs/train_log.csv).
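
As a rough illustration of how a training run might produce both kinds of output, the sketch below fits context-free logits by gradient descent and writes a weights file and a CSV log. Only the `artifacts/` and `outputs/train_log.csv` names come from this page. The `weights.json` file name, the `epoch` and `loss` columns, the hyperparameters, and the token ids are assumptions, and the files are written to a temporary directory rather than the repository.

```python
import csv
import json
import math
import tempfile
from pathlib import Path


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(token_ids: list[int], vocab_size: int, epochs: int = 50, lr: float = 0.5):
    """Fit one logit per token by gradient descent on mean cross-entropy."""
    n = len(token_ids)
    freq = [0.0] * vocab_size
    for t in token_ids:
        freq[t] += 1.0 / n  # empirical token distribution (the training target)
    logits = [0.0] * vocab_size
    log_rows = []
    for epoch in range(1, epochs + 1):
        probs = softmax(logits)
        loss = -sum(f * math.log(p) for f, p in zip(freq, probs) if f > 0)
        # For softmax with cross-entropy, d(loss)/d(logit_i) = probs_i - freq_i.
        logits = [w - lr * (p - f) for w, p, f in zip(logits, probs, freq)]
        log_rows.append({"epoch": epoch, "loss": round(loss, 6)})
    return logits, log_rows


# Hypothetical token ids over a six-token vocabulary.
token_ids = [5, 0, 4, 3, 5, 2, 5, 1, 4]
logits, log_rows = train(token_ids, vocab_size=6)

# Write to a temporary directory so the sketch has no side effects on a repo.
root = Path(tempfile.mkdtemp())
artifacts = root / "artifacts"
outputs = root / "outputs"
artifacts.mkdir()
outputs.mkdir()

# Hypothetical artifact file name and schema.
(artifacts / "weights.json").write_text(json.dumps({"logits": logits}), encoding="utf-8")

with (outputs / "train_log.csv").open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["epoch", "loss"])
    writer.writeheader()
    writer.writerows(log_rows)

print(f"first loss={log_rows[0]['loss']}, last loss={log_rows[-1]['loss']}")
```

Keeping the log in CSV form means the loss curve can be inspected in a spreadsheet or plotted without rerunning training.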

Quick start

See SETUP.md for full setup and workflow instructions.

Run the full training script:

uv run python src/toy_gpt_train/d_train.py

Run individually:

a/b/c are demos (can be run alone if desired)
d_train produces artifacts
e_infer consumes artifacts

uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py

Provenance and Purpose

The primary corpus used for training is declared in SE_MANIFEST.toml.

This repository commits pretrained artifacts so the client can run without retraining.
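
To show what running without retraining could look like, here is a minimal sketch of a consumer that loads a vocabulary and weights from disk and produces a next-token distribution. The file names (`vocab.json`, `weights.json`), their JSON layout, and the function names are hypothetical and may not match the real artifacts. The sketch writes stand-in files to a temporary directory so that it runs on its own.

```python
import json
import math
import random
import tempfile
from pathlib import Path


def load_artifacts(artifact_dir: Path) -> tuple[dict[str, int], list[float]]:
    """Read the (hypothetical) vocabulary and weights files."""
    vocab = json.loads((artifact_dir / "vocab.json").read_text(encoding="utf-8"))
    weights = json.loads((artifact_dir / "weights.json").read_text(encoding="utf-8"))
    return vocab, weights["logits"]


def next_token_distribution(vocab: dict[str, int], logits: list[float]) -> dict[str, float]:
    """Turn stored logits into a token -> probability mapping via softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    id_to_token = {i: tok for tok, i in vocab.items()}
    return {id_to_token[i]: e / total for i, e in enumerate(exps)}


# Stand-in artifacts so the example is self-contained (values are made up).
artifact_dir = Path(tempfile.mkdtemp())
(artifact_dir / "vocab.json").write_text(
    json.dumps({"cat": 0, "sat": 1, "the": 2}), encoding="utf-8"
)
(artifact_dir / "weights.json").write_text(
    json.dumps({"logits": [0.1, 0.4, 1.2]}), encoding="utf-8"
)

vocab, logits = load_artifacts(artifact_dir)
dist = next_token_distribution(vocab, logits)

greedy = max(dist, key=dist.get)  # most probable token
rng = random.Random(0)  # seeded so repeated runs give the same sample
sampled = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(greedy, sampled)
```

No training code is imported or executed here. The consumer only needs the committed files.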

Annotations

ANNOTATIONS.md - REQ/WHY/OBS annotations used

Citation

CITATION.cff

License

MIT

SE Manifest

SE_MANIFEST.toml - project intent, scope, and role

Release history

0.9.8 (this version): Apr 3, 2026
0.9.7: Apr 3, 2026
0.9.6: Jan 20, 2026

Download files

Source Distribution

toy_gpt_train_100_unigram-0.9.8.tar.gz (84.3 kB), uploaded Apr 3, 2026, Source

Built Distribution

toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl (24.3 kB), uploaded Apr 3, 2026, Python 3

File details

Details for the file toy_gpt_train_100_unigram-0.9.8.tar.gz.

File metadata

Download URL: toy_gpt_train_100_unigram-0.9.8.tar.gz
Upload date: Apr 3, 2026
Size: 84.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for toy_gpt_train_100_unigram-0.9.8.tar.gz
Algorithm	Hash digest
SHA256	`49700c5caaa04f03efa16c90d32a962ecc0bbc1b808583ef5ea6abb31dbebb16`
MD5	`b5380c68d971e9da3fd242075d34c40b`
BLAKE2b-256	`ea7b8a5972b2899493b6fafb66fd1f136953072f5f3c1f30235af2245f2b97bc`
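
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the standard library. The helper name `sha256_of_file` is hypothetical, and the sketch assumes the source distribution has been saved in the current directory under its published file name.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest published for the 0.9.8 source distribution.
EXPECTED = "49700c5caaa04f03efa16c90d32a962ecc0bbc1b808583ef5ea6abb31dbebb16"

# Assumed local path; adjust to wherever the file was downloaded.
sdist = Path("toy_gpt_train_100_unigram-0.9.8.tar.gz")

if sdist.exists():
    actual = sha256_of_file(sdist)
    print("digest matches" if actual == EXPECTED else f"MISMATCH: {actual}")
else:
    print(f"{sdist} not found; download it first")
```

The same function works for the wheel by swapping in its file name and published digest.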

Provenance

The following attestation bundles were made for toy_gpt_train_100_unigram-0.9.8.tar.gz:

Publisher: release-pypi-mkdocs-shared.yml on toy-gpt/train-100-unigram

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: toy_gpt_train_100_unigram-0.9.8.tar.gz
- Subject digest: 49700c5caaa04f03efa16c90d32a962ecc0bbc1b808583ef5ea6abb31dbebb16
- Sigstore transparency entry: 1221983439
- Sigstore integration time: Apr 3, 2026
Source repository:
- Permalink: toy-gpt/train-100-unigram@407c3cd61d601ac84f500db6fe8649d80deab183
- Branch / Tag: refs/tags/v0.9.8
- Owner: https://github.com/toy-gpt
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-pypi-mkdocs-shared.yml@407c3cd61d601ac84f500db6fe8649d80deab183
- Trigger Event: push

File details

Details for the file toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl.

File metadata

Download URL: toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl
Upload date: Apr 3, 2026
Size: 24.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`09890c4c9bd3f8d4718417e30f5e429cada8513168fc911e808076401f7ffaf7`
MD5	`62d6af61e1af98634dc8bd62241fc31c`
BLAKE2b-256	`4bb951cfdfaa9668213bdf7aedf02ca1a4804f4fae75e74cc5b6b173968afbe6`

Provenance

The following attestation bundles were made for toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl:

Publisher: release-pypi-mkdocs-shared.yml on toy-gpt/train-100-unigram

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: toy_gpt_train_100_unigram-0.9.8-py3-none-any.whl
- Subject digest: 09890c4c9bd3f8d4718417e30f5e429cada8513168fc911e808076401f7ffaf7
- Sigstore transparency entry: 1221983441
- Sigstore integration time: Apr 3, 2026
Source repository:
- Permalink: toy-gpt/train-100-unigram@407c3cd61d601ac84f500db6fe8649d80deab183
- Branch / Tag: refs/tags/v0.9.8
- Owner: https://github.com/toy-gpt
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release-pypi-mkdocs-shared.yml@407c3cd61d601ac84f500db6fe8649d80deab183
- Trigger Event: push
