Toy GPT next-token prediction using a bigram model.

Toy-GPT: train-200-bigram

Demonstrates, at very small scale, how a language model is trained.

This repository is part of a series of toy training repositories plus a companion client repository:

  • Training repositories produce pretrained artifacts (vocabulary, weights, metadata).
  • A web app loads the artifacts and provides an interactive prompt.

Contents

  • a small, declared text corpus
  • a tokenizer and vocabulary builder
  • a simple next-token prediction model
  • a repeatable training loop
  • committed, inspectable artifacts for downstream use

Scope

This is:

  • an intentionally inspectable training pipeline
  • a next-token predictor trained on an explicit corpus

This is not:

  • a production system
  • a full Transformer implementation
  • a chat interface
  • a claim of semantic understanding

Outputs

This repository produces and commits pretrained artifacts under artifacts/.

Training logs and evidence are written under outputs/ (for example, outputs/train_log.csv).
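
The sketch below shows one way a repeatable training loop could write such a log. It trains a table of bigram logits with plain stochastic gradient descent on cross-entropy loss, using a fixed seed so that repeated runs produce the same numbers. Everything specific here is an assumption for illustration: the function name `train_bigram`, the hyperparameters, the token IDs, and the CSV column names `epoch` and `avg_loss` are not taken from the repository.

```python
import csv
import math
import random
import tempfile
from pathlib import Path


def train_bigram(token_ids, vocab_size, epochs=20, lr=0.5, seed=0, log_path=None):
    """Train bigram logits with SGD; optionally write a per-epoch CSV log."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    # logits[i][j] is the score for token j following token i.
    logits = [
        [rng.uniform(-0.01, 0.01) for _ in range(vocab_size)]
        for _ in range(vocab_size)
    ]
    pairs = list(zip(token_ids, token_ids[1:]))
    history = []
    for epoch in range(1, epochs + 1):
        total_loss = 0.0
        for prev, target in pairs:
            row = logits[prev]
            # Numerically stable softmax over the row for the previous token.
            peak = max(row)
            exps = [math.exp(x - peak) for x in row]
            norm = sum(exps)
            probs = [e / norm for e in exps]
            total_loss += -math.log(probs[target])
            # Cross-entropy gradient w.r.t. logits is probs - one_hot(target).
            for j in range(vocab_size):
                row[j] -= lr * (probs[j] - (1.0 if j == target else 0.0))
        history.append((epoch, total_loss / len(pairs)))
    if log_path is not None:
        log_path.parent.mkdir(parents=True, exist_ok=True)
        with log_path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["epoch", "avg_loss"])  # hypothetical column names
            writer.writerows(history)
    return logits, history


# Hypothetical token IDs for a nine-token corpus with a six-token vocabulary.
token_ids = [5, 0, 4, 2, 5, 1, 5, 0, 3]
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "outputs" / "train_log.csv"
    _, history = train_bigram(token_ids, vocab_size=6, log_path=log_path)
    rows = log_path.read_text().splitlines()

print(rows[0])
print(f"first epoch loss {history[0][1]:.3f}, last epoch loss {history[-1][1]:.3f}")
```

The example writes to a temporary directory so that it does not overwrite any real log.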

Quick start

See SETUP.md for full setup and workflow instructions.

Run the full training script:

uv run python src/toy_gpt_train/d_train.py

Run the steps individually:

  • a_tokenizer, b_vocab, and c_model are demos (each can be run on its own)
  • d_train produces artifacts
  • e_infer consumes artifacts

uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py
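
If it is convenient to drive the steps from one place, a small wrapper along these lines could run them in order. This is a sketch, not a script shipped with the repository: `build_command`, `run_steps`, and the `--run` flag are hypothetical names, and the wrapper only prints the commands unless asked to execute them.

```python
import subprocess
import sys

# Step names taken from the commands listed above, in pipeline order.
STEPS = ("a_tokenizer", "b_vocab", "c_model", "d_train", "e_infer")


def build_command(step: str) -> list[str]:
    """Build the `uv run` command for one step script."""
    return ["uv", "run", "python", f"src/toy_gpt_train/{step}.py"]


def run_steps(steps=STEPS, dry_run=True) -> list[list[str]]:
    """Print each command (dry run) or execute it, stopping on the first failure."""
    commands = [build_command(step) for step in steps]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands


if __name__ == "__main__":
    # Hypothetical flag: pass --run to execute instead of only printing.
    run_steps(dry_run="--run" not in sys.argv)
```

Because d_train produces the artifacts that e_infer consumes, the order of the last two steps matters.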

Provenance and Purpose

The primary corpus used for training is declared in SE_MANIFEST.toml.

This repository commits pretrained artifacts so the client can run without retraining.
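
To make the idea of committed artifacts concrete, the following sketch saves a vocabulary, a weight table, and metadata as JSON and loads them back without retraining. The file names (`vocab.json`, `weights.json`, `meta.json`), the helper names, and the sample values are assumptions for illustration; the actual layout under artifacts/ may differ.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical artifact file names; check artifacts/ for the real ones.
ARTIFACT_FILES = ("vocab.json", "weights.json", "meta.json")


def save_artifacts(directory: Path, vocab: dict, weights: list, metadata: dict) -> None:
    """Write each artifact as human-readable JSON so it can be inspected in a diff."""
    directory.mkdir(parents=True, exist_ok=True)
    for name, payload in zip(ARTIFACT_FILES, (vocab, weights, metadata)):
        text = json.dumps(payload, indent=2, sort_keys=True)
        (directory / name).write_text(text, encoding="utf-8")


def load_artifacts(directory: Path) -> tuple:
    """Read the artifacts back in the same order they were saved."""
    return tuple(
        json.loads((directory / name).read_text(encoding="utf-8"))
        for name in ARTIFACT_FILES
    )


# Made-up sample values, for the example only.
vocab = {"cat": 0, "the": 1}
weights = [[0.1, -0.2], [1.5, 0.0]]
metadata = {"model": "bigram", "vocab_size": 2}

with tempfile.TemporaryDirectory() as tmp:
    artifact_dir = Path(tmp) / "artifacts"
    save_artifacts(artifact_dir, vocab, weights, metadata)
    loaded_vocab, loaded_weights, loaded_meta = load_artifacts(artifact_dir)

print(loaded_meta["model"], len(loaded_vocab))
```

Plain-text formats such as JSON keep the artifacts inspectable, which matches the stated goal of committing them alongside the code.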

Annotations

ANNOTATIONS.md - REQ/WHY/OBS annotations used

Citation

CITATION.cff

License

MIT

SE Manifest

SE_MANIFEST.toml - project intent, scope, and role
