Building Language Models from Scratch — tokenizer, Transformer, training, inference, and fine-tuning in clean, tested PyTorch.

Building Language Models from Scratch

From the chain rule to a deployable model — no black boxes.

Instead of calling a library and treating the model as magic, you build every piece yourself: byte-level BPE, causal self-attention, the Transformer block, a training loop with warmup-cosine and diagnostics, a KV-cache inference engine, an evaluation harness, and a fine-tuning stack (SFT, LoRA, DPO). Each chapter is paired with a notebook that runs end-to-end on a single GPU and produces every figure the chapter cites — you see the real outputs, not idealized diagrams. The library is small (~5,000 LOC), inspectable, and backed by 190 tests; the GRU is verified against PyTorch's reference, and DPO comes with a proof.
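As an illustration of the "warmup-cosine" schedule mentioned above, here is a minimal sketch in plain Python. It is not taken from the library: the function name `warmup_cosine_lr` and its parameters are hypothetical, and the linear warmup shape is an assumption (other warmup shapes are common too).

```python
import math


def warmup_cosine_lr(step, *, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step (hypothetical helper, not the library API).

    Assumed shape: linear warmup to max_lr over warmup_steps,
    then cosine decay down to min_lr at total_steps.
    """
    if step < warmup_steps:
        # Linear warmup: step 0 already gets a small non-zero rate.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # Past the schedule: hold at the floor.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

In a training loop, a value like this would typically be written into the optimizer's parameter groups before each step; how the library's own trainer wires this up is not shown here.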

Part of the "Beyond … and Pray" series: governed agents · trustworthy RAG · test & validate · LLMs from scratch

What's inside

  • Tokenization — character / word / byte + byte-level BPE from scratch
  • Transformers — attention, multi-head, RoPE / ALiBi / sinusoidal / learned, TinyGPT
  • Training — Trainer, scaling laws, mixed precision, diagnostics
  • Inference — greedy / beam / top-k / top-p + a KV-cache engine
  • Evaluation — perplexity, calibration, LLM-as-judge, contamination checks
  • Fine-tuning — SFT (loss-masked), LoRA from scratch, DPO with a proof
  • 190 tests, 33 notebooks, one per chapter
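To make the first item in the list concrete, the following is a minimal sketch of byte-level BPE using only the standard library. The function names (`train_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical and chosen for illustration; the package's own tokenizer may be organized differently.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Returns a dict mapping (left_id, right_id) -> new_id, in learned order.
    Ids 0..255 are raw bytes; merged tokens start at 256.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Apply the learned merges, in the order they were learned, to new text."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back to bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Because the base vocabulary is the 256 possible byte values, any input string can be encoded without an unknown-token fallback, and decoding a sequence produced by `encode` returns the original text.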

Install

pip install llm-from-scratch              # core: torch, numpy, scikit-learn
pip install "llm-from-scratch[notebooks]" # + jupyterlab, matplotlib, pandas
pip install "llm-from-scratch[all]"       # everything

Requires Python 3.12 or later. A CUDA-capable GPU is recommended for the training chapters.

License

Apache-2.0. © 2026 Knowlytix.

Download files

Source Distribution

llm_from_scratch-0.1.0.tar.gz (89.6 kB)

Uploaded Source

Built Distribution

llm_from_scratch-0.1.0-py3-none-any.whl (99.6 kB)

Uploaded Python 3

File details

Details for the file llm_from_scratch-0.1.0.tar.gz.

File metadata

  • Download URL: llm_from_scratch-0.1.0.tar.gz
  • Size: 89.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_from_scratch-0.1.0.tar.gz
Algorithm Hash digest
SHA256 636fa67dc27da4ed04ea3c4472dddb927b96e5f408ebb0a2c64a53d5e1163774
MD5 27ed32b6bdea8f73b2fdb1b51ca576ee
BLAKE2b-256 bbd2428d524af043418c6fccf66e90e0a3524927459e238bbd848714d15c6dc8
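A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The sketch below is illustrative only; the helper names `sha256_of` and `matches_digest` are hypothetical and not part of the package.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """True if the file's SHA256 equals `expected_hex` (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Example use, assuming the sdist was downloaded to the current directory:
#   matches_digest(
#       "llm_from_scratch-0.1.0.tar.gz",
#       "636fa67dc27da4ed04ea3c4472dddb927b96e5f408ebb0a2c64a53d5e1163774",
#   )
```

Reading in chunks keeps memory use flat regardless of file size. The same check applies to the wheel, using the digest listed for that file.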

Provenance

The following attestation bundles were made for llm_from_scratch-0.1.0.tar.gz:

Publisher: publish.yml on knowlytix/llm-from-scratch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llm_from_scratch-0.1.0-py3-none-any.whl.

File hashes

Hashes for llm_from_scratch-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0e52e51b14e474a45cc9b5ee3e1b635e3af460f2a020ce85aa2880d149d85871
MD5 090d2bb4a1538351b3146129ce5a1c66
BLAKE2b-256 faa9c653eb00c128a40cc6c5efdba96d434fba2a482c2edc3faefd239d18efd1

Provenance

The following attestation bundles were made for llm_from_scratch-0.1.0-py3-none-any.whl:

Publisher: publish.yml on knowlytix/llm-from-scratch
