buildNanoGPT
buildNanoGPT is developed based on Andrej Karpathy’s build-nanoGPT repo and his lecture Let’s reproduce GPT-2 (124M), with added notes and details for teaching purposes. It is built with nbdev, which enables package development, testing, documentation, and dissemination all in one place: Jupyter Notebook, or in my case Visual Studio Code’s Jupyter Notebook support 😄.
Literate Programming
buildNanoGPT
flowchart LR
A(Andrej's build-nanoGPT) --> C((Combination))
B(Jeremy's nbdev) --> C
C -->|Literate Programming| D(buildNanoGPT)
Disclaimers
buildNanoGPT is written based on Andrej Karpathy’s GitHub repo named build-nanoGPT and his “Neural Networks: Zero to Hero” lecture series, specifically the lecture called Let’s reproduce GPT-2 (124M).
Andrej is the man who needs no introduction in the field of Deep Learning. He released a series of lectures called Neural Networks: Zero to Hero, which I found extremely educational and practical. I am reviewing the lectures and creating notes for myself and for teaching purposes.
buildNanoGPT was written using nbdev, which was developed by Jeremy Howard, a man who also needs no introduction in the field of Deep Learning. Jeremy created the fastai Deep Learning library and courses, which are extremely influential. I highly recommend fastai if you are interested in starting your journey into ML and DL.
nbdev is a powerful tool that can be used to efficiently develop, build, test, document, and distribute software packages all in one place: Jupyter Notebook, or Jupyter Notebooks in VS Code, which I am using.
If you study lectures by Andrej and Jeremy, you will probably notice that they are both great educators who use both top-down and bottom-up approaches in their teaching; Andrej predominantly uses the bottom-up approach, while Jeremy predominantly uses the top-down one. I am personally fascinated by both educators, have found value in both of their styles, and hope you will too!
Usage
Prepare FineWeb-Edu-10B data
from buildNanoGPT import data
import tiktoken
import numpy as np
enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens['<|endoftext|>'] # end of text token
eot
50256
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.uint16)
t_ref
array([50256, 15496, 11, 995, 0], dtype=uint16)
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.int32) # same tokens as int32; uint16 is enough since the GPT-2 vocab (50257) fits below 2**16
t_ref
array([50256, 15496, 11, 995, 0], dtype=int32)
doc = {"text":"Hello, world!"}
t_test = data.tokenize(doc)
t_test
array([50256, 15496, 11, 995, 0], dtype=uint16)
assert np.all(t_ref == t_test)
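For reference, the behavior of data.tokenize can be sketched roughly as below. This is a minimal illustration, not the library’s exact implementation; the encoder and eot token are passed in explicitly here, whereas the library uses the module-level gpt2 encoder.

```python
import numpy as np

def tokenize_sketch(doc, enc, eot):
    # prepend the end-of-text token as a document delimiter,
    # then encode the document body
    tokens = [eot]
    tokens.extend(enc.encode(doc["text"]))
    tokens_np = np.array(tokens)
    # GPT-2 token ids fit in uint16 (vocab size 50257 < 2**16)
    assert (0 <= tokens_np).all() and (tokens_np < 2**16).all(), "token id out of uint16 range"
    return tokens_np.astype(np.uint16)
```

Storing shards as uint16 instead of int32 halves the dataset footprint on disk, which matters at the 10B-token scale.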
# Download and Prepare the FineWeb-Edu-10B sample Data
data.edu_fineweb10B_prep(is_test=True)
Resolving data files: 0%| | 0/1630 [00:00<?, ?it/s]
Loading dataset shards: 0%| | 0/98 [00:00<?, ?it/s]
'Hello from `prepare_edu_fineweb10B()`! if you want to download the dataset, set is_test=False and run again.'
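Internally, the downloaded token stream is written out as fixed-size uint16 shards (100M tokens per shard in the original build-nanoGPT setup), with the last shard possibly shorter. The chunking logic can be sketched as follows (an illustrative helper, not the library’s API):

```python
import numpy as np

def shard_tokens(tokens_np, shard_size):
    # split a flat uint16 token array into fixed-size shards;
    # the final shard holds whatever remains and may be shorter
    return [tokens_np[i:i + shard_size] for i in range(0, len(tokens_np), shard_size)]
```

With 10B tokens and a shard size of 100M, this yields the ~100 shards reported by the data loaders below.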
Prepare HellaSwag Evaluation data
data.hellaswag_val_prep(is_test=True)
'Hello from `hellaswag_val_prep()`! if you want to download the dataset, set is_test=False and run again.'
Load Pre-trained Weight
from buildNanoGPT.model import GPT, GPTConfig
from buildNanoGPT.train import DDPConfig, TrainingConfig, generate_text
import tiktoken
import torch
from torch.nn import functional as F
master_process = True
model = GPT.from_pretrained("gpt2", master_process)
loading weights from pretrained gpt: gpt2
enc = tiktoken.get_encoding('gpt2')
ddp_cf = DDPConfig()
model.to(ddp_cf.device)
using device: cuda
GPT(
(transformer): ModuleDict(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(h): ModuleList(
(0-11): 12 x Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(c_attn): Linear(in_features=768, out_features=2304, bias=True)
(c_proj): Linear(in_features=768, out_features=768, bias=True)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(gelu): GELU(approximate='tanh')
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
generate_text(model, enc, ddp_cf)
rank 0 sample 0: Hello, I'm a language model, and I do not want to use some third-party file manager I used on my laptop. It would probably be easier
rank 0 sample 1: Hello, I'm a language model, not a problem solver. I should be writing. In the first book, I was in the trouble of proving that
rank 0 sample 2: Hello, I'm a language model, not a script," he said.
Banks and regulators will likely be wary of such a move, but for
rank 0 sample 3: Hello, I'm a language model, you must understand this.
So what really happened?
This article would be too short and concise. That
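generate_text samples autoregressively from the model; the characteristic step in the GPT-2 setup is top-k sampling with k=50. A NumPy sketch of that single sampling step (names here are illustrative, not the library’s API):

```python
import numpy as np

def sample_top_k(logits, k=50, rng=None):
    # keep only the k highest-probability tokens, renormalize, and sample
    rng = rng or np.random.default_rng()
    topk_idx = np.argsort(logits)[-k:]    # indices of the k largest logits
    topk_logits = logits[topk_idx]
    probs = np.exp(topk_logits - topk_logits.max())
    probs /= probs.sum()                  # softmax over the top-k only
    return int(rng.choice(topk_idx, p=probs))
```

Restricting sampling to the top k tokens clips off the long low-probability tail, which is why untrained or lightly trained models still produce token-shaped (if incoherent) text rather than pure noise.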
Training
- import modules and functions
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model
import torch
- set seed for random number generator for reproducibility
set_random_seed(seed=1337) # for reproducibility
- initialize the DDP and Training configs; read the documentation and modify the config parameters as desired
ddp_cf = DDPConfig()
using device: cuda
train_cf = TrainingConfig()
using device: cuda
- set up train and validation dataloaders
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
found 99 shards for split train
found 1 shards for split val
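DataLoaderLite serves contiguous chunks of B*T tokens, with the targets shifted by one position. A simplified single-process sketch of how one batch is cut from a flat token buffer (illustrative; the real loader also handles shard rotation and per-rank offsets under DDP):

```python
import numpy as np

def next_batch(tokens, pos, B, T):
    # take B*T+1 tokens starting at pos; x is the input, y is x shifted by one
    buf = tokens[pos : pos + B * T + 1]
    x = buf[:-1].reshape(B, T)   # inputs
    y = buf[1:].reshape(B, T)    # targets (next-token labels)
    pos += B * T
    # wrap around when the next batch would run off the end of the buffer
    if pos + B * T + 1 > len(tokens):
        pos = 0
    return x, y, pos
```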
- set up the GPT model
model = create_model(ddp_cf)
- train the GPT model
train_GPT(model, train_loader, val_loader, train_cf, ddp_cf)
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,354,560 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 10.9834
HellaSwag accuracy: 2534/10042=0.2523
step 0 | loss: 10.981724 | lr 6.0000e-05 | norm: 15.4339 | dt: 82819.52ms | tok/sec: 6330.49
step 1 | loss: 10.157787 | lr 1.2000e-04 | norm: 6.5679 | dt: 10668.81ms | tok/sec: 49142.14
step 2 | loss: 9.793260 | lr 1.8000e-04 | norm: 2.8270 | dt: 10747.73ms | tok/sec: 48781.28
step 3 | loss: 9.575678 | lr 2.4000e-04 | norm: 2.2934 | dt: 10789.36ms | tok/sec: 48593.07
step 4 | loss: 9.409717 | lr 3.0000e-04 | norm: 2.0182 | dt: 10883.30ms | tok/sec: 48173.61
step 5 | loss: 9.196922 | lr 3.6000e-04 | norm: 2.0160 | dt: 10734.89ms | tok/sec: 48839.61
step 6 | loss: 8.960140 | lr 4.2000e-04 | norm: 1.8684 | dt: 10902.57ms | tok/sec: 48088.46
step 7 | loss: 8.707756 | lr 4.8000e-04 | norm: 1.5884 | dt: 10851.94ms | tok/sec: 48312.84
step 8 | loss: 8.428266 | lr 5.4000e-04 | norm: 1.3737 | dt: 10883.36ms | tok/sec: 48173.34
step 9 | loss: 8.166906 | lr 6.0000e-04 | norm: 1.1468 | dt: 10797.07ms | tok/sec: 48558.35
step 10 | loss: 8.857561 | lr 6.0000e-04 | norm: 23.7457 | dt: 10755.35ms | tok/sec: 48746.74
step 11 | loss: 7.858195 | lr 5.8679e-04 | norm: 0.8712 | dt: 10667.08ms | tok/sec: 49150.09
step 12 | loss: 7.823021 | lr 5.4843e-04 | norm: 0.7075 | dt: 10793.02ms | tok/sec: 48576.59
step 13 | loss: 7.755527 | lr 4.8870e-04 | norm: 0.6744 | dt: 10827.16ms | tok/sec: 48423.42
step 14 | loss: 7.593850 | lr 4.1343e-04 | norm: 0.5836 | dt: 10730.71ms | tok/sec: 48858.64
step 15 | loss: 7.618423 | lr 3.3000e-04 | norm: 0.6430 | dt: 10648.68ms | tok/sec: 49235.03
step 16 | loss: 7.664069 | lr 2.4657e-04 | norm: 0.5456 | dt: 10749.31ms | tok/sec: 48774.10
step 17 | loss: 7.603458 | lr 1.7130e-04 | norm: 0.6211 | dt: 10837.78ms | tok/sec: 48375.97
step 18 | loss: 7.809735 | lr 1.1157e-04 | norm: 0.4929 | dt: 10698.80ms | tok/sec: 49004.37
validation loss: 7.6044
HellaSwag accuracy: 2448/10042=0.2438
rank 0 sample 0: Hello, I'm a language model,:
the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
or:
the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
of:)
step 19 | loss: 7.893970 | lr 7.3215e-05 | norm: 0.6688 | dt: 85602.68ms | tok/sec: 6124.67
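The `calculated gradient accumulation steps: 32` line above follows from simple arithmetic: the desired total batch of 524,288 tokens (2**19) is divided by the tokens processed per forward pass across all processes. A sketch, assuming B=16, T=1024, and a single GPU as in this run (these values are inferred from the log, not read from the library):

```python
total_batch_size = 524288   # desired tokens per optimizer step (2**19)
B, T = 16, 1024             # micro-batch size and sequence length (assumed values)
ddp_world_size = 1          # single GPU in this run

# the total batch must divide evenly into micro-batches across all processes
assert total_batch_size % (B * T * ddp_world_size) == 0
grad_accum_steps = total_batch_size // (B * T * ddp_world_size)
print(grad_accum_steps)  # 32
```

Gradient accumulation lets a single GPU emulate the large effective batch used in the GPT-2 paper by summing gradients over 32 micro-batches before each optimizer step.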
Load Checkpoint
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
- set up the GPT model
ddp_cf = DDPConfig()
model = create_model(ddp_cf)
using device: cuda
- load the model weights from the saved checkpoint
model_checkpoint = torch.load("log/model_00019.pt")
checkpoint_state_dict = model_checkpoint['model']
model.load_state_dict(checkpoint_state_dict)
<All keys matched successfully>
- generate text from saved weights
enc = tiktoken.get_encoding('gpt2')
generate_text(model, enc, ddp_cf)
rank 0 sample 0: Hello, I'm a language model,:
the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
or:
the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
of:)
Fine-tune from OpenAI’s weights
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
- load OpenAI’s pre-trained weights
ddp_cf = DDPConfig()
model_fine = GPT.from_pretrained("gpt2", ddp_cf.master_process)
model_fine.to(ddp_cf.device)
using device: cuda
loading weights from pretrained gpt: gpt2
GPT(
(transformer): ModuleDict(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(h): ModuleList(
(0-11): 12 x Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(c_attn): Linear(in_features=768, out_features=2304, bias=True)
(c_proj): Linear(in_features=768, out_features=768, bias=True)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(gelu): GELU(approximate='tanh')
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
- set seed for reproducibility
set_random_seed(seed=1337) # for reproducibility
- set up training parameters: set max_lr to a small number since this is a fine-tuning step. More advanced fine-tuning may include supervised fine-tuning (SFT) using custom data and finer control over which layers are fine-tuned more or less.
train_cf = TrainingConfig(max_lr=1e-6)
using device: cuda
- set up train and validation data-loaders
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
found 99 shards for split train
found 1 shards for split val
- fine-tune the model
train_GPT(model_fine, train_loader, val_loader, train_cf, ddp_cf)
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 3.2530
HellaSwag accuracy: 2970/10042=0.2958
step 0 | loss: 3.279157 | lr 1.0000e-07 | norm: 2.3655 | dt: 80251.91ms | tok/sec: 6533.03
step 1 | loss: 3.322400 | lr 2.0000e-07 | norm: 2.3916 | dt: 10466.55ms | tok/sec: 50091.77
step 2 | loss: 3.310521 | lr 3.0000e-07 | norm: 2.5691 | dt: 10404.72ms | tok/sec: 50389.42
step 3 | loss: 3.403320 | lr 4.0000e-07 | norm: 2.5293 | dt: 10539.22ms | tok/sec: 49746.40
step 4 | loss: 3.280189 | lr 5.0000e-07 | norm: 2.5589 | dt: 10462.80ms | tok/sec: 50109.70
step 5 | loss: 3.341536 | lr 6.0000e-07 | norm: 2.4456 | dt: 10489.14ms | tok/sec: 49983.90
step 6 | loss: 3.388632 | lr 7.0000e-07 | norm: 2.3444 | dt: 10656.34ms | tok/sec: 49199.62
step 7 | loss: 3.336595 | lr 8.0000e-07 | norm: 2.4381 | dt: 10750.67ms | tok/sec: 48767.94
step 8 | loss: 3.358722 | lr 9.0000e-07 | norm: 2.0390 | dt: 10728.56ms | tok/sec: 48868.44
step 9 | loss: 3.303847 | lr 1.0000e-06 | norm: 2.5693 | dt: 10549.71ms | tok/sec: 49696.89
step 10 | loss: 3.338424 | lr 1.0000e-06 | norm: 2.5449 | dt: 10565.95ms | tok/sec: 49620.54
step 11 | loss: 3.326447 | lr 9.7798e-07 | norm: 2.2862 | dt: 10577.53ms | tok/sec: 49566.18
step 12 | loss: 3.297659 | lr 9.1406e-07 | norm: 2.2453 | dt: 10640.80ms | tok/sec: 49271.47
step 13 | loss: 3.298663 | lr 8.1450e-07 | norm: 2.2228 | dt: 10551.25ms | tok/sec: 49689.67
step 14 | loss: 3.304088 | lr 6.8906e-07 | norm: 2.5593 | dt: 10415.45ms | tok/sec: 50337.54
step 15 | loss: 3.373518 | lr 5.5000e-07 | norm: 2.3321 | dt: 10446.78ms | tok/sec: 50186.59
step 16 | loss: 3.314626 | lr 4.1094e-07 | norm: 2.3768 | dt: 10416.73ms | tok/sec: 50331.33
step 17 | loss: 3.331042 | lr 2.8550e-07 | norm: 2.1369 | dt: 10248.14ms | tok/sec: 51159.35
step 18 | loss: 3.334763 | lr 1.8594e-07 | norm: 1.8012 | dt: 10206.37ms | tok/sec: 51368.71
validation loss: 3.2394
HellaSwag accuracy: 2959/10042=0.2947
rank 0 sample 0: Hello, I'm a language model, and I know how it works: You, to my knowledge, invented Java!
We all do the same stuff
rank 0 sample 1: Hello, I'm a language model, not a function. It's the last thing that works here, I guess. I think this is very much a misunderstanding
rank 0 sample 2: Hello, I'm a language model, not a writing language. Let's use a syntax like this (which is a bit different from the one in C):
rank 0 sample 3: Hello, I'm a language model, you and I can talk about it!" He also said that he doesn't want to use other people's language, nor
step 19 | loss: 3.189983 | lr 1.2202e-07 | norm: 1.9916 | dt: 80862.14ms | tok/sec: 6483.73
Note: try using a smaller max_lr, running more steps, and applying advanced fine-tuning techniques to see more pronounced impacts.
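The learning-rate values in the log above are consistent with the linear-warmup plus cosine-decay schedule from the original build-nanoGPT, with 10 warmup steps, 20 total steps, and min_lr = 0.1 * max_lr. A sketch of that schedule (parameter values inferred from the log):

```python
import math

def get_lr(it, max_lr=1e-6, warmup_steps=10, max_steps=20):
    min_lr = max_lr * 0.1
    # 1) linear warmup up to max_lr
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    # 2) past the end of training, hold at min_lr
    if it > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```

Evaluating it reproduces the logged values, e.g. step 0 gives 1.0000e-07 and step 19 gives 1.2202e-07.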
Visualize the Loss
from buildNanoGPT.viz import plot_log
plot_log(log_file='log/log_6500steps.txt', sz='124M')
Min Train Loss: 2.997356
Min Validation Loss: 3.275
Max Hellaswag eval: 0.2782
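plot_log works by parsing the text log written during training. The per-step console lines above follow a simple `step N | loss: X | lr Y | ...` format; the on-disk log that plot_log reads may use a different layout, but parsing either is straightforward. As an illustration, a minimal parser for the per-step lines shown in this README:

```python
import re

STEP_RE = re.compile(r"step (\d+) \| loss: ([\d.]+) \| lr ([\d.e+-]+)")

def parse_train_log(lines):
    # extract (step, loss, lr) triples from per-step log lines,
    # skipping validation/HellaSwag/sample lines that don't match
    out = []
    for line in lines:
        m = STEP_RE.search(line)
        if m:
            out.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return out
```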
How to install
The buildNanoGPT package is published on PyPI and can be installed with the command below.
pip install buildNanoGPT
Developer install
If you want to develop buildNanoGPT yourself, please use an editable
installation.
git clone https://github.com/hdocmsu/buildNanoGPT.git
pip install -e "buildNanoGPT[dev]"
You also need editable installations of nbdev, fastcore, and execnb.
Happy Coding!!!
Note: buildNanoGPT is currently Work in Progress (WIP).