
buildNanoGPT

buildNanoGPT is developed based on Andrej Karpathy’s build-nanoGPT repo and his “Let’s reproduce GPT-2 (124M)” lecture, with added notes and details for teaching purposes. It is built using nbdev, which enables package development, testing, documentation, and dissemination all in one place: a Jupyter Notebook (or, in my case, Jupyter Notebooks in Visual Studio Code 😄).

Literate Programming

buildNanoGPT

flowchart LR
  A(Andrej's build-nanoGPT) --> C((Combination))
  B(Jeremy's nbdev) --> C
  C -->|Literate Programming| D(buildNanoGPT)

micrograd2023

Disclaimers

buildNanoGPT is written based on Andrej Karpathy’s GitHub repo build-nanoGPT and his “Neural Networks: Zero to Hero” lecture series, specifically the lecture “Let’s reproduce GPT-2 (124M)”.

Andrej is a man who needs no introduction in the field of Deep Learning. He released a lecture series called “Neural Networks: Zero to Hero”, which I found extremely educational and practical. I am reviewing the lectures and creating notes for myself and for teaching purposes.

buildNanoGPT was written using nbdev, which was developed by Jeremy Howard, a man who also needs no introduction in the field of Deep Learning. Jeremy created the fastai deep learning library and courses, both of which are extremely influential. I highly recommend fastai if you are interested in starting your journey and learning ML and DL.

nbdev is a powerful tool for efficiently developing, building, testing, documenting, and distributing software packages all in one place: Jupyter Notebooks, or, as I am using, Jupyter Notebooks in VS Code.

If you study lectures by Andrej and Jeremy, you will probably notice that both are great educators who use top-down as well as bottom-up approaches in their teaching, though Andrej predominantly teaches bottom-up while Jeremy predominantly teaches top-down. I am personally fascinated by both educators, have found value in both styles, and hope you will too!

Usage

Prepare FineWeb-Edu-10B data

from buildNanoGPT import data
import tiktoken
import numpy as np
enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens['<|endoftext|>'] # end of text token
eot
50256
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.uint16)
t_ref
array([50256, 15496,    11,   995,     0], dtype=uint16)
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.int32)
t_ref
array([50256, 15496,    11,   995,     0], dtype=int32)
doc = {"text":"Hello, world!"}
t_test = data.tokenize(doc)
t_test
array([50256, 15496,    11,   995,     0], dtype=uint16)
assert np.all(t_ref == t_test)
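Based on the reference construction above, `data.tokenize` presumably prepends the `<|endoftext|>` token, encodes the document text, and stores the ids as `uint16` (safe because GPT-2’s vocabulary size, 50257, fits in 16 bits). A minimal sketch with a stand-in encoder (hypothetical; the real function uses the tiktoken encoder):

```python
import numpy as np

EOT = 50256  # GPT-2 <|endoftext|> token id

def tokenize_sketch(doc, encode):
    """Prepend EOT, encode the text, and pack ids into uint16
    (valid since every GPT-2 token id < 2**16)."""
    ids = [EOT] + encode(doc["text"])
    arr = np.array(ids)
    assert (arr >= 0).all() and (arr < 2**16).all(), "ids must fit in uint16"
    return arr.astype(np.uint16)

# stand-in for enc.encode; these are the ids for "Hello, world!" shown above
fake_encode = lambda s: [15496, 11, 995, 0]
t = tokenize_sketch({"text": "Hello, world!"}, fake_encode)
```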
# Download and Prepare the FineWeb-Edu-10B sample Data
data.edu_fineweb10B_prep(is_test=True)
Resolving data files:   0%|          | 0/1630 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/98 [00:00<?, ?it/s]

'Hello from `prepare_edu_fineweb10B()`! if you want to download the dataset, set is_test=False and run again.'

Prepare HellaSwag Evaluation data

data.hellaswag_val_prep(is_test=True)
'Hello from `hellaswag_val_prep()`! if you want to download the dataset, set is_test=False and run again.'
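For context on the HellaSwag numbers reported during training below: each example has four candidate endings, and the model is typically scored by which ending it assigns the lowest average loss, so random guessing lands near 25%. A hypothetical sketch of that scoring rule (the exact scoring in the repo may differ):

```python
import numpy as np

def pick_ending(losses_per_ending):
    """HellaSwag-style scoring: each example has 4 candidate endings;
    predict the ending whose completion tokens have the lowest mean
    cross-entropy under the model (random guessing -> ~25% accuracy)."""
    mean_losses = [np.mean(l) for l in losses_per_ending]
    return int(np.argmin(mean_losses))

# hypothetical per-token losses for the 4 endings of one example
losses = [[2.1, 3.0], [1.2, 0.9, 1.1], [4.0], [2.5, 2.6]]
pred = pick_ending(losses)  # ending 1 has the lowest mean loss
```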

Load Pre-trained Weight

from buildNanoGPT.model import GPT, GPTConfig
from buildNanoGPT.train import DDPConfig, TrainingConfig, generate_text
import tiktoken
import torch
from torch.nn import functional as F
master_process = True
model = GPT.from_pretrained("gpt2", master_process)
loading weights from pretrained gpt: gpt2
enc = tiktoken.get_encoding('gpt2')
ddp_cf = DDPConfig()
model.to(ddp_cf.device)
using device: cuda

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='tanh')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
generate_text(model, enc, ddp_cf)
rank 0 sample 0: Hello, I'm a language model, and I do not want to use some third-party file manager I used on my laptop. It would probably be easier
rank 0 sample 1: Hello, I'm a language model, not a problem solver. I should be writing. In the first book, I was in the trouble of proving that
rank 0 sample 2: Hello, I'm a language model, not a script," he said.

Banks and regulators will likely be wary of such a move, but for
rank 0 sample 3: Hello, I'm a language model, you must understand this.

So what really happened?

This article would be too short and concise. That
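`generate_text` is not shown here, but a typical autoregressive decoding step uses top-k sampling: keep the k highest-logit tokens, renormalize, and draw one. A hypothetical numpy sketch of that per-step selection (not the repo’s actual implementation, which runs on the model’s logits):

```python
import numpy as np

def sample_top_k(logits, k=50, rng=None):
    """One decoding step: restrict to the k largest logits,
    softmax over them, and sample a token id."""
    if rng is None:
        rng = np.random.default_rng(0)
    top = np.argsort(logits)[-k:]                    # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())  # stable softmax over the top-k
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

logits = np.random.default_rng(42).normal(size=50257)  # fake lm_head output
tok = sample_top_k(logits, k=50)
```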

Training

  1. import modules and functions
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model
import torch
  2. set the seed for the random number generator for reproducibility
set_random_seed(seed=1337) # for reproducibility
  3. initialize the DDP and training configs; read the documentation and modify the config parameters as desired
ddp_cf = DDPConfig()
using device: cuda
train_cf = TrainingConfig()
using device: cuda
  4. set up the train and validation dataloaders
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
found 99 shards for split train
found 1 shards for split val
  5. set up the GPT model
model = create_model(ddp_cf)
  6. train the GPT model
train_GPT(model, train_loader, val_loader, train_cf, ddp_cf)
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,354,560 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 10.9834
HellaSwag accuracy: 2534/10042=0.2523
step     0 | loss: 10.981724 | lr 6.0000e-05 | norm: 15.4339 | dt: 82819.52ms | tok/sec: 6330.49
step     1 | loss: 10.157787 | lr 1.2000e-04 | norm: 6.5679 | dt: 10668.81ms | tok/sec: 49142.14
step     2 | loss: 9.793260 | lr 1.8000e-04 | norm: 2.8270 | dt: 10747.73ms | tok/sec: 48781.28
step     3 | loss: 9.575678 | lr 2.4000e-04 | norm: 2.2934 | dt: 10789.36ms | tok/sec: 48593.07
step     4 | loss: 9.409717 | lr 3.0000e-04 | norm: 2.0182 | dt: 10883.30ms | tok/sec: 48173.61
step     5 | loss: 9.196922 | lr 3.6000e-04 | norm: 2.0160 | dt: 10734.89ms | tok/sec: 48839.61
step     6 | loss: 8.960140 | lr 4.2000e-04 | norm: 1.8684 | dt: 10902.57ms | tok/sec: 48088.46
step     7 | loss: 8.707756 | lr 4.8000e-04 | norm: 1.5884 | dt: 10851.94ms | tok/sec: 48312.84
step     8 | loss: 8.428266 | lr 5.4000e-04 | norm: 1.3737 | dt: 10883.36ms | tok/sec: 48173.34
step     9 | loss: 8.166906 | lr 6.0000e-04 | norm: 1.1468 | dt: 10797.07ms | tok/sec: 48558.35
step    10 | loss: 8.857561 | lr 6.0000e-04 | norm: 23.7457 | dt: 10755.35ms | tok/sec: 48746.74
step    11 | loss: 7.858195 | lr 5.8679e-04 | norm: 0.8712 | dt: 10667.08ms | tok/sec: 49150.09
step    12 | loss: 7.823021 | lr 5.4843e-04 | norm: 0.7075 | dt: 10793.02ms | tok/sec: 48576.59
step    13 | loss: 7.755527 | lr 4.8870e-04 | norm: 0.6744 | dt: 10827.16ms | tok/sec: 48423.42
step    14 | loss: 7.593850 | lr 4.1343e-04 | norm: 0.5836 | dt: 10730.71ms | tok/sec: 48858.64
step    15 | loss: 7.618423 | lr 3.3000e-04 | norm: 0.6430 | dt: 10648.68ms | tok/sec: 49235.03
step    16 | loss: 7.664069 | lr 2.4657e-04 | norm: 0.5456 | dt: 10749.31ms | tok/sec: 48774.10
step    17 | loss: 7.603458 | lr 1.7130e-04 | norm: 0.6211 | dt: 10837.78ms | tok/sec: 48375.97
step    18 | loss: 7.809735 | lr 1.1157e-04 | norm: 0.4929 | dt: 10698.80ms | tok/sec: 49004.37
validation loss: 7.6044
HellaSwag accuracy: 2448/10042=0.2438
rank 0 sample 0: Hello, I'm a language model,:
 the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
 to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
 or:
 the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
 of:)
step    19 | loss: 7.893970 | lr 7.3215e-05 | norm: 0.6688 | dt: 85602.68ms | tok/sec: 6124.67
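Two numbers in the log above can be reproduced by hand. The gradient-accumulation count divides the desired token batch by the micro-batch, and the learning rate follows a linear-warmup-then-cosine-decay schedule. Note that B=16, T=1024, a single GPU, warmup_steps=10, and max_steps=20 are assumptions inferred from the log, not values read from the source:

```python
import math

total_batch = 524288            # desired tokens per optimizer step (2**19), as logged
B, T, world_size = 16, 1024, 1  # assumed micro-batch configuration
grad_accum = total_batch // (B * T * world_size)  # matches the logged 32

max_lr, min_lr = 6e-4, 6e-5     # assumed min_lr = 0.1 * max_lr
warmup_steps, max_steps = 10, 20

def get_lr(it):
    """Linear warmup for warmup_steps, then cosine decay to min_lr."""
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    if it > max_steps:
        return min_lr
    ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # 1 -> 0 over the decay window
    return min_lr + coeff * (max_lr - min_lr)
```

Under these assumptions the schedule reproduces the logged rates, e.g. 6.0000e-05 at step 0, 6.0000e-04 at step 9, and 7.3215e-05 at step 19.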

Load Checkpoint

from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
  1. set up the GPT model
ddp_cf = DDPConfig()
model = create_model(ddp_cf)
using device: cuda
  2. load the model weights from the saved checkpoint
model_checkpoint = torch.load("log/model_00019.pt")
checkpoint_state_dict = model_checkpoint['model']
model.load_state_dict(checkpoint_state_dict)
<All keys matched successfully>
  3. generate text from the saved weights
enc = tiktoken.get_encoding('gpt2')
generate_text(model, enc, ddp_cf)
rank 0 sample 0: Hello, I'm a language model,:
 the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
 to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
 or:
 the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
 of:)

Fine-tune from OpenAI’s weights

from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
  1. load OpenAI’s pre-trained weights
ddp_cf = DDPConfig()
model_fine = GPT.from_pretrained("gpt2", ddp_cf.master_process)
model_fine.to(ddp_cf.device)
using device: cuda
loading weights from pretrained gpt: gpt2

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='tanh')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
  2. set the seed for reproducibility
set_random_seed(seed=1337) # for reproducibility
  3. set up the training parameters; set max_lr to a small value since this is a fine-tuning step. More advanced fine-tuning may include supervised fine-tuning (SFT) on custom data and finer control over how strongly each layer is updated.
train_cf = TrainingConfig(max_lr=1e-6)
using device: cuda
  4. set up the train and validation dataloaders
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
found 99 shards for split train
found 1 shards for split val
  5. fine-tune the model
train_GPT(model_fine, train_loader, val_loader, train_cf, ddp_cf)
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 3.2530
HellaSwag accuracy: 2970/10042=0.2958
step     0 | loss: 3.279157 | lr 1.0000e-07 | norm: 2.3655 | dt: 80251.91ms | tok/sec: 6533.03
step     1 | loss: 3.322400 | lr 2.0000e-07 | norm: 2.3916 | dt: 10466.55ms | tok/sec: 50091.77
step     2 | loss: 3.310521 | lr 3.0000e-07 | norm: 2.5691 | dt: 10404.72ms | tok/sec: 50389.42
step     3 | loss: 3.403320 | lr 4.0000e-07 | norm: 2.5293 | dt: 10539.22ms | tok/sec: 49746.40
step     4 | loss: 3.280189 | lr 5.0000e-07 | norm: 2.5589 | dt: 10462.80ms | tok/sec: 50109.70
step     5 | loss: 3.341536 | lr 6.0000e-07 | norm: 2.4456 | dt: 10489.14ms | tok/sec: 49983.90
step     6 | loss: 3.388632 | lr 7.0000e-07 | norm: 2.3444 | dt: 10656.34ms | tok/sec: 49199.62
step     7 | loss: 3.336595 | lr 8.0000e-07 | norm: 2.4381 | dt: 10750.67ms | tok/sec: 48767.94
step     8 | loss: 3.358722 | lr 9.0000e-07 | norm: 2.0390 | dt: 10728.56ms | tok/sec: 48868.44
step     9 | loss: 3.303847 | lr 1.0000e-06 | norm: 2.5693 | dt: 10549.71ms | tok/sec: 49696.89
step    10 | loss: 3.338424 | lr 1.0000e-06 | norm: 2.5449 | dt: 10565.95ms | tok/sec: 49620.54
step    11 | loss: 3.326447 | lr 9.7798e-07 | norm: 2.2862 | dt: 10577.53ms | tok/sec: 49566.18
step    12 | loss: 3.297659 | lr 9.1406e-07 | norm: 2.2453 | dt: 10640.80ms | tok/sec: 49271.47
step    13 | loss: 3.298663 | lr 8.1450e-07 | norm: 2.2228 | dt: 10551.25ms | tok/sec: 49689.67
step    14 | loss: 3.304088 | lr 6.8906e-07 | norm: 2.5593 | dt: 10415.45ms | tok/sec: 50337.54
step    15 | loss: 3.373518 | lr 5.5000e-07 | norm: 2.3321 | dt: 10446.78ms | tok/sec: 50186.59
step    16 | loss: 3.314626 | lr 4.1094e-07 | norm: 2.3768 | dt: 10416.73ms | tok/sec: 50331.33
step    17 | loss: 3.331042 | lr 2.8550e-07 | norm: 2.1369 | dt: 10248.14ms | tok/sec: 51159.35
step    18 | loss: 3.334763 | lr 1.8594e-07 | norm: 1.8012 | dt: 10206.37ms | tok/sec: 51368.71
validation loss: 3.2394
HellaSwag accuracy: 2959/10042=0.2947
rank 0 sample 0: Hello, I'm a language model, and I know how it works: You, to my knowledge, invented Java!

We all do the same stuff
rank 0 sample 1: Hello, I'm a language model, not a function. It's the last thing that works here, I guess. I think this is very much a misunderstanding
rank 0 sample 2: Hello, I'm a language model, not a writing language. Let's use a syntax like this (which is a bit different from the one in C):
rank 0 sample 3: Hello, I'm a language model, you and I can talk about it!" He also said that he doesn't want to use other people's language, nor
step    19 | loss: 3.189983 | lr 1.2202e-07 | norm: 1.9916 | dt: 80862.14ms | tok/sec: 6483.73
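The “decayed / non-decayed parameter tensors” lines in the logs reflect a common optimizer setup in nanoGPT-style code: tensors with two or more dimensions (embeddings and linear weights) receive weight decay, while 1-D tensors (biases and LayerNorm gains) do not. A hedged numpy sketch of that grouping rule (the helper and example names are hypothetical):

```python
import numpy as np

def split_decay_groups(named_params):
    """Apply weight decay only to tensors with ndim >= 2 (weight
    matrices, embeddings); exempt 1-D tensors (biases, norm gains)."""
    decay = {n: p for n, p in named_params.items() if p.ndim >= 2}
    no_decay = {n: p for n, p in named_params.items() if p.ndim < 2}
    return decay, no_decay

params = {
    "wte.weight": np.zeros((50257, 768)),  # embedding matrix -> decayed
    "ln_f.weight": np.zeros(768),          # LayerNorm gain -> not decayed
    "c_fc.bias": np.zeros(3072),           # bias -> not decayed
}
decay, no_decay = split_decay_groups(params)
```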

Note: try using a smaller max_lr, running more steps, and applying advanced fine-tuning techniques to see more pronounced impacts.

Visualize the Loss

from buildNanoGPT.viz import plot_log
plot_log(log_file='log/log_6500steps.txt', sz='124M')
Min Train Loss: 2.997356
Min Validation Loss: 3.275
Max Hellaswag eval: 0.2782

How to install

The buildNanoGPT package was uploaded to PyPI and can be easily installed using the command below.

pip install buildNanoGPT

Developer install

If you want to develop buildNanoGPT yourself, please use an editable installation.

git clone https://github.com/hdocmsu/buildNanoGPT.git

pip install -e "buildNanoGPT[dev]"

You also need to use an editable installation of nbdev, fastcore, and execnb.

Happy Coding!!!

Note: buildNanoGPT is currently Work in Progress (WIP).
