buildNanoGPT
buildNanoGPT is developed based on Andrej Karpathy’s build-nanoGPT repo and his lecture Let’s reproduce GPT-2 (124M), with added notes and details for teaching purposes. It is built with nbdev, which enables package development, testing, documentation, and dissemination all in one place: Jupyter Notebook, or in my case Visual Studio Code’s Jupyter Notebook support 😄.
Literate Programming
buildNanoGPT
flowchart LR
A(Andrej's build-nanoGPT) --> C((Combination))
B(Jeremy's nbdev) --> C
C -->|Literate Programming| D(buildNanoGPT)
Disclaimers
buildNanoGPT is written based on Andrej Karpathy’s GitHub repo named build-nanoGPT and his “Neural Networks: Zero to Hero” lecture series, specifically the lecture called Let’s reproduce GPT-2 (124M).
Andrej is the man who needs no introduction in the field of Deep Learning. He released a series of lectures called Neural Networks: Zero to Hero, which I found extremely educational and practical. I am reviewing the lectures and creating notes for myself and for teaching purposes.
buildNanoGPT was written using nbdev, which was developed by Jeremy Howard, a man who also needs no introduction in the field of Deep Learning. Jeremy created the fastai Deep Learning library and courses, which are extremely influential. I highly recommend fastai if you are interested in starting your journey into ML and DL.
nbdev is a powerful tool that can be used to efficiently develop, build, test, document, and distribute software packages all in one place: Jupyter Notebook, or Jupyter Notebooks in VS Code, which I am using.
If you study lectures by Andrej and Jeremy, you will probably notice that they are both great educators who use both top-down and bottom-up approaches in their teaching; Andrej predominantly uses the bottom-up approach, while Jeremy predominantly uses the top-down one. I am personally fascinated by both educators, have found value in both of their styles, and hope you will too!
Usage
Prepare FineWeb-Edu-10B data
from buildNanoGPT import data
import tiktoken
import numpy as np
enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens['<|endoftext|>'] # end of text token
eot
50256
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.uint16)
t_ref
array([50256, 15496, 11, 995, 0], dtype=uint16)
t_ref = [eot]
t_ref.extend(enc.encode("Hello, world!"))
t_ref = np.array(t_ref).astype(np.int32) # same tokens as int32; uint16 is enough since the GPT-2 vocab (50257) fits below 2**16
t_ref
array([50256, 15496, 11, 995, 0], dtype=int32)
doc = {"text":"Hello, world!"}
t_test = data.tokenize(doc)
t_test
array([50256, 15496, 11, 995, 0], dtype=uint16)
assert np.all(t_ref == t_test)
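For reference, the behavior of data.tokenize can be sketched roughly as below. This is a minimal illustration, not the library’s exact implementation; the encoder and eot token are passed in explicitly here, whereas the library uses the module-level gpt2 encoder.

```python
import numpy as np

def tokenize_sketch(doc, enc, eot):
    # prepend the end-of-text token as a document delimiter,
    # then encode the document body
    tokens = [eot]
    tokens.extend(enc.encode(doc["text"]))
    tokens_np = np.array(tokens)
    # GPT-2 token ids fit in uint16 (vocab size 50257 < 2**16)
    assert (0 <= tokens_np).all() and (tokens_np < 2**16).all(), "token id out of uint16 range"
    return tokens_np.astype(np.uint16)
```

Storing shards as uint16 instead of int32 halves the dataset footprint on disk, which matters at the 10B-token scale.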
# Download and Prepare the FineWeb-Edu-10B sample Data
data.edu_fineweb10B_prep(is_test=True)
Resolving data files: 0%| | 0/1630 [00:00<?, ?it/s]
Loading dataset shards: 0%| | 0/98 [00:00<?, ?it/s]
'Hello from `prepare_edu_fineweb10B()`! if you want to download the dataset, set is_test=False and run again.'
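Internally, the downloaded token stream is written out as fixed-size uint16 shards (100M tokens per shard in the original build-nanoGPT setup), with the last shard possibly shorter. The chunking logic can be sketched as follows (an illustrative helper, not the library’s API):

```python
import numpy as np

def shard_tokens(tokens_np, shard_size):
    # split a flat uint16 token array into fixed-size shards;
    # the final shard holds whatever remains and may be shorter
    return [tokens_np[i:i + shard_size] for i in range(0, len(tokens_np), shard_size)]
```

With 10B tokens and a shard size of 100M, this yields the ~100 shards reported by the data loaders below.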
Prepare HellaSwag Evaluation data
data.hellaswag_val_prep(is_test=True)
'Hello from `hellaswag_val_prep()`! if you want to download the dataset, set is_test=False and run again.'
Load Pre-trained Weight
from buildNanoGPT.model import GPT, GPTConfig
from buildNanoGPT.train import DDPConfig, TrainingConfig, generate_text
import tiktoken
import torch
from torch.nn import functional as F
master_process = True
model = GPT.from_pretrained("gpt2", master_process)
loading weights from pretrained gpt: gpt2
enc = tiktoken.get_encoding('gpt2')
ddp_cf = DDPConfig()
model.to(ddp_cf.device)
using device: cuda
GPT(
(transformer): ModuleDict(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(h): ModuleList(
(0-11): 12 x Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(c_attn): Linear(in_features=768, out_features=2304, bias=True)
(c_proj): Linear(in_features=768, out_features=768, bias=True)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(gelu): GELU(approximate='tanh')
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
generate_text(model, enc, ddp_cf)
rank 0 sample 0: Hello, I'm a language model, and I do not want to use some third-party file manager I used on my laptop. It would probably be easier
rank 0 sample 1: Hello, I'm a language model, not a problem solver. I should be writing. In the first book, I was in the trouble of proving that
rank 0 sample 2: Hello, I'm a language model, not a script," he said.
Banks and regulators will likely be wary of such a move, but for
rank 0 sample 3: Hello, I'm a language model, you must understand this.
So what really happened?
This article would be too short and concise. That
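generate_text samples autoregressively from the model; the characteristic step in the GPT-2 setup is top-k sampling with k=50. A NumPy sketch of that single sampling step (names here are illustrative, not the library’s API):

```python
import numpy as np

def sample_top_k(logits, k=50, rng=None):
    # keep only the k highest-probability tokens, renormalize, and sample
    rng = rng or np.random.default_rng()
    topk_idx = np.argsort(logits)[-k:]    # indices of the k largest logits
    topk_logits = logits[topk_idx]
    probs = np.exp(topk_logits - topk_logits.max())
    probs /= probs.sum()                  # softmax over the top-k only
    return int(rng.choice(topk_idx, p=probs))
```

Restricting sampling to the top k tokens clips off the long low-probability tail, which is why untrained or lightly trained models still produce token-shaped (if incoherent) text rather than pure noise.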
Training
- import modules and functions
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model
import torch
- set seed for random number generator for reproducibility
set_random_seed(seed=1337) # for reproducibility
- initialize the DDP and Training configs; read the documentation and modify the config parameters as desired
ddp_cf = DDPConfig()
using device: cuda
train_cf = TrainingConfig()
using device: cuda
- set up train and validation dataloaders
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
found 99 shards for split train
found 1 shards for split val
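DataLoaderLite serves contiguous chunks of B*T tokens, with the targets shifted by one position. A simplified single-process sketch of how one batch is cut from a flat token buffer (illustrative; the real loader also handles shard rotation and per-rank offsets under DDP):

```python
import numpy as np

def next_batch(tokens, pos, B, T):
    # take B*T+1 tokens starting at pos; x is the input, y is x shifted by one
    buf = tokens[pos : pos + B * T + 1]
    x = buf[:-1].reshape(B, T)   # inputs
    y = buf[1:].reshape(B, T)    # targets (next-token labels)
    pos += B * T
    # wrap around when the next batch would run off the end of the buffer
    if pos + B * T + 1 > len(tokens):
        pos = 0
    return x, y, pos
```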
- set up the GPT model
model = create_model(ddp_cf)
- train the GPT model
train_GPT(model, train_loader, val_loader, train_cf, ddp_cf)
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,354,560 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 10.9834
HellaSwag accuracy: 2534/10042=0.2523
step 0 | loss: 10.981724 | lr 6.0000e-05 | norm: 15.4339 | dt: 82819.52ms | tok/sec: 6330.49
step 1 | loss: 10.157787 | lr 1.2000e-04 | norm: 6.5679 | dt: 10668.81ms | tok/sec: 49142.14
step 2 | loss: 9.793260 | lr 1.8000e-04 | norm: 2.8270 | dt: 10747.73ms | tok/sec: 48781.28
step 3 | loss: 9.575678 | lr 2.4000e-04 | norm: 2.2934 | dt: 10789.36ms | tok/sec: 48593.07
step 4 | loss: 9.409717 | lr 3.0000e-04 | norm: 2.0182 | dt: 10883.30ms | tok/sec: 48173.61
step 5 | loss: 9.196922 | lr 3.6000e-04 | norm: 2.0160 | dt: 10734.89ms | tok/sec: 48839.61
step 6 | loss: 8.960140 | lr 4.2000e-04 | norm: 1.8684 | dt: 10902.57ms | tok/sec: 48088.46
step 7 | loss: 8.707756 | lr 4.8000e-04 | norm: 1.5884 | dt: 10851.94ms | tok/sec: 48312.84
step 8 | loss: 8.428266 | lr 5.4000e-04 | norm: 1.3737 | dt: 10883.36ms | tok/sec: 48173.34
step 9 | loss: 8.166906 | lr 6.0000e-04 | norm: 1.1468 | dt: 10797.07ms | tok/sec: 48558.35
step 10 | loss: 8.857561 | lr 6.0000e-04 | norm: 23.7457 | dt: 10755.35ms | tok/sec: 48746.74
step 11 | loss: 7.858195 | lr 5.8679e-04 | norm: 0.8712 | dt: 10667.08ms | tok/sec: 49150.09
step 12 | loss: 7.823021 | lr 5.4843e-04 | norm: 0.7075 | dt: 10793.02ms | tok/sec: 48576.59
step 13 | loss: 7.755527 | lr 4.8870e-04 | norm: 0.6744 | dt: 10827.16ms | tok/sec: 48423.42
step 14 | loss: 7.593850 | lr 4.1343e-04 | norm: 0.5836 | dt: 10730.71ms | tok/sec: 48858.64
step 15 | loss: 7.618423 | lr 3.3000e-04 | norm: 0.6430 | dt: 10648.68ms | tok/sec: 49235.03
step 16 | loss: 7.664069 | lr 2.4657e-04 | norm: 0.5456 | dt: 10749.31ms | tok/sec: 48774.10
step 17 | loss: 7.603458 | lr 1.7130e-04 | norm: 0.6211 | dt: 10837.78ms | tok/sec: 48375.97
step 18 | loss: 7.809735 | lr 1.1157e-04 | norm: 0.4929 | dt: 10698.80ms | tok/sec: 49004.37
validation loss: 7.6044
HellaSwag accuracy: 2448/10042=0.2438
rank 0 sample 0: Hello, I'm a language model,:
the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
or:
the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
of:)
step 19 | loss: 7.893970 | lr 7.3215e-05 | norm: 0.6688 | dt: 85602.68ms | tok/sec: 6124.67
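The `calculated gradient accumulation steps: 32` line above follows from simple arithmetic: the desired total batch of 524,288 tokens (2**19) is divided by the tokens processed per forward pass across all processes. A sketch, assuming B=16, T=1024, and a single GPU as in this run (these values are inferred from the log, not read from the library):

```python
total_batch_size = 524288   # desired tokens per optimizer step (2**19)
B, T = 16, 1024             # micro-batch size and sequence length (assumed values)
ddp_world_size = 1          # single GPU in this run

# the total batch must divide evenly into micro-batches across all processes
assert total_batch_size % (B * T * ddp_world_size) == 0
grad_accum_steps = total_batch_size // (B * T * ddp_world_size)
print(grad_accum_steps)  # 32
```

Gradient accumulation lets a single GPU emulate the large effective batch used in the GPT-2 paper by summing gradients over 32 micro-batches before each optimizer step.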
Load Checkpoint
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
- set up the GPT model
ddp_cf = DDPConfig()
model = create_model(ddp_cf)
using device: cuda
- load the model weights from the saved checkpoint
model_checkpoint = torch.load("log/model_00019.pt")
checkpoint_state_dict = model_checkpoint['model']
model.load_state_dict(checkpoint_state_dict)
<All keys matched successfully>
- generate text from saved weights
enc = tiktoken.get_encoding('gpt2')
generate_text(model, enc, ddp_cf)
rank 0 sample 0: Hello, I'm a language model,:
the on a a in is at on in� and are you in the to their for and in the a
rank 0 sample 1: Hello, I'm a language model,� or an, and or and �, and you by are in
to a of or. ( of the to
rank 0 sample 2: Hello, I'm a language model,.
or:
the an-, withs,- and to the a.
, who, and�
rank 0 sample 3: Hello, I'm a language model, a by� to, for. that of they-, which are for and can- be.
of:)
Fine-tune from OpenAI’s weights
from buildNanoGPT.train import train_GPT, set_random_seed
from buildNanoGPT.model import GPT, GPTConfig, DataLoaderLite
from buildNanoGPT.train import DDPConfig, TrainingConfig, create_model, generate_text
import torch
import tiktoken
- load OpenAI’s pre-trained weights
ddp_cf = DDPConfig()
model_fine = GPT.from_pretrained("gpt2", ddp_cf.master_process)
model_fine.to(ddp_cf.device)
using device: cuda
loading weights from pretrained gpt: gpt2
GPT(
(transformer): ModuleDict(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(h): ModuleList(
(0-11): 12 x Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(c_attn): Linear(in_features=768, out_features=2304, bias=True)
(c_proj): Linear(in_features=768, out_features=768, bias=True)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(c_fc): Linear(in_features=768, out_features=3072, bias=True)
(gelu): GELU(approximate='tanh')
(c_proj): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
- set seed for reproducibility
set_random_seed(seed=1337) # for reproducibility
- set up training parameters: set max_lr to a small number since this is a fine-tuning step. More advanced fine-tuning may include supervised fine-tuning (SFT) using custom data and finer control over which layers are fine-tuned more or less.
train_cf = TrainingConfig(max_lr=1e-6)
using device: cuda
- set up train and validation data-loaders
train_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split='train')
val_loader = DataLoaderLite(B=train_cf.B, T=train_cf.T, ddp_cf=ddp_cf, split="val")
found 99 shards for split train
found 1 shards for split val
- fine-tune the model
train_GPT(model_fine, train_loader, val_loader, train_cf, ddp_cf)
total desired batch size: 524288
=> calculated gradient accumulation steps: 32
num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
validation loss: 3.2530
HellaSwag accuracy: 2970/10042=0.2958
step 0 | loss: 3.279157 | lr 1.0000e-07 | norm: 2.3655 | dt: 80251.91ms | tok/sec: 6533.03
step 1 | loss: 3.322400 | lr 2.0000e-07 | norm: 2.3916 | dt: 10466.55ms | tok/sec: 50091.77
step 2 | loss: 3.310521 | lr 3.0000e-07 | norm: 2.5691 | dt: 10404.72ms | tok/sec: 50389.42
step 3 | loss: 3.403320 | lr 4.0000e-07 | norm: 2.5293 | dt: 10539.22ms | tok/sec: 49746.40
step 4 | loss: 3.280189 | lr 5.0000e-07 | norm: 2.5589 | dt: 10462.80ms | tok/sec: 50109.70
step 5 | loss: 3.341536 | lr 6.0000e-07 | norm: 2.4456 | dt: 10489.14ms | tok/sec: 49983.90
step 6 | loss: 3.388632 | lr 7.0000e-07 | norm: 2.3444 | dt: 10656.34ms | tok/sec: 49199.62
step 7 | loss: 3.336595 | lr 8.0000e-07 | norm: 2.4381 | dt: 10750.67ms | tok/sec: 48767.94
step 8 | loss: 3.358722 | lr 9.0000e-07 | norm: 2.0390 | dt: 10728.56ms | tok/sec: 48868.44
step 9 | loss: 3.303847 | lr 1.0000e-06 | norm: 2.5693 | dt: 10549.71ms | tok/sec: 49696.89
step 10 | loss: 3.338424 | lr 1.0000e-06 | norm: 2.5449 | dt: 10565.95ms | tok/sec: 49620.54
step 11 | loss: 3.326447 | lr 9.7798e-07 | norm: 2.2862 | dt: 10577.53ms | tok/sec: 49566.18
step 12 | loss: 3.297659 | lr 9.1406e-07 | norm: 2.2453 | dt: 10640.80ms | tok/sec: 49271.47
step 13 | loss: 3.298663 | lr 8.1450e-07 | norm: 2.2228 | dt: 10551.25ms | tok/sec: 49689.67
step 14 | loss: 3.304088 | lr 6.8906e-07 | norm: 2.5593 | dt: 10415.45ms | tok/sec: 50337.54
step 15 | loss: 3.373518 | lr 5.5000e-07 | norm: 2.3321 | dt: 10446.78ms | tok/sec: 50186.59
step 16 | loss: 3.314626 | lr 4.1094e-07 | norm: 2.3768 | dt: 10416.73ms | tok/sec: 50331.33
step 17 | loss: 3.331042 | lr 2.8550e-07 | norm: 2.1369 | dt: 10248.14ms | tok/sec: 51159.35
step 18 | loss: 3.334763 | lr 1.8594e-07 | norm: 1.8012 | dt: 10206.37ms | tok/sec: 51368.71
validation loss: 3.2394
HellaSwag accuracy: 2959/10042=0.2947
rank 0 sample 0: Hello, I'm a language model, and I know how it works: You, to my knowledge, invented Java!
We all do the same stuff
rank 0 sample 1: Hello, I'm a language model, not a function. It's the last thing that works here, I guess. I think this is very much a misunderstanding
rank 0 sample 2: Hello, I'm a language model, not a writing language. Let's use a syntax like this (which is a bit different from the one in C):
rank 0 sample 3: Hello, I'm a language model, you and I can talk about it!" He also said that he doesn't want to use other people's language, nor
step 19 | loss: 3.189983 | lr 1.2202e-07 | norm: 1.9916 | dt: 80862.14ms | tok/sec: 6483.73
Note: try using a smaller max_lr, running more steps, and applying advanced fine-tuning techniques to see more pronounced impacts.
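The learning-rate values in the log above are consistent with the linear-warmup plus cosine-decay schedule from the original build-nanoGPT, with 10 warmup steps, 20 total steps, and min_lr = 0.1 * max_lr. A sketch of that schedule (parameter values inferred from the log):

```python
import math

def get_lr(it, max_lr=1e-6, warmup_steps=10, max_steps=20):
    min_lr = max_lr * 0.1
    # 1) linear warmup up to max_lr
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    # 2) past the end of training, hold at min_lr
    if it > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```

Evaluating it reproduces the logged values, e.g. step 0 gives 1.0000e-07 and step 19 gives 1.2202e-07.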
Visualize the Loss
from buildNanoGPT.viz import plot_log
plot_log(log_file='log/log_6500steps.txt', sz='124M')
Min Train Loss: 2.997356
Min Validation Loss: 3.275
Max Hellaswag eval: 0.2782
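plot_log works by parsing the text log written during training. The per-step console lines above follow a simple `step N | loss: X | lr Y | ...` format; the on-disk log that plot_log reads may use a different layout, but parsing either is straightforward. As an illustration, a minimal parser for the per-step lines shown in this README:

```python
import re

STEP_RE = re.compile(r"step (\d+) \| loss: ([\d.]+) \| lr ([\d.e+-]+)")

def parse_train_log(lines):
    # extract (step, loss, lr) triples from per-step log lines,
    # skipping validation/HellaSwag/sample lines that don't match
    out = []
    for line in lines:
        m = STEP_RE.search(line)
        if m:
            out.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return out
```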
How to install
The buildNanoGPT package is published on PyPI and can be installed with the command below.
pip install buildNanoGPT
Developer install
If you want to develop buildNanoGPT yourself, please use an editable
installation.
git clone https://github.com/hdocmsu/buildNanoGPT.git
pip install -e "buildNanoGPT[dev]"
You also need editable installations of nbdev, fastcore, and execnb.
Happy Coding!!!
Note: buildNanoGPT is currently Work in Progress (WIP).