Implementation of GPT-2 language model for text generation, training custom models and fine-tuning.



A class representing a language dataset for the Bigram Language Model.

Class Description


def __init__(self, path: str, train_val_split: int = 90)

Initialize the LanguageDataset with the provided text data.

  1. path (str): The file path to the text data.

  2. train_val_split (int, optional): The percentage split between training and validation data. Defaults to 90 (90% training, 10% validation).



def _load_dataset(self, path: str) -> str

Load the text data from the given file path.

  1. path (str): The file path to the text data.

Returns:

  1. text (str): The loaded text data as a string.


def _encode(self, s: str) -> List[int]

Convert the text data into a list of numerical representations using character encoding.

  1. s (str): The input text data to be encoded.

Returns:

  1. List[int]: A list of integer token indices representing the encoded text.


def _decode(self, l: List[int]) -> str

Convert a list of numerical representations back into the original text.

  1. l (List[int]): A list of integer token indices representing the encoded text.

Returns:

  1. str: The decoded text as a string.
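As a rough illustration of character-level encoding and decoding, the sketch below builds the two lookup tables from a sample string and round-trips a word through them. It is a minimal stand-in, not the package's code: the function names `build_vocab`, `encode`, and `decode` are hypothetical, and assigning indices in sorted character order is an assumption.

```python
from typing import Dict, List, Tuple


def build_vocab(text: str) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Map each unique character to an integer index and back."""
    chars = sorted(set(text))  # assumption: indices follow sorted character order
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(s: str, stoi: Dict[str, int]) -> List[int]:
    """Turn a string into a list of token indices, one per character."""
    return [stoi[ch] for ch in s]


def decode(tokens: List[int], itos: Dict[int, str]) -> str:
    """Turn a list of token indices back into a string."""
    return "".join(itos[t] for t in tokens)


stoi, itos = build_vocab("hello world")
tokens = encode("hello", stoi)
print(tokens)
print(decode(tokens, itos))
```

In this sketch a character that never appeared in the source text raises a KeyError when encoded, which matters when a prompt contains symbols the training data lacks.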


def _calculate_split(self, train_val_split: int) -> float

Calculate the split index for separating training and validation data.

  1. train_val_split (int): The percentage split between training and validation data.

Returns:

  1. float: The split as a fraction between 0 and 1 (for example, 90 becomes 0.9).
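The percentage-based split can be sketched as follows. This is an assumption about how the fraction is applied, not the package's implementation: the helper `split_data` is hypothetical and works on a plain list rather than a tensor.

```python
from typing import List, Tuple


def split_data(tokens: List[int], train_val_split: int = 90) -> Tuple[List[int], List[int]]:
    """Split a token sequence into training and validation parts."""
    if not 0 < train_val_split < 100:
        raise ValueError("train_val_split must be between 1 and 99")
    fraction = train_val_split / 100   # e.g. 90 -> 0.9
    n = int(fraction * len(tokens))    # index of the first validation token
    return tokens[:n], tokens[n:]


train_part, val_part = split_data(list(range(20)), train_val_split=90)
print(len(train_part), len(val_part))
```

The split is positional, so the validation data is the tail of the text rather than a random sample.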




def text(self) -> str

Get the loaded text data as a string.



def chars(self) -> List[str]

Get the list of unique characters present in the text data.



def vocab_size(self) -> int

Get the size of the character vocabulary.



def stoi(self) -> Dict[str, int]

Get a dictionary mapping characters to their corresponding integer token indices.



def itos(self) -> Dict[int, str]

Get a dictionary mapping integer token indices to their corresponding characters.



def train_data(self) -> torch.Tensor

Get the training data as a tensor containing numerical representations of the text.



def val_data(self) -> torch.Tensor

Get the validation data as a tensor containing numerical representations of the text.


A single head of self-attention used in the Transformer block.


def __init__(self, head_size, n_embd, block_size, dropout):

Initialize the Head with linear layers and a lower triangular mask.


  1. head_size (int): The size of the attention head.

  2. n_embd (int): The embedding dimension of the input.

  3. block_size (int): The maximum length of the input sequence.

  4. dropout (float): The dropout rate.



def forward(self, x):

Perform the forward pass of the attention head.

  1. x (torch.Tensor): The input tensor of shape (B, T, C), where B is the batch size, T is the sequence length, and C is the embedding dimension.

Returns:

  1. torch.Tensor: The output tensor after the self-attention operation, of shape (B, T, head_size).
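To make the role of the lower triangular mask concrete, here is a minimal sketch of causal scaled dot-product attention for a single sequence. It uses plain Python lists instead of torch tensors and takes the query, key, and value vectors directly, whereas an attention head would normally obtain them from learned linear projections of x. The function name `causal_attention` and the scaling by the square root of the head size are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def causal_attention(q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention where position i only sees positions 0..i."""
    head_size = len(q[0])
    out = []
    for i in range(len(q)):
        # Scores against the current and earlier positions only; skipping j > i
        # has the same effect as masking those scores to -inf before the softmax.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(head_size)
            for j in range(i + 1)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted average of the visible value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_attention(x, x, x)
print(attended[0])  # the first position can only attend to itself
```

Because each position ignores later positions, the model cannot look at the token it is being trained to predict.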


Multi-head self-attention module used in the Transformer block.

Class Description


def __init__(self, num_heads, head_size, n_embd, block_size, dropout):

Initialize the MultiHeadAttention with attention heads and projection layer.


  1. num_heads (int): The number of attention heads.

  2. head_size (int): The size of each attention head.

  3. n_embd (int): The embedding dimension of the input.

  4. block_size (int): The maximum length of the input sequence.

  5. dropout (float): The dropout rate.



def forward(self, x):

Perform the forward pass of the multi-head self-attention module.

  1. x (torch.Tensor): The input tensor of shape (B, T, C), where B is the batch size, T is the sequence length, and C is the embedding dimension.

Returns:

  1. torch.Tensor: The output tensor after applying the multi-head self-attention. It has the same shape as the input tensor (B, T, C).


Feed-forward neural network module used in the Transformer block.


def __init__(self, n_embd, dropout):

Initialize the FeedForward with linear layers and activation functions.


  1. n_embd (int): The embedding dimension of the input.

  2. dropout (float): The dropout rate.


def forward(self, x):

Perform the forward pass of the feed-forward neural network.

  1. x (torch.Tensor): The input tensor of shape (B, T, C), where B is the batch size, T is the sequence length, and C is the embedding dimension.

Returns:

  1. torch.Tensor: The output tensor after the feed-forward operation. It has the same shape as the input tensor (B, T, C).


Transformer block module used in the BigramLanguageModel.


def __init__(self, n_embd, n_head, block_size, dropout):

Initialize the Transformer block with self-attention and feed-forward modules.


  1. n_embd (int): The embedding dimension of the input.

  2. n_head (int): The number of attention heads in the multi-head self-attention.

  3. block_size (int): The maximum length of the input sequence.

  4. dropout (float): The dropout rate.



def forward(self, x):

Perform the forward pass of the transformer block.

  1. x (torch.Tensor): The input tensor of shape (B, T, C), where B is the batch size, T is the sequence length, and C is the embedding dimension.

Returns:

  1. torch.Tensor: The output tensor after applying the transformer block. It has the same shape as the input tensor (B, T, C).


Class Description


def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size, dropout)


  1. vocab_size (int): The size of the vocabulary, which determines the number of tokens in the language model.

  2. n_embd (int): The embedding dimension of the input.

  3. n_head (int): The number of attention heads in the multi-head self-attention.

  4. n_layer (int): The number of Transformer blocks in the model.

  5. block_size (int): The maximum length of the input sequence.

  6. dropout (float): The dropout rate for regularization.



def _build_block(self) -> Block

Creates a single Transformer block with specified embedding dimensions and attention heads.


def forward(self, idx: torch.Tensor, target: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, Optional[torch.Tensor]]

Performs the forward pass of the BigramLanguageModel.

  1. idx (torch.Tensor): The input tensor representing the token indices of shape (B, T), where B is the batch size, and T is the sequence length.

  2. target (torch.Tensor, optional): The target tensor representing the token indices for computing the loss. It should have the same shape as idx. If None, no loss will be computed. (Default: None)

  1. torch.Tensor: The logits tensor after the forward pass of the model. It has the shape (B, T, vocab_size).

  2. torch.Tensor: The loss, computed with F.cross_entropy over the B * T flattened positions if target is provided; otherwise None.
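The loss computation can be illustrated in plain Python. The sketch below flattens logits of shape (B, T, vocab_size) into B * T rows and averages the negative log-likelihood of each target, which is what F.cross_entropy computes with its default mean reduction. Whether the package relies on that default is an assumption, and the helper `cross_entropy` here is a hypothetical re-implementation for illustration only.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood of the target index in each row of logits."""
    total = 0.0
    for row, t in zip(logits, targets):
        top = max(row)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_z - row[t]
    return total / len(targets)


# Logits of shape (B=1, T=2, vocab_size=4), flattened to B * T rows.
batch_logits = [[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]]
batch_targets = [[1, 3]]
flat_logits = [row for seq in batch_logits for row in seq]
flat_targets = [t for seq in batch_targets for t in seq]
loss = cross_entropy(flat_logits, flat_targets)
print(loss)  # uniform logits over 4 tokens give log(4)
```

An untrained model with roughly uniform predictions should therefore start near log(vocab_size), which is a handy sanity check for the first training iterations.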


def generate(self, idx: torch.Tensor, max_new_tokens: int) -> List[int]

Generates new tokens using the trained language model.

  1. idx (torch.Tensor): The input tensor representing the token indices of shape (B, T), where B is the batch size, and T is the sequence length.

  2. max_new_tokens (int): The maximum number of new tokens to generate.

  1. List[int]: A list of generated token indices as integers.
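A generation loop of this kind can be sketched without any model at all. In the example below, `next_token_probs` is a hypothetical callable standing in for the trained model, the context is cropped to the last `block_size` tokens before each prediction, and the returned list includes the starting context; all three are assumptions, not confirmed details of the package.

```python
import random
from typing import Callable, List, Sequence


def generate(
    next_token_probs: Callable[[Sequence[int]], Sequence[float]],
    context: List[int],
    max_new_tokens: int,
    block_size: int,
    seed: int = 0,
) -> List[int]:
    """Append max_new_tokens sampled tokens to a copy of context."""
    rng = random.Random(seed)
    tokens = list(context)
    for _ in range(max_new_tokens):
        window = tokens[-block_size:]     # the model only sees the last block_size tokens
        probs = next_token_probs(window)  # one probability per vocabulary entry
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


# Hypothetical stand-in for a trained model: always predicts (last token + 1) mod 5.
def toy_model(window: Sequence[int]) -> List[float]:
    probs = [0.0] * 5
    probs[(window[-1] + 1) % 5] = 1.0
    return probs


generated = generate(toy_model, context=[0], max_new_tokens=6, block_size=4)
print(generated)
```

Sampling from the predicted distribution, rather than always taking the most likely token, is what lets repeated calls produce different text from the same prompt.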

Utility Functions


def set_hyperparams(**kwargs)

Update the hyperparameters dictionary with the provided keyword arguments.


  1. **kwargs: Keyword arguments with hyperparameter names as keys and their corresponding values.


def load_dataset(path: str, train_val_split: int = 90) -> LanguageDataset

Load a dataset from the given file path and create a LanguageDataset object.


  1. path (str): The file path of the dataset to load.

  2. train_val_split (int, optional): Percentage split between training and validation data. Default is 90, which means 90% training and 10% validation.


  1. LanguageDataset : An instance of the LanguageDataset class containing the loaded dataset.


def initialize_model(vocab_size: int, n_embd: Optional[int] = None, n_head: Optional[int] = None, n_layer: Optional[int] = None, block_size: Optional[int] = None, dropout: Optional[float] = None, device: Optional[str] = None) -> BigramLanguageModel

Initialize a BigramLanguageModel with the given hyperparameters.


  1. vocab_size (int): The size of the vocabulary, which determines the number of tokens in the language model.

  2. n_embd (int, optional): The embedding dimension of the input. If not provided, it will be set to a default value.

  3. n_head (int, optional): The number of attention heads in the multi-head self-attention. If not provided, it will be set to a default value.

  4. n_layer (int, optional): The number of Transformer blocks in the model. If not provided, it will be set to a default value.

  5. block_size (int, optional): The maximum length of the input sequence. If not provided, it will be set to a default value.

  6. dropout (float, optional): The dropout rate for regularization. If not provided, it will be set to a default value.

  7. device (str, optional): The device on which to place the model (e.g., 'cpu' or 'cuda'). If not provided, it will be set based on the availability of a GPU.


  1. BigramLanguageModel: An instance of the BigramLanguageModel class initialized with the given hyperparameters.


def train(model: BigramLanguageModel, train_data: torch.Tensor, val_data: torch.Tensor, learning_rate: Optional[float] = None, max_iters: Optional[int] = None, eval_interval: Optional[int] = None, device: Optional[str] = None, eval_iters: Optional[int] = None) -> BigramLanguageModel

Train the provided model on the given training data and validate it on the validation data.


  1. model (BigramLanguageModel): The language model to train.

  2. train_data (torch.Tensor): The training data tensor.

  3. val_data (torch.Tensor): The validation data tensor.

  4. learning_rate (float, optional): The learning rate for the optimizer. If not provided, it will be set to a default value.

  5. max_iters (int, optional): The maximum number of training iterations. If not provided, it will be set to a default value.

  6. eval_interval (int, optional): Interval for printing evaluation results during training. If not provided, it will be set to a default value.

  7. device (str, optional): The device on which to train the model (e.g., 'cpu' or 'cuda'). If not provided, it will be set based on the availability of a GPU.

  8. eval_iters (int, optional): Number of iterations for evaluation during training. If not provided, it will be set to a default value.


  1. BigramLanguageModel: The trained model.


def save(path: str, model: BigramLanguageModel)

Save the state_dict of the provided model to the specified file path.


  1. path (str): The file path where the model state_dict will be saved.

  2. model (BigramLanguageModel): The model whose state_dict will be saved.


def load(path: str, model: BigramLanguageModel, device: Optional[str] = None)

Load the model state_dict from the specified file path and assign it to the provided model.


  1. path (str): The file path from which the model state_dict will be loaded.

  2. model (BigramLanguageModel): The model to which the loaded state_dict will be assigned.

  3. device (str, optional): The device on which to load the model (e.g., 'cpu' or 'cuda'). If not provided, it will be set based on the availability of a GPU.


def fine_tune_model(model: BigramLanguageModel, learning_rate: float, max_iters: int, eval_interval: int, train_data: torch.Tensor, val_data: torch.Tensor) -> BigramLanguageModel

Fine-tune the provided model using the given training data.


  1. model (BigramLanguageModel): The model to be fine-tuned.

  2. learning_rate (float): The learning rate for the optimizer during fine-tuning.

  3. max_iters (int): The maximum number of iterations for fine-tuning.

  4. eval_interval (int): The interval at which to evaluate the model's performance during fine-tuning.

  5. train_data (torch.Tensor): The training data for fine-tuning.

  6. val_data (torch.Tensor): The validation data for evaluating the model's performance during fine-tuning.


  1. BigramLanguageModel: The fine-tuned model.


def gpt2_124M(training_data_path: str, lr: float, max_iters: int, dropout: float = 0.1, eval_iters: int = 200, train_val_split: int = 85) -> BigramLanguageModel

Train a GPT-2 model with 124 million parameters on the provided training data.


  1. training_data_path (str): The path to the training data file.

  2. lr (float): The learning rate for the optimizer during training.

  3. max_iters (int): The maximum number of iterations for training.

  4. dropout (float, optional): The dropout rate to be used in the model. Default is 0.1.

  5. eval_iters (int, optional): The interval at which to evaluate the model's performance during training. Default is 200.

  6. train_val_split (int, optional): The percentage of data to be used for training, and the rest for validation. Default is 85.


  1. BigramLanguageModel: The trained GPT-2 model.


Note: The parameter count depends on vocab_size.
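The dependence on vocabulary size can be made concrete with a back-of-the-envelope count. The sketch below assumes the published GPT-2 small layout (768-dimensional embeddings, 12 layers, a 1024-token context, bias terms on the linear layers, and output weights tied to the token embedding). The model built by this package may differ in any of these, so treat the helper `gpt2_param_count` as hypothetical.

```python
def gpt2_param_count(
    vocab_size: int,
    n_embd: int = 768,
    n_layer: int = 12,
    block_size: int = 1024,
) -> int:
    """Parameter count for a GPT-2 style decoder with tied output weights."""
    token_emb = vocab_size * n_embd   # the only term that depends on vocab_size
    pos_emb = block_size * n_embd
    layer_norm = 2 * n_embd           # scale and shift
    # Fused query/key/value projection plus the output projection, with biases.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two linear layers with a 4x hidden expansion, with biases.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_block = 2 * layer_norm + attention + mlp
    return token_emb + pos_emb + n_layer * per_block + layer_norm


print(gpt2_param_count(50257))  # GPT-2's byte-pair vocabulary
print(gpt2_param_count(65))     # a small character-level vocabulary
```

Under these assumptions the 124M figure is only reached with GPT-2's 50,257-token vocabulary; a character-level vocabulary of a few dozen symbols gives a noticeably smaller model.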


def gpt2_finetune(pretrained_model: BigramLanguageModel, training_data_path: str, lr: float, max_iters: int, eval_iters: int = 200, train_val_split: int = 85) -> BigramLanguageModel

Fine-tune a pretrained GPT-2 model on the provided training data.


  1. pretrained_model (BigramLanguageModel): The pretrained GPT-2 model to be fine-tuned.

  2. training_data_path (str): The path to the training data file.

  3. lr (float): The learning rate for the optimizer during fine-tuning.

  4. max_iters (int): The maximum number of iterations for fine-tuning.

  5. eval_iters (int, optional): The interval at which to evaluate the model's performance during fine-tuning. Default is 200.

  6. train_val_split (int, optional): The percentage of data to be used for training, and the rest for validation. Default is 85.


  1. BigramLanguageModel: The fine-tuned model.


def count_parameters(model: torch.nn.Module) -> int

Count the total number of trainable parameters in the given PyTorch model.


  1. model (torch.nn.Module): The PyTorch model for which the parameters need to be counted.


  1. int: The total number of trainable parameters in the model.


def get_batch(split: str, train_data: torch.Tensor, val_data: torch.Tensor, block_size: int = None, batch_size: int = None, device: str = None) -> Tuple[torch.Tensor, torch.Tensor]

Get a batch of data for training or validation from the specified split.


  1. split (str): The split to get the data from ('train' or 'val').

  2. train_data (torch.Tensor): The training data tensor.

  3. val_data (torch.Tensor): The validation data tensor.

  4. block_size (int, optional): The size of each input sequence block. Defaults to None.

  5. batch_size (int, optional): The number of sequences in a batch. Defaults to None.

  6. device (str, optional): The device to store the batch data on ('cpu' or 'cuda'). Defaults to None.


  1. Tuple[torch.Tensor, torch.Tensor]: A tuple containing two tensors - the input tensor representing the batch of sequences (x) and the target tensor representing the batch of sequences shifted by one (y).
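The relationship between the input and target tensors is easiest to see on a toy sequence. The following sketch samples random windows from a list of token indices and pairs each window with the same window shifted by one position. It uses plain lists and a simplified, hypothetical signature rather than the package's tensors and device handling.

```python
import random
from typing import List, Tuple


def get_batch(
    data: List[int], block_size: int, batch_size: int, seed: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Sample batch_size windows of length block_size and their shifted targets."""
    rng = random.Random(seed)
    # Each start index must leave room for block_size inputs plus one extra target.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i : i + block_size] for i in starts]
    y = [data[i + 1 : i + block_size + 1] for i in starts]
    return x, y


x, y = get_batch(list(range(100)), block_size=8, batch_size=4)
print(x[0])
print(y[0])
```

Each position in y holds the token that follows the corresponding position in x, so a single window yields block_size training examples at once.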


def estimate_loss(model: torch.nn.Module, train_data: torch.Tensor, val_data: torch.Tensor, eval_iters: int = None, block_size: int = None, batch_size: int = None, device: str = None) -> Dict[str, float]

Estimate the average loss of the model on the training and validation data.


  1. model (torch.nn.Module): The PyTorch model for which the loss is to be estimated.

  2. train_data (torch.Tensor): The training data tensor.

  3. val_data (torch.Tensor): The validation data tensor.

  4. eval_iters (int, optional): The number of iterations for estimating the loss. Defaults to None.

  5. block_size (int, optional): The size of each input sequence block. Defaults to None.

  6. batch_size (int, optional): The number of sequences in a batch. Defaults to None.

  7. device (str, optional): The device to perform computations on ('cpu' or 'cuda'). Defaults to None.


  1. Dict[str, float]: A dictionary containing the average loss on the training and validation data.

    Keys: 'train' and 'val'

    Values: Average loss values (float)


default_hyperparams = {
    'batch_size': 64,
    'block_size': 256,
    'max_iters': 5000,
    'learning_rate': 5.859375e-05,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',
    'eval_iters': 200,
    'n_embd': 384,
    'n_head': 6,
    'n_layer': 6,
    'dropout': 0.2,
}


A dictionary containing the default hyperparameters for the GPT-2 model.


  def set_hyperparamsdefault()

Set the default hyperparameters for the GPT-2 model using the set_hyperparams function.

Example Usage

Train model on default Hyperparameters

# Import the required functions and classes

from GPT2ML.Bigram import initialize_model, set_hyperparams, load_dataset, train, save

# Path to your training data file

training_data_path = 'path_to_your_training_data.txt'

# Load the dataset and split into training and validation sets (default split: 90% training, 10% validation)

dataset = load_dataset(training_data_path)

# set default hyperparams


# Initialize the GPT-2 model with default hyperparameters and train on the dataset

model = initialize_model(vocab_size=dataset.vocab_size) # keep record of the 'vocab_size' to use later.

# Train the model on the training data and validate it on the validation data

trained_model = train(model, dataset.train_data, dataset.val_data)

# Save the trained model to a file

save('', trained_model)

Train model on custom hyperparameters

# Import the required functions and classes

from GPT2ML.Bigram import initialize_model, set_hyperparams, load_dataset, train, save

# Path to your training data file

training_data_path = 'path_to_your_training_data.txt'

# Load the dataset and split into training and validation sets (default split: 90% training, 10% validation)

dataset = load_dataset(training_data_path)

# set custom hyperparams

# This will assign custom values to the default hyperparameters



# Initialize the GPT-2 model with custom hyperparameters and train on the dataset

model = initialize_model(vocab_size=dataset.vocab_size) # keep record of the 'vocab_size' to use later.

# Train the model on the training data and validate it on the validation data

trained_model = train(model, dataset.train_data, dataset.val_data)

# Save the trained model to a file

save('', trained_model)

Generate text

# initiate context

# generate the new tokens and decode them into text

Generate text with prompt

# initiate context

# Give prompt

prompt = "To be or not to be, that is the question."

# encode prompt into tokens

prompt_encoded = torch.tensor(dataset._encode(prompt), dtype=torch.long, device=device)
prompt_encoded = prompt_encoded.unsqueeze(0)

# initiate context with prompt

context_with_prompt = torch.cat((context, prompt_encoded), dim=1)  # assumes 'context' is the tensor created in the "initiate context" step

# generate the new tokens and decode them into text



This project is designed for educational purposes to help you learn and understand how generative AI works. It demonstrates the implementation of a GPT-2 variant, a popular generative language model, and provides a hands-on experience in training language models on custom datasets. By exploring the code and running the provided examples, you can gain insights into the underlying principles of generative AI and natural language processing.

Please note that while this project serves as a learning resource, it is essential to be mindful of the ethical considerations and potential risks associated with AI models. The use of AI technology, especially in generating text, should be done responsibly and with respect for ethical guidelines.


This project is licensed under the MIT License.


Here are some resources for better understanding.

  1. Attention Is All You Need

  2. Textbooks Are All You Need

  3. GPT (Generative Pre-trained Transformers)

  4. Language Models are Unsupervised Multitask Learners

  5. GPT-2

The code is completely open source, so you can play around with it and get a better understanding of it.

GPT-3 hyperparameters are not publicly available yet, but you can use your own hyperparameters to train a GPT-3-like base model.

How GPT works

Generative Pre-trained Transformer (GPT) models, such as GPT-2, are designed to predict the next word or token in a sequence of text. They achieve this by leveraging the power of self-attention mechanisms within the Transformer architecture.

At a high level, here's how GPT models work to predict the next word or token:

  1. Language Modeling: GPT models are trained on large amounts of text data using a process called language modeling. During training, the model learns the statistical patterns and dependencies present in the input text. It captures the relationships between different words and the likelihood of one word following another in a sentence.

  2. Contextual Understanding: The key feature of GPT models is their ability to understand the context in which a word or token appears. Instead of treating each word in isolation, the model considers the entire sequence of tokens and assigns higher importance to the tokens that are most relevant for understanding the current word.

  3. Self-Attention: Self-attention is the mechanism that allows GPT models to weigh the importance of different tokens in the input sequence based on their contextual relevance. Tokens that are semantically related or have strong dependencies receive higher attention scores, while unrelated tokens receive lower attention scores.

  4. Autoregressive Generation: To predict the next word or token in a sequence, GPT models use an autoregressive generation process. This means that the model predicts each token one by one, conditioned on the tokens that came before it in the sequence. The previously predicted tokens serve as context to guide the prediction of the next token.

  5. Text Completion: By repeatedly predicting the next token and using the predicted tokens as context, the GPT model can complete a given prompt or generate entirely new text. The generated text tends to be coherent and contextually appropriate because the model has learned from large amounts of text data during pre-training.

  6. Adaptability: GPT models are highly adaptable and can be fine-tuned on specific tasks or domains after pre-training. This process allows them to perform well on a wide range of natural language processing tasks, such as text generation, machine translation, question answering, and more.

In summary, GPT models are powerful language models that predict the next word or token in a sequence by understanding the context and dependencies between words. Their ability to generate coherent and contextually relevant text makes them valuable tools for various natural language processing applications.
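The idea of learning which token tends to follow another, and then completing text one token at a time, can be shown with a toy count-based model. This is not a Transformer: it conditions only on the previous character and always picks the most frequent follower, whereas a GPT model conditions on the whole context window through self-attention and usually samples. The names `train_bigram_counts` and `complete` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict


def train_bigram_counts(text: str) -> Dict[str, Counter]:
    """Count how often each character is followed by each other character."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def complete(prompt: str, counts: Dict[str, Counter], max_new_tokens: int) -> str:
    """Extend the prompt one character at a time, conditioned on the last character."""
    out = prompt
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last character was never seen with a follower
        out += followers.most_common(1)[0][0]  # greedy choice: most frequent follower
    return out


counts = train_bigram_counts("abababac")
print(counts["a"])
print(complete("a", counts, 5))
```

Each newly generated character becomes part of the context for the next prediction, which is the autoregressive loop described above in miniature.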

How we trick them into chatting with a user

As these models are designed to complete a given document, they do not naturally hold a conversation. If you present them with a question, they may simply continue with more questions that resemble the input. To overcome this limitation, we use a specific format for interaction.

For instance, we frame the conversation as if it's between a human and an AI. This format helps the AI understand its role and context better. Here's an example prompt:

"AI: What can I do for you today? \nHuman: ..."

In this format, the human provides their input after "Human:", and the AI responds accordingly. The entire conversation is treated as a single document, with the AI's responses generated by the model and the human's responses provided by an actual person.
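One way to put this format into practice is sketched below: the conversation so far is joined into a single document that ends with an open "AI:" turn, and the model's completion is cut off where it starts writing the next human turn. The helper names `build_prompt` and `extract_reply` are hypothetical, and the `raw` string is a made-up stand-in for model output.

```python
from typing import List, Tuple


def build_prompt(history: List[Tuple[str, str]], user_message: str) -> str:
    """Render the conversation as one document ending with an open AI turn."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"Human: {user_message}")
    lines.append("AI:")
    return "\n".join(lines)


def extract_reply(completion: str) -> str:
    """Keep only the AI turn by cutting where the model begins the next human turn."""
    return completion.split("\nHuman:", 1)[0].strip()


history = [("AI", "What can I do for you today?")]
prompt = build_prompt(history, "Tell me a joke.")
raw = " Why did the tensor cross the road?\nHuman: I don't know."  # hypothetical model output
print(prompt)
print(extract_reply(raw))
```

Truncating at the next "Human:" marker matters because a document-completion model will happily write both sides of the conversation if left to run.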

This approach allows us to harness the power of the language model to generate creative and contextually relevant AI responses, while human input ensures that the conversation remains meaningful and coherent.

By using this interaction format, we can effectively leverage the capabilities of the language model in completing text documents, making it a useful tool for various natural language processing tasks.

