Project description

Language Modeling using Transformers (LMT)

A PyTorch implementation of transformer-based language models including GPT architecture for pretraining and fine-tuning. This project is designed for educational and research purposes to help users understand how the attention mechanism and Transformer architecture work in Large Language Models (LLMs).

🚀 Features

  • GPT Architecture: Complete implementation of decoder-only transformer models
  • Attention Mechanisms: Multi-head self-attention with causal masking
  • Tokenization: Multiple tokenizer implementations (BPE, Naive)
  • Training Pipeline: Comprehensive trainer with pretraining and fine-tuning support
  • Educational Focus: Well-documented code for learning transformer internals
  • Modern Stack: Built with PyTorch 2.7+, Python 3.11+

📦 Installation

Prerequisites

  • Python 3.11 or 3.12
  • PyTorch 2.7+

Install from PyPI

pip install pylmt

Install from GitHub

pip install git+https://github.com/michaelellis003/LMT.git
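
Either way, the package is imported as lmt (as in the Quick Start below), so a quick smoke test is:

python -c "from lmt import GPT, ModelConfig; print('lmt OK')"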

๐Ÿƒโ€โ™‚๏ธ Quick Start

Basic Model Usage

from lmt import GPT, ModelConfig
from lmt.models.config import ModelConfigPresets
import torch

# Create a small GPT model
config = ModelConfigPresets.small_gpt()
model = GPT(config)

# Run a forward pass to inspect the output logits (no sampling yet)
input_ids = torch.randint(0, config.vocab_size, (1, 10))
with torch.no_grad():
    logits = model(input_ids)
    print(f"Output shape: {logits.shape}")  # (1, 10, vocab_size)
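
The forward pass returns logits for every position; sampling from the final position is what turns this into generation. As a minimal sketch, a plain-PyTorch greedy decoding loop (not necessarily what lmt's generate utilities do) can continue the snippet above:

# Greedy decoding: repeatedly append the highest-probability next token
model.eval()
tokens = input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(tokens[:, -config.context_length:])  # respect the context window
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # shape (1, 1)
    tokens = torch.cat([tokens, next_token], dim=1)
print(f"Generated sequence shape: {tokens.shape}")  # (1, 30)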

Training a Model

from lmt import Trainer, GPT
from lmt.training import BaseTrainingConfig
from lmt.models.config import ModelConfigPresets

# Configure model and training
model_config = ModelConfigPresets.small_gpt()
training_config = BaseTrainingConfig(
    num_epochs=10,
    batch_size=4,
    learning_rate=1e-4
)

# Initialize model and trainer
model = GPT(model_config)
trainer = Trainer(
    model=model,
    train_loader=your_train_loader,
    val_loader=your_val_loader,
    config=training_config
)

# Start training
trainer.train()
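
The your_train_loader and your_val_loader placeholders are ordinary PyTorch DataLoaders. One plausible setup, assuming the Trainer consumes (input, target) batches of token ids for next-token prediction (check lmt.training for the exact contract):

import torch
from torch.utils.data import Dataset, DataLoader

class NextTokenDataset(Dataset):
    """Slices a long stream of token ids into fixed-length (input, target) pairs."""

    def __init__(self, token_ids, context_length):
        self.token_ids = token_ids
        self.context_length = context_length

    def __len__(self):
        return len(self.token_ids) - self.context_length

    def __getitem__(self, idx):
        chunk = self.token_ids[idx : idx + self.context_length + 1]
        x = torch.tensor(chunk[:-1])  # input tokens
        y = torch.tensor(chunk[1:])   # targets, shifted one position right
        return x, y

token_ids = list(range(10_000))  # stand-in for real tokenized text
dataset = NextTokenDataset(token_ids, context_length=256)
your_train_loader = DataLoader(dataset, batch_size=4, shuffle=True)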

Using the Training Script

# Pretraining
python scripts/train.py --task pretraining --num_epochs 20 --batch_size 4

# Classification fine-tuning
python scripts/train.py --task classification --download_model --learning_rate 1e-5

📚 Documentation

Model Components

  • GPT: Main model class implementing decoder-only transformer
  • TransformerBlock: Individual transformer layer with attention and feed-forward
  • MultiHeadAttention: Multi-head self-attention mechanism
  • CausalAttention: Attention with causal masking for autoregressive generation (see the sketch below)
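
To make these components concrete, here is a minimal, self-contained multi-head causal self-attention layer in plain PyTorch. It illustrates the mechanism behind MultiHeadAttention and CausalAttention; the library's actual classes may differ in details such as dropout and bias handling:

import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, context_length):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        # Upper-triangular mask blocks attention to future positions
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention with causal masking
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.masked_fill(self.mask[:T, :T], float("-inf"))
        attn = torch.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

x = torch.randn(1, 10, 64)
print(CausalSelfAttention(64, 4, 128)(x).shape)  # torch.Size([1, 10, 64])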

Tokenizers

  • BPETokenizer: Byte-Pair Encoding tokenizer
  • NaiveTokenizer: Simple character-level tokenizer (illustrated below)
  • BaseTokenizer: Abstract base class for custom tokenizers
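
To illustrate the simplest case, a character-level tokenizer in the spirit of NaiveTokenizer (this sketch does not reproduce the library's actual interface) can be written in a few lines:

class CharTokenizer:
    """Maps each unique character to an integer id and back."""

    def __init__(self, text):
        chars = sorted(set(text))
        self.char_to_id = {ch: i for i, ch in enumerate(chars)}
        self.id_to_char = {i: ch for i, ch in enumerate(chars)}

    def encode(self, text):
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(ids, "->", tok.decode(ids))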

Training

  • Trainer: Main training orchestrator with support for pretraining and fine-tuning
  • BaseTrainingConfig: Configuration class for training parameters
  • Custom datasets and dataloaders: Support for various text datasets

🗂️ Project Structure

src/lmt/
├── __init__.py              # Main package exports
├── models/                  # Model architectures
│   ├── gpt/                 # GPT implementation
│   ├── config.py            # Model configuration
│   └── utils.py             # Model utilities
├── layers/                  # Neural network layers
│   ├── attention/           # Attention mechanisms
│   └── transformers/        # Transformer blocks
├── tokenizer/               # Tokenization implementations
├── training/                # Training pipeline
└── generate.py              # Text generation utilities

scripts/
├── train.py                 # Main training script
└── utils.py                 # Training utilities

tests/                       # Comprehensive test suite
notebooks/                   # Educational Jupyter notebooks
docs/                        # Sphinx documentation

📊 Examples and Notebooks

Explore the interactive notebooks in the notebooks/ directory:

  • attention.ipynb: Understanding attention mechanisms
  • pretraining_gpt.ipynb: GPT pretraining walkthrough
  • tokenizer.ipynb: Tokenization techniques

🔧 Configuration

Model Configuration

from lmt.models.config import ModelConfig

config = ModelConfig(
    vocab_size=50257,
    embed_dim=768,
    context_length=1024,
    num_layers=12,
    num_heads=12,
    dropout=0.1
)
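
These values mirror GPT-2 small. Assuming a standard GPT-2-style layout (learned positional embeddings, 4x feed-forward width, a weight-tied output head, biases ignored), a back-of-the-envelope parameter count follows directly from the config; lmt's exact architecture may differ slightly:

# Rough parameter count for a GPT-2-style model (an assumption, not lmt's exact layout)
V, d, T, L = config.vocab_size, config.embed_dim, config.context_length, config.num_layers
embeddings = V * d + T * d   # token + learned positional embeddings
per_block = 12 * d * d       # attention (~4*d^2) + feed-forward (~8*d^2), ignoring biases
total = embeddings + L * per_block
print(f"~{total / 1e6:.0f}M parameters")  # ~124M for the values above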

Training Configuration

from lmt.training.config import BaseTrainingConfig

training_config = BaseTrainingConfig(
    num_epochs=10,
    batch_size=8,
    learning_rate=3e-4,
    weight_decay=0.1,
    print_every=100,
    eval_every=500
)

📄 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Download files

Download the file for your platform.

Source Distribution

pylmt-0.2.10.tar.gz (24.5 kB)

Built Distribution

pylmt-0.2.10-py3-none-any.whl (37.3 kB)

File details

Details for the file pylmt-0.2.10.tar.gz.

File metadata

  • Download URL: pylmt-0.2.10.tar.gz
  • Upload date:
  • Size: 24.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pylmt-0.2.10.tar.gz

  • SHA256: 93b4a721d706ddf1a21ffa06a78e36a7a022f64f51eb0b7d4c9e6129d49f5816
  • MD5: 21972ffa65d27d1f5df32ac10e1fd60a
  • BLAKE2b-256: eccb31338efc9ca87781f8cbaac136d3805fae5e669624ce77eab378dbc94c3f

File details

Details for the file pylmt-0.2.10-py3-none-any.whl.

File metadata

  • Download URL: pylmt-0.2.10-py3-none-any.whl
  • Upload date:
  • Size: 37.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pylmt-0.2.10-py3-none-any.whl

  • SHA256: 9a572b1f18c36e4ee1ad92809018be02bf6087d61b329bc5ede568a71275d573
  • MD5: 27c64a7470937b5b292928187bf09160
  • BLAKE2b-256: 9f675e0279e78e159457c4d2d76e3d3c13f7963f1b1966b622ee412a5e8fda21
