
microGPT 🚀

Lightweight GPT implementation designed for resource-constrained environments


🎯 Overview

microGPT is a character-level implementation of GPT (Generative Pre-trained Transformer) language models. It is inspired by NanoGPT and follows the same design philosophy as microBERT: significantly reduce computational resource requirements while maintaining model performance.

🔤 Key Design Choice: Character-Level Tokenization Unlike traditional GPT models that use subword or word-level tokenization, microGPT operates at the character level - each token represents a single character (A-Z, a-z, 0-9, punctuation, spaces). This approach:

  • Simplifies the architecture and reduces vocabulary size (~65 characters vs. 50k+ subword tokens)
  • Eliminates the need for complex tokenization schemes
  • Makes the model more interpretable and easier to debug
  • Requires longer sequences but provides finer-grained text generation control

✨ Key Features

🎯 Lightweight Design

  • Model Compression: Significantly reduced parameter count through carefully designed architecture
  • Computational Optimization: Flash Attention support for improved inference efficiency
  • Memory Efficient: Optimized for resource-constrained environments
  • Character-Level Simplicity: Minimal vocabulary size (~65 characters) eliminates complex tokenization overhead

🚀 Resource Adaptation

  • Mobile-Friendly: Runs on laptops, embedded devices, and mobile platforms
  • Fast Training: Supports rapid prototyping and experimentation
  • Flexible Configuration: Adjustable model size based on hardware resources

🏗️ Architecture

Core Components

  • Transformer Blocks: Standard self-attention + MLP architecture
  • Flash Attention: Efficient attention computation for PyTorch 2.0+
  • Weight Tying: Token embedding and output layer weight sharing
  • Layer Normalization: Optional bias support
  • Character-Level Tokenization: Direct character-to-integer mapping without complex tokenization schemes
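The self-attention step at the heart of each Transformer block can be sketched as follows. This is a minimal NumPy illustration of single-head causal scaled dot-product attention, not the package's actual PyTorch code; Flash Attention computes the same result but fuses these steps into one memory-efficient kernel. The weight matrices here are random stand-ins.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal scaled dot-product attention (illustrative only)."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv           # project to queries/keys/values
    scores = (q @ k.T) / np.sqrt(d)            # similarities, scaled by sqrt(dim)
    mask = np.triu(np.ones((T, T), dtype=bool), 1)
    scores[mask] = -np.inf                     # causal mask: no attending to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                         # weighted sum of values

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
```

Because of the causal mask, position 0 can only attend to itself, so its output is exactly its own value vector.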

Default Configuration (Lightweight)

n_layer = 6      # 6 Transformer layers
n_head = 6       # 6 attention heads
n_embd = 384     # 384-dimensional embeddings
block_size = 256 # 256 token context window
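As a back-of-the-envelope sanity check, the parameter count implied by this configuration can be estimated by hand (a rough sketch only; exact numbers depend on bias settings, LayerNorm parameters, and weight tying):

```python
n_layer, n_head, n_embd, block_size, vocab_size = 6, 6, 384, 256, 65

# Per Transformer layer (ignoring biases and LayerNorm gains):
#   attention: Wq, Wk, Wv, Wo            -> 4 * n_embd^2
#   MLP: two projections with 4x expansion -> 8 * n_embd^2
per_layer = 12 * n_embd ** 2

# Embeddings: token table + learned positions; with weight tying the
# output head reuses the token embedding, so it adds nothing extra.
embeddings = vocab_size * n_embd + block_size * n_embd

total = n_layer * per_layer + embeddings
print(f"~{total / 1e6:.1f}M parameters")  # → ~10.7M parameters
```

At roughly 10.7M parameters, the default model is orders of magnitude smaller than full-scale GPT variants, which is what makes laptop-scale training feasible.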

🚀 Quick Start

Package-Based Training

microGPT is designed to work as a standalone package. After installation, you can train models from any directory without needing the source code.

📊 Dataset Preparation

microGPT comes with a built-in Shakespeare dataset for character-level language modeling. The dataset preparation script and raw text data are included in the package for easy access. The dataset preparation process:

  1. Uses the Shakespeare text included in the package
  2. Tokenizes characters into integers (vocabulary size: ~65 characters)
  3. Splits data into training (90%) and validation (10%) sets
  4. Saves processed data in ./data/shakespeare_char/ relative to your current working directory
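Steps 2 and 3 above can be sketched in a few lines of plain Python. This is a simplified illustration with a stand-in string; the actual script processes the full Shakespeare text and also writes the encoded splits and vocabulary metadata to disk.

```python
text = "To be, or not to be, that is the question."  # stand-in for the Shakespeare corpus

# Build the character vocabulary and encode the text as integer IDs
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
ids = [stoi[ch] for ch in text]

# 90/10 train/validation split
n = int(0.9 * len(ids))
train_ids, val_ids = ids[:n], ids[n:]
```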

📦 Installation

Development Installation (for contributors)

# Clone the repository first
git clone https://github.com/henrywoo/microgpt.git
cd microgpt

# Install in editable mode for development
pip install -e .

Production Installation (for users)

# Install directly from PyPI
pip install microgpt

🎓 Training

Complete Training Workflow

You can train microGPT from any directory using the installed package:

# 1. Prepare the dataset (creates ./data/shakespeare_char/)
python -m microgpt.prepare_dataset

# 2. Start training (uses the prepared dataset)
python -m microgpt.pretrain.clm_pretrain_v0

No git repo checkout required! After installation, you can run training from anywhere.

1. Prepare the Dataset

First, prepare the Shakespeare dataset for character-level language modeling:

# From any directory where you want to store the data
python -m microgpt.prepare_dataset

This script will:

  • Use the Shakespeare text included in the package
  • Tokenize characters into integers (vocabulary size: ~65 characters)
  • Split the data into training (90%) and validation (10%) sets
  • Save the processed data in ./data/shakespeare_char/ relative to your current working directory
  • Show the exact path where the data is saved for easy reference

🔤 Character-Level Tokenization Details:

  • Each character (A-Z, a-z, 0-9, punctuation, space) gets a unique integer ID
  • No complex tokenization schemes like BPE (Byte Pair Encoding) or WordPiece
  • Direct character-to-integer mapping: each character is assigned its index in the sorted vocabulary, so IDs fall in the range 0 to ~64 (they are vocabulary indices, not ASCII codes)
  • Vocabulary size is typically ~65 characters for English text
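A character-level tokenizer of this kind fits in a few lines. This is a minimal sketch of the index-based mapping; the packaged prepare script saves equivalent vocabulary metadata in meta.pkl.

```python
text = "Hello, world!"
chars = sorted(set(text))                      # vocabulary: unique characters, sorted
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer ID (vocabulary index)
itos = {i: ch for ch, i in stoi.items()}       # integer ID -> char

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

assert decode(encode(text)) == text            # lossless round trip
```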

2. Start Training

# From any directory, run the training script directly from the package
python -m microgpt.pretrain.clm_pretrain_v0

The training script will automatically:

  • Load the prepared dataset from ./data/shakespeare_char/ (relative to current directory)
  • Initialize the microGPT model with default configuration
  • Train using the specified hyperparameters
  • Save checkpoints and generate sample text
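Under the hood, causal language-model training pairs each block_size-character chunk with the same chunk shifted one position, so the model learns next-character prediction. A simplified sketch of how such (input, target) pairs are formed (the actual training script samples random offsets from the prepared dataset files):

```python
block_size = 8
ids = list(range(20))  # stand-in for the encoded Shakespeare data

def get_example(ids, i, block_size):
    """Return an (input, target) pair starting at offset i."""
    x = ids[i : i + block_size]           # what the model sees
    y = ids[i + 1 : i + 1 + block_size]   # what it must predict (shifted by one)
    return x, y

x, y = get_example(ids, 3, block_size)
# At every position t, y[t] is the token that follows x[t].
```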

🔌 API Usage

import torch
from microgpt.model import MicroGPT, MicroGPTConfig

# Create configuration
config = MicroGPTConfig(
    n_layer=6,
    n_head=6, 
    n_embd=384,
    block_size=256,
    vocab_size=65  # Must match the vocabulary size in meta.pkl (typically ~65 for character-level)
)

# Initialize model
model = MicroGPT(config)

# Generate text
# Note: For meaningful text generation, the model should be trained first
# This example shows the structure, but untrained models will generate random text
# Input should be character-level token IDs (e.g., encoded text from meta.pkl)
generated = model.generate(
    idx=torch.tensor([[1, 2, 3]]), 
    max_new_tokens=50,
    temperature=0.8
)

# Decode the generated text (converts character-level token IDs back to characters)
generated_text = MicroGPT.decode_text(generated[0])
print(f"Generated text: {generated_text}")
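The temperature parameter controls randomness in generation: logits are divided by the temperature before the softmax, so values below 1.0 sharpen the distribution (more predictable text) and values above 1.0 flatten it (more varied text). A minimal pure-Python sketch of this sampling step, independent of the package:

```python
import math
import random

def sample_next(logits, temperature=0.8):
    """Softmax over temperature-scaled logits, then sample one token ID."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token_id = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return token_id, probs

random.seed(0)
token_id, probs = sample_next([2.0, 1.0, 0.1], temperature=0.8)
```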

🎭 Sampling from Trained Models

After training a model, you can generate text samples using the sample.py script. This script loads a trained checkpoint and generates text based on your specifications. The sampling process works at the character level, generating one character at a time based on the learned character-level patterns.

🚀 Basic Usage

# Generate samples from a trained model
python -m microgpt.sample

Acknowledgments

Contact

For questions or suggestions, please:


microGPT - Making AI lighter, deployment simpler 🚀
