# microGPT 🚀

*Lightweight GPT implementation designed for resource-constrained environments*

## 🎯 Overview
microGPT is a character-level implementation of GPT (Generative Pre-trained Transformer) language models, inspired by NanoGPT but following the same design philosophy as microBERT: significantly reducing computational resource requirements while maintaining model performance.
### 🔤 Key Design Choice: Character-Level Tokenization

Unlike traditional GPT models that use subword or word-level tokenization, microGPT operates at the character level: each token represents a single character (A-Z, a-z, 0-9, punctuation, spaces). This approach:
- Simplifies the architecture and reduces vocabulary size (~65 characters vs. 50k+ subword tokens)
- Eliminates the need for complex tokenization schemes
- Makes the model more interpretable and easier to debug
- Requires longer sequences but provides finer-grained text generation control
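The character-level scheme described above can be sketched in a few lines (a minimal illustration of the idea, not microGPT's actual encoder):

```python
# Minimal character-level tokenizer sketch (illustrative only).
text = "Hello, world!"

# Vocabulary: every distinct character in the corpus, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer ID
itos = {i: ch for ch, i in stoi.items()}      # integer ID -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

ids = encode("Hello")
assert decode(ids) == "Hello"  # lossless round trip
print(len(chars), ids)
```

Because the vocabulary is just the set of characters seen in the corpus, no merge tables or tokenizer training are needed, which is exactly what keeps the pipeline small.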
## ✨ Key Features

### 🎯 Lightweight Design

- **Model Compression**: Significantly reduced parameter count through a carefully designed architecture
- **Computational Optimization**: Flash Attention support for improved inference efficiency
- **Memory Efficient**: Optimized for resource-constrained environments
- **Character-Level Simplicity**: Minimal vocabulary size (~65 characters) eliminates complex tokenization overhead

### 🚀 Resource Adaptation

- **Mobile-Friendly**: Runs on laptops, embedded devices, and mobile platforms
- **Fast Training**: Supports rapid prototyping and experimentation
- **Flexible Configuration**: Adjustable model size based on hardware resources
## 🏗️ Architecture

### Core Components

- **Transformer Blocks**: Standard self-attention + MLP architecture
- **Flash Attention**: Efficient attention computation on PyTorch 2.0+
- **Weight Tying**: Token embedding and output layer weight sharing
- **Layer Normalization**: Optional bias support
- **Character-Level Tokenization**: Direct character-to-integer mapping without complex tokenization schemes
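The attention path can be sketched with PyTorch 2.0's built-in `scaled_dot_product_attention`, which dispatches to Flash Attention kernels where supported (a simplified sketch; microGPT's actual module may differ in details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Simplified causal self-attention. F.scaled_dot_product_attention
    uses Flash Attention kernels on PyTorch 2.0+ where available."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused Q, K, V projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape each to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

x = torch.randn(2, 16, 384)                 # (batch, sequence, embedding)
attn = CausalSelfAttention(n_embd=384, n_head=6)
print(attn(x).shape)                        # torch.Size([2, 16, 384])
```

The `is_causal=True` flag applies the autoregressive mask inside the fused kernel, so no explicit mask tensor needs to be materialized.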
### Default Configuration (Lightweight)

```python
n_layer = 6       # 6 Transformer layers
n_head = 6        # 6 attention heads
n_embd = 384      # 384-dimensional embeddings
block_size = 256  # 256-token context window
```
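A rough parameter count for this configuration can be estimated by hand using the standard GPT rule of thumb of about 12·n_embd² weights per layer, plus token and positional embeddings (an estimate, not an exact count of microGPT's parameters):

```python
# Back-of-the-envelope parameter count for the default configuration.
n_layer, n_embd, block_size, vocab_size = 6, 384, 256, 65

per_layer = 12 * n_embd ** 2      # attention (~4*n_embd^2) + MLP (~8*n_embd^2)
embeddings = vocab_size * n_embd  # token embeddings (tied with the output head)
positions = block_size * n_embd   # learned positional embeddings

total = n_layer * per_layer + embeddings + positions
print(f"~{total / 1e6:.1f}M parameters")  # ~10.7M parameters
```

At roughly 10.7M parameters, the model fits comfortably in memory on a laptop CPU, which is the point of the lightweight defaults.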
## 🚀 Quick Start

### Package-Based Training
microGPT is designed to work as a standalone package. After installation, you can train models from any directory without needing the source code.
## 📊 Dataset Preparation
microGPT comes with a built-in Shakespeare dataset for character-level language modeling. The dataset preparation script and raw text data are included in the package for easy access. The dataset preparation process:
- Uses the Shakespeare text included in the package
- Tokenizes characters into integers (vocabulary size: ~65 characters)
- Splits data into training (90%) and validation (10%) sets
- Saves processed data to `./data/shakespeare_char/`, relative to your current working directory
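The steps above can be sketched as follows (a minimal illustration using an inline sample string in place of the packaged Shakespeare file; the `meta.pkl` file name follows the convention mentioned later in this README):

```python
import pickle

# Stand-in for the packaged Shakespeare text.
text = "To be, or not to be, that is the question.\n" * 100

# 1. Build the character vocabulary and encode the corpus as integer IDs.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = [stoi[c] for c in text]

# 2. Split into 90% training / 10% validation.
n = int(0.9 * len(data))
train_ids, val_ids = data[:n], data[n:]

# 3. Persist the vocabulary so generation can decode IDs back to text
#    (the real script writes this to meta.pkl on disk).
meta = {"vocab_size": len(chars), "stoi": stoi,
        "itos": {i: ch for ch, i in stoi.items()}}
blob = pickle.dumps(meta)

print(len(chars), len(train_ids), len(val_ids))
```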
## 📦 Installation

### Development Installation (for contributors)

```bash
# Clone the repository first
git clone https://github.com/henrywoo/microgpt.git
cd microgpt

# Install in editable mode for development
pip install -e .
```
### Production Installation (for users)

```bash
# Install directly from PyPI
pip install microgpt
```
## 🎓 Training

### Complete Training Workflow

You can train microGPT from any directory using the installed package:

```bash
# 1. Prepare the dataset (creates ./data/shakespeare_char/)
python -m microgpt.prepare_dataset

# 2. Start training (uses the prepared dataset)
python -m microgpt.pretrain.clm_pretrain_v0
```
No git repo checkout required! After installation, you can run training from anywhere.
### 1. Prepare the Dataset

First, prepare the Shakespeare dataset for character-level language modeling:

```bash
# From any directory where you want to store the data
python -m microgpt.prepare_dataset
```
This script will:

- Use the Shakespeare text included in the package
- Tokenize characters into integers (vocabulary size: ~65 characters)
- Split the data into training (90%) and validation (10%) sets
- Save processed data to `./data/shakespeare_char/`, relative to your current working directory
- Show the exact path where the data is saved for easy reference
🔤 **Character-Level Tokenization Details**:

- Each character (A-Z, a-z, 0-9, punctuation, space) gets a unique integer ID
- No complex tokenization schemes like BPE (Byte Pair Encoding) or WordPiece
- Direct character-to-integer mapping, e.g. `'H' → 72, 'e' → 101, 'l' → 108, 'l' → 108, 'o' → 111` (illustrative IDs; the actual IDs depend on the characters present in the corpus)
- Vocabulary size is typically ~65 characters for English text
### 2. Start Training

```bash
# From any directory, run the training script directly from the package
python -m microgpt.pretrain.clm_pretrain_v0
```
The training script will automatically:

- Load the prepared dataset from `./data/shakespeare_char/` (relative to the current directory)
- Initialize the microGPT model with the default configuration
- Train using the specified hyperparameters
- Save checkpoints and generate sample text
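In outline, such a training loop looks something like this (a heavily condensed sketch with a toy embedding model standing in for microGPT, and random IDs standing in for the prepared dataset):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, batch_size = 65, 32, 4

# Toy stand-in for the real model: embedding + linear next-character head.
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

data = torch.randint(0, vocab_size, (10_000,))  # stand-in for the encoded corpus

def get_batch():
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # shifted targets
    return x, y

for step in range(50):
    x, y = get_batch()
    logits = model(x)                                     # (B, T, vocab_size)
    loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.2f}")
```

The causal language-modeling objective is simply next-character cross-entropy: targets are the inputs shifted one position to the right.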
## 🔌 API Usage

```python
import torch
from microgpt.model import MicroGPT, MicroGPTConfig

# Create configuration
config = MicroGPTConfig(
    n_layer=6,
    n_head=6,
    n_embd=384,
    block_size=256,
    vocab_size=65,  # must match the vocabulary size in meta.pkl (typically ~65 for character-level)
)

# Initialize model
model = MicroGPT(config)

# Generate text
# Note: for meaningful output the model should be trained first;
# an untrained model will generate random text.
# Input should be character-level token IDs (e.g., text encoded with meta.pkl)
generated = model.generate(
    idx=torch.tensor([[1, 2, 3]]),
    max_new_tokens=50,
    temperature=0.8,
)

# Decode the generated text (converts character-level token IDs back to characters)
generated_text = MicroGPT.decode_text(generated[0])
print(f"Generated text: {generated_text}")
```
## 🎭 Sampling from Trained Models
After training a model, you can generate text samples using the `sample.py` script. It loads a trained checkpoint and generates text to your specifications, working at the character level: one character is produced at a time from the learned character-level patterns.
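The sampling step with a temperature knob reduces to the following (a pure-Python sketch of a single draw; the real script runs this in batched tensor form):

```python
import math
import random

def sample_from_logits(logits: list[float], temperature: float = 0.8) -> int:
    """Draw one token ID from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

random.seed(0)
logits = [2.0, 1.0, 0.1]                         # toy 3-character vocabulary
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_from_logits(logits)] += 1
print(counts)  # the highest-logit character dominates
```

Lower temperatures sharpen the distribution toward the most likely next character; higher temperatures flatten it, trading coherence for variety.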
### 🚀 Basic Usage

```bash
# Generate samples from a trained model
python -m microgpt.sample
```
## Acknowledgments
- NanoGPT - Original codebase
- Hugging Face Transformers - Implementation reference
- microBERT - Design philosophy inspiration for lightweight architecture
## Contact
For questions or suggestions, please:
- Submit an Issue
- Email: wufuheng@gmail.com
microGPT - Making AI lighter, deployment simpler 🚀