
microGPT 🚀

Lightweight GPT implementation designed for resource-constrained environments


🎯 Overview

microGPT is a character-level implementation of GPT (Generative Pre-trained Transformer) language models. It is inspired by NanoGPT and follows the same design philosophy as microBERT: significantly reduce computational resource requirements while maintaining model performance.

🔤 Key Design Choice: Character-Level Tokenization Unlike traditional GPT models that use subword or word-level tokenization, microGPT operates at the character level - each token represents a single character (A-Z, a-z, 0-9, punctuation, spaces). This approach:

  • Simplifies the architecture and reduces vocabulary size (~65 characters vs. 50k+ subword tokens)
  • Eliminates the need for complex tokenization schemes
  • Makes the model more interpretable and easier to debug
  • Requires longer sequences but provides finer-grained text generation control

✨ Key Features

🎯 Lightweight Design

  • Model Compression: Significantly reduced parameter count through carefully designed architecture
  • Computational Optimization: Flash Attention support for improved inference efficiency
  • Memory Efficient: Optimized for resource-constrained environments
  • Character-Level Simplicity: Minimal vocabulary size (~65 characters) eliminates complex tokenization overhead

🚀 Resource Adaptation

  • Mobile-Friendly: Runs on laptops, embedded devices, and mobile platforms
  • Fast Training: Supports rapid prototyping and experimentation
  • Flexible Configuration: Adjustable model size based on hardware resources

🏗️ Architecture

Core Components

  • Transformer Blocks: Standard self-attention + MLP architecture
  • Flash Attention: Efficient attention computation for PyTorch 2.0+
  • Weight Tying: Token embedding and output layer weight sharing
  • Layer Normalization: Optional bias support
  • Character-Level Tokenization: Direct character-to-integer mapping without complex tokenization schemes
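The self-attention step at the heart of each Transformer block can be sketched as follows. This is a minimal NumPy illustration of single-head causal scaled dot-product attention, not the package's actual PyTorch code; Flash Attention computes the same result but fuses these steps into one memory-efficient kernel. The weight matrices here are random stand-ins.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal scaled dot-product attention (illustrative only)."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv           # project to queries/keys/values
    scores = (q @ k.T) / np.sqrt(d)            # similarities, scaled by sqrt(dim)
    mask = np.triu(np.ones((T, T), dtype=bool), 1)
    scores[mask] = -np.inf                     # causal mask: no attending to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                         # weighted sum of values

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
```

Because of the causal mask, position 0 can only attend to itself, so its output is exactly its own value vector.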

Default Configuration (Lightweight)

n_layer = 6      # 6 Transformer layers
n_head = 6       # 6 attention heads
n_embd = 384     # 384-dimensional embeddings
block_size = 256 # 256 token context window
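As a back-of-the-envelope sanity check, the parameter count implied by this configuration can be estimated by hand (a rough sketch only; exact numbers depend on bias settings, LayerNorm parameters, and weight tying):

```python
n_layer, n_head, n_embd, block_size, vocab_size = 6, 6, 384, 256, 65

# Per Transformer layer (ignoring biases and LayerNorm gains):
#   attention: Wq, Wk, Wv, Wo            -> 4 * n_embd^2
#   MLP: two projections with 4x expansion -> 8 * n_embd^2
per_layer = 12 * n_embd ** 2

# Embeddings: token table + learned positions; with weight tying the
# output head reuses the token embedding, so it adds nothing extra.
embeddings = vocab_size * n_embd + block_size * n_embd

total = n_layer * per_layer + embeddings
print(f"~{total / 1e6:.1f}M parameters")  # → ~10.7M parameters
```

At roughly 10.7M parameters, the default model is orders of magnitude smaller than full-scale GPT variants, which is what makes laptop-scale training feasible.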

🚀 Quick Start

Package-Based Training

microGPT is designed to work as a standalone package. After installation, you can train models from any directory without needing the source code.

📊 Dataset Preparation

microGPT comes with a built-in Shakespeare dataset for character-level language modeling. The dataset preparation script and raw text data are included in the package for easy access. The dataset preparation process:

  1. Uses the Shakespeare text included in the package
  2. Tokenizes characters into integers (vocabulary size: ~65 characters)
  3. Splits data into training (90%) and validation (10%) sets
  4. Saves processed data in ./data/shakespeare_char/ relative to your current working directory
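Steps 2 and 3 above can be sketched in a few lines of plain Python. This is a simplified illustration with a stand-in string; the actual script processes the full Shakespeare text and also writes the encoded splits and vocabulary metadata to disk.

```python
text = "To be, or not to be, that is the question."  # stand-in for the Shakespeare corpus

# Build the character vocabulary and encode the text as integer IDs
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
ids = [stoi[ch] for ch in text]

# 90/10 train/validation split
n = int(0.9 * len(ids))
train_ids, val_ids = ids[:n], ids[n:]
```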

📦 Installation

Development Installation (for contributors)

# Clone the repository first
git clone https://github.com/henrywoo/microgpt.git
cd microgpt

# Install in editable mode for development
pip install -e .

Production Installation (for users)

# Install directly from PyPI
pip install microgpt

🎓 Training

Complete Training Workflow

You can train microGPT from any directory using the installed package:

# 1. Prepare the dataset (creates ./data/shakespeare_char/)
python -m microgpt.prepare_dataset

# 2. Start training (uses the prepared dataset)
python -m microgpt.pretrain.clm_pretrain_v0

No git repo checkout required! After installation, you can run training from anywhere.

1. Prepare the Dataset

First, prepare the Shakespeare dataset for character-level language modeling:

# From any directory where you want to store the data
python -m microgpt.prepare_dataset

This script will:

  • Use the Shakespeare text included in the package
  • Tokenize characters into integers (vocabulary size: ~65 characters)
  • Split the data into training (90%) and validation (10%) sets
  • Save the processed data in ./data/shakespeare_char/ relative to your current working directory
  • Show the exact path where the data is saved for easy reference

🔤 Character-Level Tokenization Details:

  • Each character (A-Z, a-z, 0-9, punctuation, space) gets a unique integer ID
  • No complex tokenization schemes like BPE (Byte Pair Encoding) or WordPiece
  • Direct character-to-integer mapping: each character is assigned its index in the sorted vocabulary, so IDs fall in the range 0 to ~64 (they are vocabulary indices, not ASCII codes)
  • Vocabulary size is typically ~65 characters for English text
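A character-level tokenizer of this kind fits in a few lines. This is a minimal sketch of the index-based mapping; the packaged prepare script saves equivalent vocabulary metadata in meta.pkl.

```python
text = "Hello, world!"
chars = sorted(set(text))                      # vocabulary: unique characters, sorted
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer ID (vocabulary index)
itos = {i: ch for ch, i in stoi.items()}       # integer ID -> char

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

assert decode(encode(text)) == text            # lossless round trip
```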

2. Start Training

# From any directory, run the training script directly from the package
python -m microgpt.pretrain.clm_pretrain_v0

The training script will automatically:

  • Load the prepared dataset from ./data/shakespeare_char/ (relative to current directory)
  • Initialize the microGPT model with default configuration
  • Train using the specified hyperparameters
  • Save checkpoints and generate sample text
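Under the hood, causal language-model training pairs each block_size-character chunk with the same chunk shifted one position, so the model learns next-character prediction. A simplified sketch of how such (input, target) pairs are formed (the actual training script samples random offsets from the prepared dataset files):

```python
block_size = 8
ids = list(range(20))  # stand-in for the encoded Shakespeare data

def get_example(ids, i, block_size):
    """Return an (input, target) pair starting at offset i."""
    x = ids[i : i + block_size]           # what the model sees
    y = ids[i + 1 : i + 1 + block_size]   # what it must predict (shifted by one)
    return x, y

x, y = get_example(ids, 3, block_size)
# At every position t, y[t] is the token that follows x[t].
```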

🔌 API Usage

import torch
from microgpt.model import MicroGPT, MicroGPTConfig

# Create configuration
config = MicroGPTConfig(
    n_layer=6,
    n_head=6, 
    n_embd=384,
    block_size=256,
    vocab_size=65  # Must match the vocabulary size in meta.pkl (typically ~65 for character-level)
)

# Initialize model
model = MicroGPT(config)

# Generate text
# Note: For meaningful text generation, the model should be trained first
# This example shows the structure, but untrained models will generate random text
# Input should be character-level token IDs (e.g., encoded text from meta.pkl)
generated = model.generate(
    idx=torch.tensor([[1, 2, 3]]), 
    max_new_tokens=50,
    temperature=0.8
)

# Decode the generated text (converts character-level token IDs back to characters)
generated_text = MicroGPT.decode_text(generated[0])
print(f"Generated text: {generated_text}")
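The temperature parameter controls randomness in generation: logits are divided by the temperature before the softmax, so values below 1.0 sharpen the distribution (more predictable text) and values above 1.0 flatten it (more varied text). A minimal pure-Python sketch of this sampling step, independent of the package:

```python
import math
import random

def sample_next(logits, temperature=0.8):
    """Softmax over temperature-scaled logits, then sample one token ID."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token_id = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return token_id, probs

random.seed(0)
token_id, probs = sample_next([2.0, 1.0, 0.1], temperature=0.8)
```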

🎭 Sampling from Trained Models

After training a model, you can generate text samples using the sample.py script. This script loads a trained checkpoint and generates text based on your specifications. The sampling process works at the character level, generating one character at a time based on the learned character-level patterns.

🚀 Basic Usage

# Generate samples from a trained model
python -m microgpt.sample

Acknowledgments

Contact

For questions or suggestions, please:


microGPT - Making AI lighter, deployment simpler 🚀
