A small-scale transformer-based language model implemented from scratch in Python.

ScratchGPT

ScratchGPT is a Python project that implements a small-scale transformer-based language model from scratch. It is designed for educational purposes, allowing developers to explore the internals of a transformer model without the complexity of large-scale frameworks. The project provides functionality for training the model on custom datasets and generating text from a prompt.

Why?

We want to make it easy to experiment with any sequence-to-sequence problem. This package is simple to understand and simple to use. Show us your projects built with ScratchGPT.

Features

  • Custom transformer architecture implementation
  • Training on user-provided text data
  • Text generation using the trained model
  • Command-line interfaces for training and inference

Key Features

  • Custom Transformer Architecture: A from-the-ground-up implementation of a decoder-only transformer, including Multi-Head Self-Attention, Feed-Forward layers, and Layer Normalization.
  • Flexible Tokenization: Includes a simple character-level tokenizer and a wrapper for using any tokenizer from the Hugging Face Hub.
  • Configurable Training: Easily configure model architecture (e.g., embedding_size, num_heads) and training parameters (e.g., learning_rate, batch_size) via a scratch_gpt.yaml file.
  • Command-Line Interfaces: Comes with user-friendly CLIs for both training the model and performing inference.
  • Pre-tokenization Caching: Caches tokenized datasets to disk for significantly faster startup on subsequent training runs.
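To make the attention bullet above concrete, here is a minimal sketch of causal scaled dot-product attention for a single head, written in plain Python with no dependencies. It is an illustration only, not ScratchGPT's actual code: the function names `softmax` and `causal_self_attention` are assumptions, and the learned query/key/value projections are left out. Multi-head attention runs several such heads in parallel on separate projections of the input and concatenates their outputs.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(
    queries: list[list[float]],
    keys: list[list[float]],
    values: list[list[float]],
) -> list[list[float]]:
    """Single-head scaled dot-product attention with a causal mask.

    Each argument holds one vector per position. Position i may only
    attend to positions 0..i, which is what makes the model decoder-only.
    """
    head_size = len(queries[0])
    scale = 1.0 / math.sqrt(head_size)
    outputs = []
    for i, query in enumerate(queries):
        # Causal mask: only keys at positions <= i are visible.
        scores = [
            scale * sum(q * k for q, k in zip(query, keys[j]))
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        mixed = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs
```

Because position i never reads keys beyond i, changing a later token cannot alter an earlier output. That property is what allows training on next-token prediction.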

Requirements

  • Python 3.12+
  • uv for dependency management

Installation

  1. Clone the repository:

    git clone https://github.com/LabStrangeLoop/scratchgpt.git
    cd scratchgpt
    
  2. Install dependencies using uv:

    uv sync --all-groups
    
  3. Alternatively, skip the steps above and install the released package from PyPI with pip:

    pip install scratchgpt
    

Full Usage Examples

Please take a look at the simple example in the examples folder.

Usage

Training

To train the model on your custom dataset, run the train command. This will create an experiment folder containing the model weights, tokenizer files, and configuration.

uv run train -d <path_to_training_data> -e <experiment_folder>
  • -d, --data_source: Path to the training data file or folder
  • -e, --experiment: Path to the folder where experiment checkpoints will be saved
  • -t, --tokenizer: (Optional) The Hugging Face Hub tokenizer to use (default: "gpt2")
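The feature list mentions that tokenized datasets are cached to disk so that later training runs start faster. One way such a cache can work is sketched below. This is an assumption about the general technique, not a description of ScratchGPT's implementation: the function name `cached_tokenize`, the JSON file format, and the content-hash cache key are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_tokenize(
    text_path: Path,
    cache_dir: Path,
    tokenize: Callable[[str], list[int]],
) -> list[int]:
    """Return token ids for text_path, reusing a cache file on disk when present.

    The cache key is a hash of the file contents, so editing the data
    produces a new cache entry instead of returning stale tokens.
    """
    raw = text_path.read_bytes()
    key = hashlib.sha256(raw).hexdigest()
    cache_file = cache_dir / f"{key}.json"

    if cache_file.exists():
        # Cache hit: skip tokenization entirely.
        return json.loads(cache_file.read_text())

    token_ids = tokenize(raw.decode("utf-8"))
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(token_ids))
    return token_ids
```

A real cache would likely also fold the tokenizer's identity into the key, since the same text yields different ids under a different tokenizer.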

Inference

To generate text using a trained model, use the infer command:

uv run infer -e <experiment_folder> [-dv <device>] [-m <max_tokens>]
  • -e, --experiment: Path to the folder containing the trained model
  • -dv, --device: Device to run the model on (default: "cuda")
  • -m, --max_tokens: Maximum number of tokens to generate (default: 512)
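Text generation in a decoder-only model is autoregressive: the model predicts a distribution over the next token, one token is sampled and appended, and the loop repeats. The sketch below shows that loop with a `max_tokens` limit like the one in the option list. `generate` and `toy_model` are hypothetical names, and the toy model is only a stand-in for a trained network; the actual infer command may sample differently.

```python
import random
from typing import Callable


def generate(
    next_token_probs: Callable[[list[int]], list[float]],
    prompt_ids: list[int],
    max_tokens: int,
    rng: random.Random,
) -> list[int]:
    """Autoregressively extend prompt_ids by max_tokens sampled tokens."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        probs = next_token_probs(ids)
        # Sample one token id in proportion to its predicted probability.
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ids.append(next_id)
    return ids


def toy_model(ids: list[int]) -> list[float]:
    """Hypothetical stand-in for a trained model over a 3-token vocabulary.

    It always predicts the id that follows the last one, modulo 3.
    """
    probs = [0.0, 0.0, 0.0]
    probs[(ids[-1] + 1) % 3] = 1.0
    return probs
```

With this toy model, `generate(toy_model, [0], 4, random.Random(0))` returns `[0, 1, 2, 0, 1]`, because each step puts all probability on a single token.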

Tokenization

You can write your own tokenizer, or use the provided wrapper to load any tokenizer from the Hugging Face Hub.
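As a rough idea of what a character-level tokenizer involves, here is a minimal sketch. The class name `CharTokenizer` and its methods are assumptions made for illustration and do not reflect the interface ScratchGPT expects from a custom tokenizer; check the package source for the real base class.

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer: one integer id per distinct character."""

    def __init__(self, corpus: str) -> None:
        # Sort so the id assignment is deterministic across runs.
        self.vocab = sorted(set(corpus))
        self.char_to_id = {ch: i for i, ch in enumerate(self.vocab)}

    @property
    def vocab_size(self) -> int:
        return len(self.vocab)

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for characters that were not in the corpus.
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.vocab[i] for i in ids)
```

Encoding and then decoding returns the original string, as long as every character appeared in the corpus used to build the vocabulary.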

Project Structure

The repository is organized to separate concerns, making it easy to navigate.

  • scratchgpt/train.py: Main training script.
  • scratchgpt/infer.py: Inference script for text generation.
  • scratchgpt/config.py: Contains all Pydantic configuration models.
  • scratchgpt/model/model.py: The core Transformer model implementation.
  • scratchgpt/training/trainer.py: Orchestrates the training and validation loops.
  • scratchgpt/tokenizer/: Tokenizer implementations, including wrappers for Hugging Face.
  • scratchgpt/model_io.py: Utilities for saving and loading models and tokenizers.
  • tests/: Unit tests for the project.
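The configuration models in scratchgpt/config.py are Pydantic classes. To show the general shape of validated settings without a third-party dependency, the sketch below uses standard-library dataclasses as a stand-in. Only the field names `embedding_size`, `num_heads`, `learning_rate`, and `batch_size` come from this README; the class names, default values, and validation rules are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchitectureConfig:
    """Hypothetical model settings; the default values are placeholders."""

    embedding_size: int = 64
    num_heads: int = 4

    def __post_init__(self) -> None:
        # Multi-head attention splits the embedding evenly across heads.
        if self.embedding_size % self.num_heads != 0:
            raise ValueError("embedding_size must be divisible by num_heads")

    @property
    def head_size(self) -> int:
        return self.embedding_size // self.num_heads


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical training settings; the default values are placeholders."""

    learning_rate: float = 3e-4
    batch_size: int = 32

    def __post_init__(self) -> None:
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
```

Validating at construction time means a bad value fails immediately when the configuration is loaded, rather than partway through a training run.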

Development

This project uses various development tools:

  • mypy for static type checking
  • ruff for linting and formatting
  • pytest for testing

Run the following commands to ensure code quality:

uv run ruff check --fix .
uv run mypy scratchgpt
uv run pytest ./tests/

Future Roadmap

  • Apply SOTA optimizations

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License

Authors

  • Aleksandr Yeganov
  • Dario Cazzani
