A GPT-4 compatible Byte Pair Encoding (BPE) tokenizer.

SmolBPE

Overview

SmolBPE is a lightweight Byte Pair Encoding (BPE) tokenizer for deep learning applications and large language models (LLMs), modeled on the tokenizer used by GPT-4. It provides a simple interface for training a vocabulary, encoding text into token IDs, and decoding IDs back into text. Because BPE breaks unfamiliar words into known subword units, it handles out-of-vocabulary words gracefully.

Features

  • Efficient Tokenization: Implements the BPE algorithm for effective subword tokenization.
  • Customizable Vocabulary Size: Allows you to specify the desired vocabulary size according to your needs.
  • Unicode Support: Handles Unicode text, enabling multilingual tokenization.
  • Easy Integration: Designed for seamless integration with existing Python projects and NLP pipelines.
  • Command-Line Interface: Provides a CLI tool for training and using the tokenizer without writing additional code.
  • Open Source: Licensed under the MIT License, promoting openness and collaboration.
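
To make the BPE idea behind these features concrete, here is a minimal sketch of byte-level BPE training in plain Python. It is an illustration under stated assumptions, not SmolBPE's actual implementation: the helper names (`count_pairs`, `merge`, `train_bpe`) are hypothetical, new token IDs are assumed to start at 256, and the regex pre-splitting step is omitted for brevity.

```python
from collections import Counter

def count_pairs(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        new_id = 256 + step               # assumption: IDs 0..255 are the raw bytes
        ids = merge(ids, best, new_id)
        merges[best] = new_id
    return ids, merges

ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids)     # [258, 100, 258, 97, 99]
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

Each merge shortens the sequence by replacing the most frequent adjacent pair with a new token, which is how frequent character runs end up as single subword tokens.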

Installation

You can install SmolBPE using pip:

pip install smolbpe

Alternatively, you can install it directly from the source code:

git clone https://github.com/T4ras123/SmolBPE.git
cd SmolBPE
pip install .

Quick Start Guide

Using the Tokenizer in Python

1. Importing the Tokenizer

from smolbpe.gpt4Tokenizer import GPT4Tokenizer

2. Initializing the Tokenizer

tokenizer = GPT4Tokenizer()

You can also specify an output file for the saved vocabulary and a custom regex pattern:

tokenizer = GPT4Tokenizer(output='vocab.json', pattern=r"\p{L}+|\p{Z}+|\p{N}+|[\p{P}&&[^.]]")

3. Training the Tokenizer

Train the tokenizer on your dataset to build the vocabulary and merge rules:

with open("path_to_your_data", "r", encoding="utf-8") as f:
    text = f.read()

tokenizer.train(text, vocab_size=400)
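
The `vocab_size` argument is easier to reason about with a little arithmetic. The sketch below assumes a byte-level base vocabulary of 256 tokens, as is common in GPT-style BPE tokenizers. Whether SmolBPE reserves exactly 256 base IDs is an assumption here, and `merges_needed` is a hypothetical helper rather than part of the package.

```python
BASE_VOCAB_SIZE = 256  # assumption: one base token per possible byte value

def merges_needed(vocab_size, base=BASE_VOCAB_SIZE):
    """Return how many merge rules training must learn to reach `vocab_size`."""
    if vocab_size < base:
        raise ValueError(f"vocab_size must be at least {base}")
    return vocab_size - base

print(merges_needed(400))  # 144
```

Under that assumption, `vocab_size=400` leaves room for 144 learned merges on top of the raw bytes.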

4. Encoding Text

Convert text into a list of token IDs:

encoded_tokens = tokenizer.encode("Tokenizing isn't real")
print(encoded_tokens)

5. Decoding Tokens

Convert token IDs back into human-readable text:

decoded_text = tokenizer.decode(encoded_tokens)
print(decoded_text)
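
For intuition about what encoding and decoding do, the following self-contained sketch applies a small hand-written merge table to text and then reverses it. It illustrates the general byte-level BPE technique and is not SmolBPE's own code: the functions and the `merges` table are hypothetical, and regex pre-splitting is left out.

```python
def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode(text, merges):
    """Turn text into token IDs by applying merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the pair whose merge was learned earliest (lowest new ID).
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no applicable merge remains
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    """Turn token IDs back into text by expanding each ID to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Hypothetical merge table: "t"+"h" -> 256, then (256)+"e" -> 257.
merges = {(116, 104): 256, (256, 101): 257}

tokens = encode("the theme", merges)
print(tokens)                  # [257, 32, 257, 109, 101]
print(decode(tokens, merges))  # the theme
```

Decoding is lossless for any text that was encoded with the same merge table, since every token ID expands to a fixed byte sequence.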

Command-Line Interface

SmolBPE provides a command-line interface for easy tokenization tasks.

Training the Tokenizer

gpt4tokenizer --text data/taylorswift.txt --vocab_size 400 --output vocab.json

Advanced Usage

Loading a Pre-trained Vocabulary

If you have a pre-trained vocabulary and merges file, you can load them directly:

tokenizer = GPT4Tokenizer()
tokenizer.load_vocab('vocab.json')

Custom Regex Pattern

Customize the tokenization by providing a different regex pattern:

custom_pattern = r"\w+|\s+|[^\s\w]+"
tokenizer = GPT4Tokenizer(pattern=custom_pattern)
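
To see what a pattern like this does to input text, you can run it through Python's standard `re` module. This sketch only shows the splitting step and assumes, as in GPT-style tokenizers, that the pattern divides text into chunks before any merges are applied; it does not call SmolBPE itself.

```python
import re

custom_pattern = r"\w+|\s+|[^\s\w]+"

# Split text into words, whitespace runs, and punctuation runs.
chunks = re.findall(custom_pattern, "Tokenizing isn't real")
print(chunks)  # ['Tokenizing', ' ', 'isn', "'", 't', ' ', 'real']
```

Because the three alternatives together cover every character, joining the chunks reproduces the original string, so no text is lost by the split.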

Project Structure

SmolBPE/
├── smolbpe/
│   ├── __init__.py
│   └── gpt4Tokenizer.py
├── LICENSE
├── MANIFEST.in
├── README.md
└── setup.py

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository on GitHub.
  2. Create a new branch for your feature or bug fix.
  3. Commit your changes with descriptive commit messages.
  4. Push your branch to your forked repository.
  5. Open a pull request on the main repository.

Please ensure your code adheres to the project's coding standards and includes appropriate tests.

License

This project is licensed under the MIT License. You are free to use, modify, and distribute this software in accordance with the license.

Contact

For any inquiries or feedback, please contact the author:

Acknowledgments

  • Inspired by tokenization techniques used in GPT models.
  • Special thanks to the open-source community for continuous support.

Happy tokenizing with SmolBPE!
