🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨
AutoTikTokenizer
🚀 Accelerate your HuggingFace tokenizers by converting them to TikToken format with AutoTikTokenizer - get TikToken's speed while keeping HuggingFace's flexibility.
Features • Installation • Examples • Supported Models • Benchmarks • Citation
Key Features
- 🚀 High Performance - Built on TikToken's efficient tokenization engine
- 🔄 HuggingFace Compatible - Seamless integration with the HuggingFace ecosystem
- 📦 Lightweight - Minimal dependencies, just `tiktoken` and `huggingface-hub`
- 🎯 Easy to Use - Simple, intuitive API that works out of the box
- 💻 Well Tested - Comprehensive test suite across supported models
Installation
Install `autotiktokenizer` from PyPI with the following command:

```
pip install autotiktokenizer
```
You can also install it from source with the following command:

```
pip install git+https://github.com/bhavnicksm/autotiktokenizer
```
Examples
This section provides a basic usage example of the project. Follow these simple steps to get started quickly.
```python
# Step 1: Import the library
from autotiktokenizer import AutoTikTokenizer

# Step 2: Load the tokenizer
tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Step 3: Enjoy the inference speed 🏎️
text = "Wow! I never thought I'd be able to use Llama on TikToken"
encodings = tokenizer.encode(text)

# (Optional) Step 4: Decode the outputs back into text
decoded = tokenizer.decode(encodings)
```
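The object returned by `from_pretrained` behaves like a standard TikToken `Encoding`, as the `encode`/`decode` calls above suggest. Assuming it is one, the usual TikToken attributes and methods apply directly. A minimal sketch (the printed values are illustrative):

```python
from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Standard TikToken Encoding attributes and methods
print(tokenizer.n_vocab)                 # vocabulary size
ids = tokenizer.encode("Hello, world!")  # list of token IDs
print(tokenizer.decode(ids))             # round-trips back to the text
```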
Supported Models
AutoTikTokenizer currently supports the following models (and their variants) out of the box, with support for more models to be tested and added soon!
- GPT-2
- GPT-J Family
- SmolLM Family: SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B, etc.
- Llama 3 Family: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, etc.
- DeepSeek Family: DeepSeek-V2.5, etc.
- Gemma 2 Family: Gemma-2-2b-it, Gemma-2-9b-it, etc.
- Mistral Family: Mistral-7B-Instruct-v0.3, etc.
- BERT Family: BERT, RoBERTa, MiniLM, TinyBERT, DeBERTa, etc.
NOTE: Some models use unigram tokenizers, which TikToken does not support, so 🧰 AutoTikTokenizer cannot convert them. Models that use unigram tokenizers include T5, ALBERT, Marian, and XLNet.
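Since TikToken only handles BPE vocabularies, one way to check a model before attempting a conversion is to inspect the `model.type` field in its `tokenizer.json` on the Hub. A small pre-flight sketch (it assumes the repo ships a `tokenizer.json`, which most fast-tokenizer repos do; the repo IDs are just examples, and gated repos additionally require an access token):

```python
import json
from huggingface_hub import hf_hub_download

def tokenizer_model_type(repo_id: str) -> str:
    """Return the tokenizer algorithm, e.g. 'BPE', 'Unigram', or 'WordPiece'."""
    path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json")
    with open(path, encoding="utf-8") as f:
        return json.load(f)["model"]["type"]

print(tokenizer_model_type("gpt2"))            # "BPE"     -> convertible
print(tokenizer_model_type("albert-base-v2"))  # "Unigram" -> not convertible
```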
Benchmarks
Benchmark results for tokenizing 1 billion tokens from the fineweb-edu dataset with the Llama 3.2 tokenizer on CPU (Google Colab):
| Configuration | Processing Type | AutoTikTokenizer | HuggingFace | Speed Ratio |
|---------------|-----------------|------------------|-------------|-------------|
| Single Thread | Sequential | 14:58 (898s) | 40:43 (2443s) | 2.72x faster |
| Batch x1 | Batched | 15:58 (958s) | 10:30 (630s) | 0.66x (slower) |
| Batch x4 | Batched | 8:00 (480s) | 10:30 (630s) | 1.31x faster |
| Batch x8 | Batched | 6:32 (392s) | 10:30 (630s) | 1.62x faster |
| 4 Processes | Parallel | 2:34 (154s) | 8:59 (539s) | 3.50x faster |
The table above shows that AutoTikTokenizer's underlying engine (TikToken) is 1.6-3.5x faster than HuggingFace's tokenizer under a fair comparison. While it does not yet make the most optimal use of TikToken, it is still much faster than the stock solution you would get otherwise.
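For reference, the batched and parallel rows can be approximated with TikToken's `encode_batch(..., num_threads=N)` plus standard `multiprocessing`. A rough sketch under the assumption that the loaded tokenizer is a plain TikToken `Encoding`; this is not the exact benchmark harness, and the corpus and worker counts are placeholders:

```python
from multiprocessing import Pool

from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def encode_shard(shard):
    # One single-threaded TikToken call per worker process
    return tokenizer.encode_batch(shard, num_threads=1)

if __name__ == "__main__":
    texts = ["Wow! I never thought I'd be able to use Llama on TikToken"] * 1_000

    # Batched, single process: TikToken fans the work out across threads
    batch_ids = tokenizer.encode_batch(texts, num_threads=8)  # ~ "Batch x8" row

    # Parallel, multiple processes: shard the corpus and merge the results
    shards = [texts[i::4] for i in range(4)]
    with Pool(processes=4) as pool:                           # ~ "4 Processes" row
        parallel_ids = [ids for part in pool.map(encode_shard, shards)
                        for ids in part]
```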
Acknowledgement
Special thanks to HuggingFace and OpenAI for the open-source libraries that make this work possible. I hope they continue to support the developer ecosystem for LLMs in the future!
If you found this repository useful, give it a ⭐️! Thank You :)
Citation
If you use `autotiktokenizer` in your research, please cite it as follows:
```bibtex
@misc{autotiktokenizer,
  author = {Bhavnick Minhas},
  title = {AutoTikTokenizer},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bhavnicksm/autotiktokenizer}},
}
```