
🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨

Project description


AutoTikTokenizer


🚀 Accelerate your HuggingFace tokenizers by converting them to TikToken format with AutoTikTokenizer - get TikToken's speed while keeping HuggingFace's flexibility.


Key Features

  • 🚀 High Performance - Built on TikToken's efficient tokenization engine
  • 🔄 HuggingFace Compatible - Seamless integration with the HuggingFace ecosystem
  • 📦 Lightweight - Minimal dependencies, just TikToken and Huggingface-hub
  • 🎯 Easy to Use - Simple, intuitive API that works out of the box
  • 💻 Well Tested - Comprehensive test suite across supported models

Installation

Install autotiktokenizer from PyPI via the following command:

pip install autotiktokenizer

You can also install it from source with the following command:

pip install git+https://github.com/bhavnicksm/autotiktokenizer

Examples

This section provides a basic usage example of the project. Follow these simple steps to get started quickly.

# Step 1: Import the library
from autotiktokenizer import AutoTikTokenizer

# Step 2: Load the tokenizer
tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Step 3: Enjoy the inference speed 🏎️
text = "Wow! I never thought I'd be able to use Llama on TikToken"
encodings = tokenizer.encode(text)

# (Optional) Step 4: Decode the outputs back into text
text = tokenizer.decode(encodings)
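The encode/decode pair above is a lossless roundtrip. A minimal sketch of that property with a toy byte-level tokenizer (a stand-in for the real Encoding, which requires downloading the model files):

```python
# Toy byte-level tokenizer: ids are raw byte values. This mirrors the
# encode/decode roundtrip contract of a real tokenizer (stand-in only).
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

text = "Wow! I never thought I'd be able to use Llama on TikToken"
assert decode(encode(text)) == text  # lossless roundtrip
```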

Supported Models

AutoTikTokenizer currently supports the following models (and their variants) out of the box; support for additional models will be tested and added soon!

  • GPT2
  • GPT-J Family
  • SmolLM Family: Smollm2-135M, Smollm2-350M, Smollm2-1.5B, etc.
  • Llama 3 Family: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, etc.
  • Deepseek Family: Deepseek-v2.5, etc.
  • Gemma2 Family: Gemma2-2b-It, Gemma2-9b-it, etc.
  • Mistral Family: Mistral-7B-Instruct-v0.3, etc.
  • BERT Family: BERT, RoBERTa, MiniLM, TinyBERT, DeBERTa, etc.

NOTE: Some models use unigram tokenizers, which TikToken does not support, so 🧰 AutoTikTokenizer cannot convert the tokenizers for such models. Models that use unigram tokenizers include T5, ALBERT, Marian and XLNet.
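To check up front whether a given tokenizer is affected, one can inspect the `model.type` field of its serialized `tokenizer.json` (HuggingFace tokenizers record the model family there). A sketch using inline examples in place of a downloaded file, which would otherwise be fetched with `huggingface_hub`:

```python
import json

def is_convertible(tokenizer_json: str) -> bool:
    """Return False for Unigram tokenizers, which TikToken cannot represent.
    Other model types (e.g. BPE, WordPiece) may still be convertible."""
    model_type = json.loads(tokenizer_json)["model"]["type"]
    return model_type != "Unigram"

# Inline stand-ins for real tokenizer.json contents (assumption: only the
# "model.type" field matters for this check).
bpe_cfg = json.dumps({"model": {"type": "BPE", "vocab": {}, "merges": []}})
unigram_cfg = json.dumps({"model": {"type": "Unigram", "vocab": []}})

print(is_convertible(bpe_cfg))      # True
print(is_convertible(unigram_cfg))  # False
```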

Benchmarks

Benchmarking results for tokenizing 1 billion tokens from the fineweb-edu dataset with the Llama 3.2 tokenizer on CPU (Google Colab):

| Configuration | Processing Type | AutoTikTokenizer | HuggingFace | Speed Ratio |
|---------------|-----------------|------------------|-------------|-------------|
| Single Thread | Sequential      | 14:58 (898s)     | 40:43 (2443s) | 2.72x faster |
| Batch x1      | Batched         | 15:58 (958s)     | 10:30 (630s)  | 0.66x slower |
| Batch x4      | Batched         | 8:00 (480s)      | 10:30 (630s)  | 1.31x faster |
| Batch x8      | Batched         | 6:32 (392s)      | 10:30 (630s)  | 1.62x faster |
| 4 Processes   | Parallel        | 2:34 (154s)      | 8:59 (539s)   | 3.50x faster |

The table above shows that, under a fair comparison, AutoTikTokenizer's underlying TikToken engine is 1.3-3.5x faster than HuggingFace's tokenizer in every configuration except single-threaded batching. While it does not yet make the most optimal use of TikToken, it is still considerably faster than the stock solutions you might be getting otherwise.
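The parallel configuration above shards the corpus across workers and encodes each shard independently. The pattern looks roughly like this (a sketch with a toy whitespace tokenizer standing in for the real Encoding, and threads standing in for processes so the snippet stays self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

def encode(text: str) -> list[int]:
    # Stand-in tokenizer: one fake id (the word length) per whitespace token.
    return [len(word) for word in text.split()]

def encode_corpus(texts: list[str], workers: int = 4) -> list[list[int]]:
    # Each worker encodes one document at a time; map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode, texts))

docs = ["hello world", "tokenize me quickly please"]
print(encode_corpus(docs))  # [[5, 5], [8, 2, 7, 6]]
```

With the real tokenizer, `encode` would be replaced by the loaded Encoding's `encode` method, and true multi-process sharding would be used for CPU-bound workloads.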

Acknowledgement

Special thanks to HuggingFace and OpenAI for the open-source libraries that make this work possible. I hope they continue to support the developer ecosystem for LLMs in the future!

If you found this repository useful, give it a ⭐️! Thank You :)

Citation

If you use autotiktokenizer in your research, please cite it as follows:

@misc{autotiktokenizer,
    author = {Bhavnick Minhas},
    title = {AutoTikTokenizer},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/bhavnicksm/autotiktokenizer}},
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

autotiktokenizer-0.2.0.tar.gz (12.0 kB view details)

Uploaded Source

Built Distribution

autotiktokenizer-0.2.0-py3-none-any.whl (8.2 kB view details)

Uploaded Python 3

File details

Details for the file autotiktokenizer-0.2.0.tar.gz.

File metadata

  • Download URL: autotiktokenizer-0.2.0.tar.gz
  • Upload date:
  • Size: 12.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for autotiktokenizer-0.2.0.tar.gz

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | 85138daf8025fcf5b1b260cad66b055aad89f79cfd876330bdf8b1b64ca4f1f2 |
| MD5 | a016de744502d4504b42506d3c4dd90b |
| BLAKE2b-256 | 0404416514d9abb11b137a86b747b71f30cf4542f4d838937708b592854871cc |
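To verify a downloaded file against a published digest, hash it locally; a minimal sketch using Python's standard `hashlib` (the filename and expected digest are those of the sdist above):

```python
import hashlib

EXPECTED_SHA256 = "85138daf8025fcf5b1b260cad66b055aad89f79cfd876330bdf8b1b64ca4f1f2"

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Example check (assumes the sdist was downloaded to the current directory):
# assert sha256_of("autotiktokenizer-0.2.0.tar.gz") == EXPECTED_SHA256
```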



File details

Details for the file autotiktokenizer-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for autotiktokenizer-0.2.0-py3-none-any.whl

| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | acb286e0b9878ce49c4bea1c523d6563d01175be83d862e66f3fd58e98acd026 |
| MD5 | a4d19c0aa921c8ccd1499b5cf72866cc |
| BLAKE2b-256 | 7c00814ed7ac4365ff12fd071f66cee3f0e95bb13ece0a44eb67b0c175074f7b |


