
🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨

AutoTikTokenizer


AutoTikTokenizer combines the speed and small footprint of OpenAI's TikToken with the broad model coverage of HuggingFace's Tokenizers. You can now run any supported tokenizer at 3-6x the speed out of the box!


Installation

Install autotiktokenizer from PyPI via the following command:

pip install autotiktokenizer

You can also install it from source with the following command:

pip install git+https://github.com/bhavnicksm/autotiktokenizer
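To confirm that either install command worked, you can ask Python for the installed version. This is a minimal sketch using only the standard library; `installed_version` is a helper name made up for this example and is not part of autotiktokenizer.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("autotiktokenizer"))
```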

Examples

This section provides a basic usage example of the project. Follow these simple steps to get started quickly.

# step 1: Import the library
from autotiktokenizer import AutoTikTokenizer

# step 2: Load the tokenizer
tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# step 3: Enjoy the inference speed 🏎️
text = "Wow! I never thought I'd be able to use Llama on TikToken"
encodings = tokenizer.encode(text)

# (Optional) step 4: Decode the outputs
text = tokenizer.decode(encodings)
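To check the speed on your own data, a small harness like the one below can time any object that exposes `encode` and `decode`, as the tokenizer in the steps above does. This is a sketch under assumptions: `roundtrip_ok`, `time_encode` and the `WhitespaceTokenizer` stand-in are hypothetical names invented here so the block runs without downloading a model. Swap in the real tokenizer objects to compare them.

```python
import time


class WhitespaceTokenizer:
    """Hypothetical stand-in so this sketch runs offline; not part of autotiktokenizer."""

    def __init__(self):
        self._ids = {}     # word -> id
        self._words = []   # id -> word

    def encode(self, text):
        ids = []
        for word in text.split(" "):
            if word not in self._ids:
                self._ids[word] = len(self._words)
                self._words.append(word)
            ids.append(self._ids[word])
        return ids

    def decode(self, ids):
        return " ".join(self._words[i] for i in ids)


def roundtrip_ok(tokenizer, text):
    """Return True if decode(encode(text)) reproduces the input text."""
    return tokenizer.decode(tokenizer.encode(text)) == text


def time_encode(tokenizer, texts, repeats=5):
    """Return the best wall-clock time in seconds to encode all texts once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenizer.encode(text)
        best = min(best, time.perf_counter() - start)
    return best


tokenizer = WhitespaceTokenizer()  # replace with AutoTikTokenizer.from_pretrained(...)
sample = "Wow! I never thought I'd be able to use Llama on TikToken"
print(roundtrip_ok(tokenizer, sample))
print(time_encode(tokenizer, [sample] * 1000))
```

Taking the best of several repeats reduces noise from other processes. Note that a real tokenizer may not round-trip exactly if it normalizes text (for example by lowercasing) or adds special tokens, so treat `roundtrip_ok` as a sanity check rather than a strict requirement.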

Supported Models

AutoTikTokenizer currently supports the following models (and their variants) out of the box, with support for other models to be tested and added soon!

  • GPT2
  • GPT-J Family
  • SmolLM Family: Smollm2-135M, Smollm2-350M, Smollm2-1.5B etc.
  • Llama 3 Family: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct etc.
  • Deepseek Family: Deepseek-v2.5 etc.
  • Gemma2 Family: Gemma2-2b-it, Gemma2-9b-it etc.
  • Mistral Family: Mistral-7B-Instruct-v0.3 etc.
  • BERT Family: BERT, RoBERTa, MiniLM, TinyBERT, DeBERTa etc.

NOTE: Some models use unigram tokenizers, which TikToken does not support, so 🧰 AutoTikTokenizer cannot convert the tokenizers for such models. Models that use unigram tokenizers include T5, ALBERT, Marian and XLNet.
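One way to tell in advance whether a model falls under the note above is to look at its tokenizer definition. The sketch below assumes the model ships a HuggingFace `tokenizer.json` whose `model` section carries a `type` field such as `"BPE"` or `"Unigram"`; if the file or the field is missing, this check does not apply. The helper names `model_type` and `is_unigram` are made up for this example and are not part of autotiktokenizer.

```python
import json


def model_type(tokenizer_json):
    """Return the 'type' of the 'model' section, or None if it is absent."""
    model = tokenizer_json.get("model") or {}
    return model.get("type")


def is_unigram(tokenizer_json):
    """Return True if the parsed tokenizer.json declares a Unigram model."""
    return model_type(tokenizer_json) == "Unigram"


# Minimal hand-written fragments for illustration, not real model files.
bpe_like = json.loads('{"model": {"type": "BPE", "vocab": {}, "merges": []}}')
unigram_like = json.loads('{"model": {"type": "Unigram", "vocab": []}}')

print(is_unigram(bpe_like))      # False
print(is_unigram(unigram_like))  # True
```

For a real model, load the downloaded file with `json.load` and pass the resulting dictionary to `is_unigram`.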

Acknowledgement

Special thanks to HuggingFace and OpenAI for the open-source libraries that make this work possible. I hope they will continue to support the developer ecosystem for LLMs in the future!

If you found this repository useful, give it a ⭐️! Thank You :)

Citation

If you use autotiktokenizer in your research, please cite it as follows:

@misc{autotiktokenizer,
    author = {Bhavnick Minhas},
    title = {AutoTikTokenizer},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/bhavnicksm/autotiktokenizer}},
}
