🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨
AutoTikTokenizer
A great way to combine the speed and lightweight design of OpenAI's TikToken with the universal model support of HuggingFace's Tokenizers. Now you can run ANY tokenizer at 3-6x the speed out of the box!
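The speedup is easy to check on your own hardware with a small timing harness. The sketch below is an illustration, not the project's official benchmark: `tokens_per_sec` is a hypothetical helper, and the commented usage assumes both libraries are installed and that each tokenizer exposes an `encode(text)` method returning a list of token ids.

```python
import time

def tokens_per_sec(encode, text, repeats=100):
    """Time any encode callable and return its throughput in tokens/sec.

    `encode` is assumed to take a string and return a list of token ids.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        tokens = encode(text)
    elapsed = time.perf_counter() - start
    return (len(tokens) * repeats) / elapsed

# Hypothetical comparison (uncomment with both libraries installed):
# from transformers import AutoTokenizer
# from autotiktokenizer import AutoTikTokenizer
# text = "The quick brown fox jumps over the lazy dog." * 100
# hf_tok = AutoTokenizer.from_pretrained("gpt2")      # HuggingFace baseline
# tt_tok = AutoTikTokenizer.from_pretrained("gpt2")   # TikToken-backed
# print(tokens_per_sec(hf_tok.encode, text))
# print(tokens_per_sec(tt_tok.encode, text))
```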
Quick Install and Use
Install autotiktokenizer from PyPI with:
pip install autotiktokenizer
Then use it in a few easy steps:
# step 1: Import the library
from autotiktokenizer import AutoTikTokenizer
# step 2: Load the tokenizer
tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
# step 3: Enjoy the inference speed 🏎️
text = "Wow! I never thought I'd be able to use Llama on TikToken"
encodings = tokenizer.encode(text)
# (Optional) step 4: Decode the outputs
text = tokenizer.decode(encodings)
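A quick sanity check for the steps above: decoding the encoded ids should round-trip back to the original string. The snippet below illustrates that contract with a trivial stand-in tokenizer (purely hypothetical, so it runs without any model download); with AutoTikTokenizer the same assertions should hold for real checkpoints.

```python
class ToyTokenizer:
    """Minimal stand-in with the same encode/decode contract."""
    def encode(self, text):
        return [ord(c) for c in text]       # token ids: Unicode code points
    def decode(self, ids):
        return "".join(chr(i) for i in ids)

tok = ToyTokenizer()
text = "Wow! I never thought I'd be able to use Llama on TikToken"
ids = tok.encode(text)
assert isinstance(ids, list) and all(isinstance(i, int) for i in ids)
assert tok.decode(ids) == text              # lossless round-trip
```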
Supported Models
AutoTikTokenizer currently supports the following models (and their variants) out of the box, with support for more models to be tested and added soon!
- GPT2
- Llama 3 Family: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, etc.
- SmolLM Family: SmolLM2-135M, SmolLM2-350M, SmolLM2-1.5B, etc.
- GPT-J Family
- Gemma2 Family: Gemma2-2b-it, Gemma2-9b-it, etc.
- Deepseek Family: Deepseek-v2.5, etc.
- Mistral Family: Mistral-7B-Instruct-v0.3
Acknowledgements
Special thanks to HuggingFace and OpenAI for their respective open-source libraries, which make this work possible. I hope they continue to support the developer ecosystem for LLMs in the future!
If you found this repository useful, I would appreciate it if you could star it and share it on socials so a greater audience can benefit from it. Thank you so much! :)
Citation
If you use autotiktokenizer in your research, please cite it as follows:
@misc{autotiktokenizer,
  author = {Bhavnick Minhas},
  title = {AutoTikTokenizer},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bhavnicksm/autotiktokenizer}},
}