🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨

Project description

AutoTikTokenizer

A great way to combine the speed and light weight of OpenAI's TikToken with the universal model support of HuggingFace's Tokenizers. Now you can run ANY tokenizer at 3-6x the speed out of the box!

Quick Install and Use

Install autotiktokenizer from PyPI via the following command:

pip install autotiktokenizer

Then use it in a few easy steps:

# step 1: Import the library
from autotiktokenizer import AutoTikTokenizer

# step 2: Load the tokenizer
tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# step 3: Enjoy the inference speed 🏎️
text = "Wow! I never thought I'd be able to use Llama on TikToken"
encodings = tokenizer.encode(text)

# (Optional) step 4: Decode the outputs
text = tokenizer.decode(encodings)

Supported Models

AutoTikTokenizer currently supports the following models (and their variants) out of the box, with support for more models to be tested and added soon!

  • GPT-2
  • Llama 3 Family: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, etc.
  • SmolLM Family: SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B, etc.
  • GPT-J Family
  • Gemma 2 Family: Gemma-2-2b-it, Gemma-2-9b-it, etc.
  • DeepSeek Family: DeepSeek-V2.5, etc.
  • Mistral Family: Mistral-7B-Instruct-v0.3

Acknowledgements

Special thanks to HuggingFace and OpenAI for the open-source libraries that make this work possible. I hope they continue to support the developer ecosystem for LLMs in the future!

If you found this repository useful, I would appreciate it if you could star it and share it on social media so a greater audience can benefit from it. Thank you so much! :)

Citation

If you use autotiktokenizer in your research, please cite it as follows:

@misc{autotiktokenizer,
    author = {Bhavnick Minhas},
    title = {AutoTikTokenizer},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/bhavnicksm/autotiktokenizer}},
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

autotiktokenizer-0.1.1.tar.gz (8.7 kB)

Uploaded Source

Built Distribution

autotiktokenizer-0.1.1-py3-none-any.whl (6.0 kB)

Uploaded Python 3

File details

Details for the file autotiktokenizer-0.1.1.tar.gz.

File metadata

  • Download URL: autotiktokenizer-0.1.1.tar.gz
  • Upload date:
  • Size: 8.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for autotiktokenizer-0.1.1.tar.gz
  • SHA256: 6929013560cf754e18835e746aa674fd1e02ed24e6aac927c7085be5785f42ca
  • MD5: c9071b81c350a9b2a7014ad087a066f9
  • BLAKE2b-256: 4d3f66ee4d81a0d572c78e996fc3adc5bd7661966640f12c1efb6d07681c805f

See more details on using hashes here.
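If you download the sdist manually, you can verify it against the published SHA256 digest above using only Python's standard library. A minimal sketch; the `sha256_of` helper is illustrative, not part of any tool mentioned here:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 16) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the digest published above, e.g.:
# sha256_of("autotiktokenizer-0.1.1.tar.gz") == "6929013560cf754e18835e746aa674fd1e02ed24e6aac927c7085be5785f42ca"
```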

Provenance

File details

Details for the file autotiktokenizer-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for autotiktokenizer-0.1.1-py3-none-any.whl
  • SHA256: eb2d2128be3717cada40161d649a4789c8d976192b04bf382d06f0a71ef8b38e
  • MD5: 73017357a662d27060c4f77952c0f616
  • BLAKE2b-256: c7bb0ba82793d816854b15565d6a93806c0743c697e8afbf79e356b6da817bed

Provenance
