🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨

Project description

AutoTikTokenizer

A great way to combine the speed and light footprint of OpenAI's TikToken with the universal model support of HuggingFace's Tokenizers. Now you can run ANY HuggingFace tokenizer at 3-6x its original speed, out of the box!

Quick Install and Use

Install autotiktokenizer from PyPI via the following command:

pip install autotiktokenizer

Then use it in a couple of easy steps:

# step 1: Import the library
from autotiktokenizer import AutoTikTokenizer

# step 2: Load the tokenizer
tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# step 3: Enjoy the inference speed 🏎️
text = "Wow! I never thought I'd be able to use Llama on TikToken"
encodings = tokenizer.encode(text)

# (Optional) step 4: Decode the outputs
text = tokenizer.decode(encodings)

Supported Models

AutoTikTokenizer currently supports the following models (and their variants) out of the box, with support for other models to be tested and added soon!

  • GPT2
  • Llama 3 Family: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, etc.
  • SmolLM Family: SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B, etc.
  • GPT-J Family
  • Gemma2 Family: Gemma2-2b-it, Gemma2-9b-it, etc.
  • DeepSeek Family: DeepSeek-V2.5, etc.
  • Mistral Family: Mistral-7B-Instruct-v0.3

Acknowledgements

Special thanks to HuggingFace and OpenAI for their open-source libraries, which make this work possible. I hope they continue to support the developer ecosystem for LLMs in the future!

If you found this repository useful, I would appreciate it if you could star it and share it on socials so a greater audience can benefit from it. Thank you so much! :)

Citation

If you use autotiktokenizer in your research, please cite it as follows:

@misc{autotiktokenizer,
    author = {Bhavnick Minhas},
    title = {AutoTikTokenizer},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/bhavnicksm/autotiktokenizer}},
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

autotiktokenizer-0.1.0.tar.gz (5.6 kB)

Uploaded Source

Built Distribution

autotiktokenizer-0.1.0-py3-none-any.whl (5.5 kB)

Uploaded Python 3

File details

Details for the file autotiktokenizer-0.1.0.tar.gz.

File metadata

  • Download URL: autotiktokenizer-0.1.0.tar.gz
  • Upload date:
  • Size: 5.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for autotiktokenizer-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7da7d954d0dbc7fe1c069c7acd38c252fca8ce6a8b9cb49a46bdd76e818a4e90
MD5 bb5ee48dd9ca2cc5705762227e348757
BLAKE2b-256 d4cd513819132aa6045f4d40ab13577553db3af8daa62c82aab1d01306e296d3

File details

Details for the file autotiktokenizer-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for autotiktokenizer-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 311e703fbef5c4046e90fb53532a888287a5e4bd27b7c3fd74c0698428802987
MD5 19af532e8832629df3d49402316a8464
BLAKE2b-256 860895787da4efa101c04558aa362e13a16947f5dc11f3d16a207f490e875949
