🧰 The AutoTokenizer that TikToken always needed -- Load any tokenizer with TikToken now! ✨

Project description

AutoTikTokenizer

A great way to combine the speed and lightweight footprint of OpenAI's TikToken with the universal model support of HuggingFace's Tokenizers. Now you can run ANY tokenizer at 3-6x the speed, out of the box!
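
The speedup is easy to sanity-check on your own machine once the package is installed (see the quick-start below). Here is a rough, illustrative timing sketch: the GPT-2 tokenizer, the loop count, and the sample text are all assumptions, and real numbers will vary with hardware and input length.

# Minimal, unscientific benchmark sketch (assumes `transformers` and
# `autotiktokenizer` are both installed; GPT-2 is used because it is ungated).
import time

from autotiktokenizer import AutoTikTokenizer
from transformers import AutoTokenizer

text = "The quick brown fox jumps over the lazy dog. " * 100

hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tik_tokenizer = AutoTikTokenizer.from_pretrained("gpt2")

start = time.perf_counter()
for _ in range(1_000):
    hf_tokenizer.encode(text)
hf_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(1_000):
    tik_tokenizer.encode(text)
tik_time = time.perf_counter() - start

print(f"HuggingFace: {hf_time:.2f}s | TikToken: {tik_time:.2f}s | "
      f"speedup: {hf_time / tik_time:.1f}x")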

Quick Install and Use

Install autotiktokenizer from PyPI via the following command:

pip install autotiktokenizer

And just run it in a couple of easy steps:

# step 1: Import the library
from autotiktokenizer import AutoTikTokenizer

# step 2: Load the tokenizer
tokenizer = AutoTikTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# step 3: Enjoy the inference speed 🏎️
text = "Wow! I never thought I'd be able to use Llama on TikToken"
encodings = tokenizer.encode(text)

# (Optional) step 4: Decode the outputs
text = tokenizer.decode(encodings)
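
If from_pretrained returns a standard TikToken Encoding object, as the encode/decode calls above suggest, TikToken's built-in batch helpers should also work. A small, hedged sketch continuing from the snippet above:

# (Optional) step 5: Batch encode/decode -- hedged sketch, assuming
# `tokenizer` is a standard tiktoken Encoding; `texts` is illustrative.
texts = [
    "Wow! I never thought I'd be able to use Llama on TikToken",
    "Batch calls amortise per-call overhead across many inputs.",
]
batch_encodings = tokenizer.encode_batch(texts)          # list of token lists
decoded_texts = tokenizer.decode_batch(batch_encodings)  # list of strings
print(decoded_texts)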

Supported Models

AutoTikTokenizer currently supports the following models (and their variants) out of the box, with support for more models to be tested and added soon! A short loading sketch follows the list.

  • GPT2
  • Llama 3 Family: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, etc.
  • SmolLM Family: SmolLM2-135M, SmolLM2-360M, SmolLM2-1.7B, etc.
  • GPT-J Family
  • Gemma 2 Family: Gemma-2-2b-it, Gemma-2-9b-it, etc.
  • DeepSeek Family: DeepSeek-V2.5, etc.
  • Mistral Family: Mistral-7B-Instruct-v0.3
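
As a hedged loading sketch for a couple of the families above (the repo IDs are assumptions; gated families such as Llama and Gemma additionally require accepting the model licence on the Hugging Face Hub and authenticating, e.g. via huggingface-cli login):

# Hedged sketch: load a few supported tokenizers and inspect them.
from autotiktokenizer import AutoTikTokenizer

# Ungated example repos; gated families (Llama, Gemma) need Hub auth first.
for repo_id in ["gpt2", "HuggingFaceTB/SmolLM2-135M"]:
    encoding = AutoTikTokenizer.from_pretrained(repo_id)
    # n_vocab is a standard tiktoken Encoding attribute (assumed here).
    print(f"{repo_id}: vocab size = {encoding.n_vocab}")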

Acknowledgements

Special thanks to HuggingFace and OpenAI for the open-source libraries that make this work possible. I hope they continue to support the developer ecosystem for LLMs in the future!

If you found this repository useful, I would appreciate it if you could star it and share it on socials so a wider audience can benefit. Thank you so much! :)

Citation

If you use autotiktokenizer in your research, please cite it as follows:

@misc{autotiktokenizer,
    author = {Bhavnick Minhas},
    title = {AutoTikTokenizer},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/bhavnicksm/autotiktokenizer}},
}
