ToCount: Lightweight Token Estimator
Overview
ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.
Installation
PyPI
- Check Python Packaging User Guide
- Run `pip install tocount==0.5`
Source code
- Download Version 0.5 or Latest Source
- Run `pip install .`
Models
Rule-Based
| Model Name | R² | MAE | RMSE | MedAE | D² |
|---|---|---|---|---|---|
| RULE_BASED.UNIVERSAL | 0.8175 | 106.70 | 617.78 | 18 | 0.6377 |
| RULE_BASED.GPT_3_5 | 0.7266 | 152.34 | 756.17 | 35 | 0.4828 |
| RULE_BASED.GPT_4 | 0.6878 | 161.93 | 808.04 | 40 | 0.4502 |
Tiktoken R50K
| Model Name | R² | MAE | RMSE | MedAE | D² |
|---|---|---|---|---|---|
| TIKTOKEN_R50K.LINEAR_ALL | 0.7334 | 152.39 | 733.40 | 28.55 | 0.4826 |
| TIKTOKEN_R50K.LINEAR_ENGLISH | 0.8703 | 62.76 | 508.20 | 8.87 | 0.7287 |
Tiktoken CL100K
| Model Name | R² | MAE | RMSE | MedAE | D² |
|---|---|---|---|---|---|
| TIKTOKEN_CL100K.LINEAR_ALL | 0.9127 | 64.09 | 298.02 | 15.73 | 0.6804 |
| TIKTOKEN_CL100K.LINEAR_ENGLISH | 0.9711 | 27.43 | 185.07 | 6.34 | 0.8527 |
Tiktoken O200K
| Model Name | R² | MAE | RMSE | MedAE | D² |
|---|---|---|---|---|---|
| TIKTOKEN_O200K.LINEAR_ALL | 0.9563 | 38.23 | 197.16 | 9.70 | 0.7818 |
| TIKTOKEN_O200K.LINEAR_ENGLISH | 0.9730 | 26.00 | 177.54 | 5.96 | 0.8581 |
Deepseek R1
| Model Name | R² | MAE | RMSE | MedAE | D² |
|---|---|---|---|---|---|
| DEEPSEEK_R1.LINEAR_ALL | 0.9531 | 40.66 | 212.11 | 10.71 | 0.7741 |
| DEEPSEEK_R1.LINEAR_ENGLISH | 0.9696 | 28.44 | 192.36 | 6.36 | 0.8477 |
Qwen QwQ
| Model Name | R² | MAE | RMSE | MedAE | D² |
|---|---|---|---|---|---|
| QWEN_QWQ.LINEAR_ALL | 0.9342 | 45.50 | 257.97 | 12.17 | 0.7542 |
| QWEN_QWQ.LINEAR_ENGLISH | 0.9570 | 29.06 | 236.10 | 6.68 | 0.8457 |
Llama 3.1
| Model Name | R² | MAE | RMSE | MedAE | D² |
|---|---|---|---|---|---|
| LLAMA_3_1.LINEAR_ALL | 0.9538 | 44.37 | 207.58 | 11.70 | 0.7578 |
| LLAMA_3_1.LINEAR_ENGLISH | 0.9731 | 26.59 | 177.94 | 6.24 | 0.8564 |
ℹ️ The training and testing data are taken from Lmsys-chat-1m [1] and Wildchat [2].
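For readers unfamiliar with the column headers in the tables above, here is a minimal sketch of how such regression metrics can be computed from true and estimated token counts. It is not the project's evaluation code. The function name `regression_metrics` is hypothetical, and the sketch assumes that D² refers to the absolute-error D² score (the fraction of mean absolute error explained relative to a constant median predictor), which the source does not confirm.

```python
import math
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Compute R², MAE, RMSE, MedAE and D² for paired true/estimated counts.

    Assumption: D² is the absolute-error variant, i.e. 1 - MAE / MAE of a
    baseline that always predicts the median of y_true.
    """
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]

    mae = mean(abs_errors)                         # mean absolute error
    rmse = math.sqrt(mean(e * e for e in errors))  # root mean squared error
    medae = median(abs_errors)                     # median absolute error

    # R²: 1 - residual sum of squares / total sum of squares around the mean
    y_mean = mean(y_true)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    # D² (assumed absolute-error variant): compare MAE to a median baseline
    y_median = median(y_true)
    baseline_mae = mean(abs(t - y_median) for t in y_true)
    d2 = 1 - mae / baseline_mae

    return {"R2": r2, "MAE": mae, "RMSE": rmse, "MedAE": medae, "D2": d2}


# Toy data for illustration only; these are not values from the benchmark.
true_counts = [10, 20, 30, 40]
estimated_counts = [12, 18, 33, 39]
metrics = regression_metrics(true_counts, estimated_counts)
```

Higher R² and D² and lower MAE, RMSE, and MedAE indicate a closer match between estimated and actual token counts. A large gap between MAE and MedAE, as in several rows above, suggests that a small number of inputs account for much of the error.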
Usage
>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
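As an illustration of the token budgeting use case mentioned in the overview, the following sketch keeps leading text chunks while their estimated token total stays within a budget. The helper `fit_to_budget` and the stand-in estimator `rough_estimate` are hypothetical and are not part of ToCount; the stand-in exists only so the example runs without the library installed.

```python
from typing import Callable, List


def fit_to_budget(chunks: List[str], budget: int,
                  estimate: Callable[[str], int]) -> List[str]:
    """Keep leading chunks while the running estimated total fits the budget."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = estimate(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += cost
    return kept


def rough_estimate(text: str) -> int:
    """Hypothetical stand-in estimator (about 4 characters per token).

    With ToCount installed, one could instead pass:
    lambda t: estimate_text_tokens(t, estimator=TextEstimator.RULE_BASED.UNIVERSAL)
    """
    return max(1, len(text) // 4)


chunks = ["a" * 40, "b" * 40, "c" * 40]
selected = fit_to_budget(chunks, budget=25, estimate=rough_estimate)
```

Because the estimates are approximate, it may be sensible to set the budget somewhat below the hard limit of the target system.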
Issues & bug reports
Just file an issue and describe the problem. We'll check it ASAP! You can also send an email to tocount@openscilab.com.
- Please complete the issue template
You can also join our Discord server.
References
1- Zheng, Lianmin, et al. "Lmsys-chat-1m: A large-scale real-world llm conversation dataset." International Conference on Learning Representations (ICLR) 2024 Spotlights.
2- Zhao, Wenting, et al. "Wildchat: 1m chatgpt interaction logs in the wild." International Conference on Learning Representations (ICLR) 2024 Spotlights.
Show your support
Star this repo
Give a ⭐️ if this project helped you!
Donate to our project
If you like our project, and we hope that you do, please consider supporting us. Our project is not, and never will be, run for profit. We need the money only so we can continue doing what we do ;-)
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
Unreleased
0.5 - 2026-01-02
Added
- DEEPSEEK_R1.LINEAR_ALL model
- DEEPSEEK_R1.LINEAR_ENGLISH model
- QWEN_QWQ.LINEAR_ALL model
- QWEN_QWQ.LINEAR_ENGLISH model
- LLAMA_3_1.LINEAR_ALL model
- LLAMA_3_1.LINEAR_ENGLISH model
Changed
- README.md updated
0.4 - 2025-12-17
Added
- Logo
Changed
- TIKTOKEN_CL100K.LINEAR_ALL model updated
- TIKTOKEN_CL100K.LINEAR_ENGLISH model updated
- TIKTOKEN_O200K.LINEAR_ALL model updated
- TIKTOKEN_O200K.LINEAR_ENGLISH model updated
- TIKTOKEN_R50K.LINEAR_ALL model updated
- TIKTOKEN_R50K.LINEAR_ENGLISH model updated
0.3 - 2025-10-21
Added
- TIKTOKEN_CL100K.LINEAR_ALL model
- TIKTOKEN_CL100K.LINEAR_ENGLISH model
- TIKTOKEN_O200K.LINEAR_ALL model
- TIKTOKEN_O200K.LINEAR_ENGLISH model
Changed
- README.md updated
- Python 3.14 added to test.yml
0.2 - 2025-10-02
Added
- TIKTOKEN_R50K.LINEAR_ALL model
- TIKTOKEN_R50K.LINEAR_ENGLISH model
Changed
- README.md updated
0.1 - 2025-08-30
Added
- RULE_BASED.UNIVERSAL model
- RULE_BASED.GPT_4 model
- RULE_BASED.GPT_3_5 model