ToCount: Lightweight Token Estimator

Overview

ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.

Installation

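PyPI

  • Run pip install tocount==0.5

Source code

  • Download the source for version 0.5, then run pip install . from the project directory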

Models

Rule-Based

Model Name            R²      MAE     RMSE    MedAE  D²
RULE_BASED.UNIVERSAL  0.8175  106.70  617.78  18     0.6377
RULE_BASED.GPT_3_5    0.7266  152.34  756.17  35     0.4828
RULE_BASED.GPT_4      0.6878  161.93  808.04  40     0.4502
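
For readers unfamiliar with the error columns, the following is a minimal sketch of how MAE, RMSE, and MedAE are conventionally computed from paired true and estimated token counts. The function name `error_metrics` and the toy numbers are illustrative assumptions; they are not part of ToCount and are not taken from its benchmark.

```python
import math
from statistics import mean, median


def error_metrics(true_counts, estimated_counts):
    """Return MAE, RMSE and MedAE for paired true/estimated token counts."""
    if not true_counts or len(true_counts) != len(estimated_counts):
        raise ValueError("inputs must be non-empty and of equal length")
    # Absolute error of each estimate.
    abs_errors = [abs(t - e) for t, e in zip(true_counts, estimated_counts)]
    return {
        "MAE": mean(abs_errors),                                   # mean absolute error
        "RMSE": math.sqrt(mean([err ** 2 for err in abs_errors])),  # root mean squared error
        "MedAE": median(abs_errors),                               # median absolute error
    }


# Toy numbers for illustration only (hypothetical, not benchmark data).
true_counts = [10, 20, 30, 40]
estimated_counts = [12, 18, 33, 50]
metrics = error_metrics(true_counts, estimated_counts)
print(metrics)
```

Because RMSE squares each error before averaging, a few large misses inflate it far more than they inflate MAE, while MedAE ignores them almost entirely. A large gap between the three values, as in the tables here, usually points to a small number of inputs with very large errors.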

Tiktoken R50K

Model Name                    R²      MAE     RMSE    MedAE  D²
TIKTOKEN_R50K.LINEAR_ALL      0.7334  152.39  733.40  28.55  0.4826
TIKTOKEN_R50K.LINEAR_ENGLISH  0.8703  62.76   508.20  8.87   0.7287

Tiktoken CL100K

Model Name                      R²      MAE    RMSE    MedAE  D²
TIKTOKEN_CL100K.LINEAR_ALL      0.9127  64.09  298.02  15.73  0.6804
TIKTOKEN_CL100K.LINEAR_ENGLISH  0.9711  27.43  185.07  6.34   0.8527

Tiktoken O200K

Model Name                     R²      MAE    RMSE    MedAE  D²
TIKTOKEN_O200K.LINEAR_ALL      0.9563  38.23  197.16  9.70   0.7818
TIKTOKEN_O200K.LINEAR_ENGLISH  0.9730  26.00  177.54  5.96   0.8581

Deepseek R1

Model Name                  R²      MAE    RMSE    MedAE  D²
DEEPSEEK_R1.LINEAR_ALL      0.9531  40.66  212.11  10.71  0.7741
DEEPSEEK_R1.LINEAR_ENGLISH  0.9696  28.44  192.36  6.36   0.8477

Qwen QwQ

Model Name               R²      MAE    RMSE    MedAE  D²
QWEN_QWQ.LINEAR_ALL      0.9342  45.50  257.97  12.17  0.7542
QWEN_QWQ.LINEAR_ENGLISH  0.9570  29.06  236.10  6.68   0.8457

Llama 3.1

Model Name                R²      MAE    RMSE    MedAE  D²
LLAMA_3_1.LINEAR_ALL      0.9538  44.37  207.58  11.70  0.7578
LLAMA_3_1.LINEAR_ENGLISH  0.9731  26.59  177.94  6.24   0.8564

ℹ️ The training and testing datasets are drawn from LMSYS-Chat-1M [1] and WildChat [2].
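
The LINEAR_* model names suggest a linear fit from some text feature to a token count. The sketch below shows one way such an estimator could work, assuming a single feature (character count) and ordinary least squares. The feature choice, the helper names `fit_line` and `estimate_tokens`, and the training pairs are all hypothetical; ToCount's actual features and coefficients are not described here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b


def estimate_tokens(text, a, b):
    """Apply the fitted line to a text's character count (assumed feature)."""
    return max(0, round(a * len(text) + b))


# Made-up (character count, token count) pairs, for illustration only.
char_counts = [4, 8, 12, 16]
token_counts = [1, 2, 3, 4]
a, b = fit_line(char_counts, token_counts)
print(a, b, estimate_tokens("How are you?", a, b))
```

A model of this kind needs only a multiplication and an addition at inference time, which fits the "lightweight" goal: no tokenizer vocabulary has to be loaded to get an estimate.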

Usage

>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
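
As a sketch of the token-budgeting use case mentioned in the overview, the helper below keeps messages, in order, while their estimated token total fits a budget. It takes the estimator as a plain callable so that it runs without ToCount installed. `select_within_budget` and the stand-in `rough_estimate` (about four characters per token) are hypothetical and not part of the library.

```python
from typing import Callable, List, Sequence, Tuple


def select_within_budget(
    texts: Sequence[str],
    budget: int,
    estimate: Callable[[str], int],
) -> Tuple[List[str], int]:
    """Keep texts in order until the next one would exceed the token budget."""
    kept: List[str] = []
    used = 0
    for text in texts:
        cost = estimate(text)
        if used + cost > budget:
            break  # stop at the first text that does not fit
        kept.append(text)
        used += cost
    return kept, used


def rough_estimate(text: str) -> int:
    """Stand-in estimator (assumption): roughly four characters per token."""
    return max(1, len(text) // 4)


# With ToCount installed, the estimator from the usage snippet could be passed instead:
#   from tocount import estimate_text_tokens, TextEstimator
#   estimate = lambda t: estimate_text_tokens(t, estimator=TextEstimator.RULE_BASED.UNIVERSAL)

messages = ["How are you?", "Summarize this paragraph.", "Thanks!"]
kept, used = select_within_budget(messages, budget=9, estimate=rough_estimate)
print(kept, used)
```

Passing the estimator as an argument keeps the budgeting logic independent of which estimation strategy is chosen.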

Issues & bug reports

Just file an issue and describe it; we'll check it ASAP. Alternatively, send an email to tocount@openscilab.com.

  • Please complete the issue template

You can also join our Discord server.

References

[1] Zheng, Lianmin, et al. "Lmsys-chat-1m: A large-scale real-world llm conversation dataset." International Conference on Learning Representations (ICLR) 2024 Spotlights.
[2] Zhao, Wenting, et al. "Wildchat: 1m chatgpt interaction logs in the wild." International Conference on Learning Representations (ICLR) 2024 Spotlights.

Show your support

Star this repo

Give a ⭐️ if this project helped you!

Donate to our project

If you like our project, and we hope you do, please consider supporting us. Our project is not, and never will be, run for profit. We need the money only so we can continue doing what we do ;-)

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased

0.5 - 2026-01-02

Added

  • DEEPSEEK_R1.LINEAR_ALL model
  • DEEPSEEK_R1.LINEAR_ENGLISH model
  • QWEN_QWQ.LINEAR_ALL model
  • QWEN_QWQ.LINEAR_ENGLISH model
  • LLAMA_3_1.LINEAR_ALL model
  • LLAMA_3_1.LINEAR_ENGLISH model

Changed

  • README.md updated

0.4 - 2025-12-17

Added

  • Logo

Changed

  • TIKTOKEN_CL100K.LINEAR_ALL model updated
  • TIKTOKEN_CL100K.LINEAR_ENGLISH model updated
  • TIKTOKEN_O200K.LINEAR_ALL model updated
  • TIKTOKEN_O200K.LINEAR_ENGLISH model updated
  • TIKTOKEN_R50K.LINEAR_ALL model updated
  • TIKTOKEN_R50K.LINEAR_ENGLISH model updated

0.3 - 2025-10-21

Added

  • TIKTOKEN_CL100K.LINEAR_ALL model
  • TIKTOKEN_CL100K.LINEAR_ENGLISH model
  • TIKTOKEN_O200K.LINEAR_ALL model
  • TIKTOKEN_O200K.LINEAR_ENGLISH model

Changed

  • README.md updated
  • Python 3.14 added to test.yml

0.2 - 2025-10-02

Added

  • TIKTOKEN_R50K.LINEAR_ALL model
  • TIKTOKEN_R50K.LINEAR_ENGLISH model

Changed

  • README.md updated

0.1 - 2025-08-30

Added

  • RULE_BASED.UNIVERSAL model
  • RULE_BASED.GPT_4 model
  • RULE_BASED.GPT_3_5 model
