
Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI.

Project description

Count tokens


A simple tool with one purpose: counting the tokens in a text file.

Requirements

This package uses the tiktoken library for tokenization.

Installation

For command-line usage, install the package in an isolated environment with pipx:

$ pipx install count-tokens

or install it in your current environment with pip.

Usage

Open a terminal and run:

$ count-tokens document.txt

You should see something like this:

File: document.txt
Encoding: cl100k_base
Number of tokens: 67

If you want to see just the token count, run:

$ count-tokens document.txt --quiet

and the output will be:

67

NOTE: tiktoken supports three encodings used by OpenAI models:

| Encoding name        | OpenAI models                                     |
|----------------------|---------------------------------------------------|
| cl100k_base          | gpt-4, gpt-3.5-turbo, text-embedding-ada-002      |
| p50k_base            | Codex models, text-davinci-002, text-davinci-003  |
| r50k_base (or gpt2)  | GPT-3 models like davinci                         |

To use count-tokens with an encoding other than the default cl100k_base, pass the additional argument -e or --encoding:

$ count-tokens document.txt -e r50k_base

Approximate number of tokens

If you need the result a bit faster and don't need the exact number of tokens, you can use the --approx parameter: pass w for an approximation based on the number of words, or c for an approximation based on the number of characters.

$ count-tokens document.txt --approx w

It is based on the assumption that there are 4/3 (1 and 1/3) tokens per word and 4 characters per token.
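The two approximations described above can be sketched in a few lines of Python. The helper names below are hypothetical (they are not part of the package's API); they just apply the stated ratios:

```python
def approx_tokens_by_words(text: str) -> int:
    # --approx w: assume roughly 4/3 tokens per whitespace-separated word.
    return round(len(text.split()) * 4 / 3)

def approx_tokens_by_chars(text: str) -> int:
    # --approx c: assume roughly 4 characters per token.
    return round(len(text) / 4)

text = "The quick brown fox jumps over the lazy dog."
print(approx_tokens_by_words(text))  # 9 words  -> 12
print(approx_tokens_by_chars(text))  # 44 chars -> 11
```

Both are rough heuristics tuned for English text; counts for code or non-English text can deviate noticeably.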

Programmatic usage

```python
from count_tokens.count import count_tokens_in_file

num_tokens = count_tokens_in_file("document.txt")
```

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.")
```

For both functions you can use the encoding parameter to specify the encoding used by the model:

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.", encoding="cl100k_base")
```

The default value for encoding is cl100k_base.

Related Projects

  • tiktoken - tokenization library used by this package

Credits

Thanks to the authors of the tiktoken library for open sourcing their work.

License

MIT © Krystian Safjan.

Project details


Download files

Download the file for your platform.

Source Distribution

count_tokens-0.7.0.tar.gz (3.1 kB)

Uploaded Source

Built Distribution

count_tokens-0.7.0-py3-none-any.whl (3.8 kB)

Uploaded Python 3

File details

Details for the file count_tokens-0.7.0.tar.gz.

File metadata

  • Download URL: count_tokens-0.7.0.tar.gz
  • Upload date:
  • Size: 3.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.11.5 Darwin/22.6.0

File hashes

Hashes for count_tokens-0.7.0.tar.gz

  • SHA256: 040c6c24295c6176b9d8ce02ad5485ebc6cbbf30f0cc86eee5651d7cbdf2081a
  • MD5: 19192ba778e5ffe008082aece6cef9a4
  • BLAKE2b-256: a388edef8993c8bfac8f0e70e4adc1fb8ecf9b81c06227d960e2ceb68c5c7a0f


File details

Details for the file count_tokens-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: count_tokens-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 3.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.6.1 CPython/3.11.5 Darwin/22.6.0

File hashes

Hashes for count_tokens-0.7.0-py3-none-any.whl

  • SHA256: 2a3a52d6714fd1b9ebb595a202059c3f184ef06f0317a6bd1d48a7dda5db9e01
  • MD5: 4293cb46653c62e4acbe342b9df4ea12
  • BLAKE2b-256: ea58f504216b7f6aebf132ea9b130a945ed2ad3e4e0ee37ad82c918ecc97374e

