Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI.
Count tokens
A simple tool with one purpose: counting the tokens in a text file.
Requirements
This package uses the tiktoken library for tokenization.
Installation
For command-line usage, install the package in an isolated environment with pipx:
$ pipx install count-tokens
or install it in your current environment with pip.
Usage
Open a terminal and run:
$ count-tokens document.txt
You should see something like this:
File: document.txt
Encoding: cl100k_base
Number of tokens: 67
If you want to see just the token count, run:
$ count-tokens document.txt --quiet
and the output will be:
67
NOTE: tiktoken supports three encodings used by OpenAI models:

Encoding name | OpenAI models
---|---
`cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`
`p50k_base` | Codex models, `text-davinci-002`, `text-davinci-003`
`r50k_base` (or `gpt2`) | GPT-3 models like `davinci`
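For scripting, the mapping above can be expressed as a small lookup to pick the right encoding for a given model. This is an illustrative sketch based only on the models listed in the table, not an API provided by this package:

```python
# Model -> encoding mapping, taken from the tiktoken table above.
MODEL_ENCODINGS = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}

def encoding_for(model: str) -> str:
    """Return the encoding name for a model, falling back to the default."""
    return MODEL_ENCODINGS.get(model, "cl100k_base")

print(encoding_for("text-davinci-003"))  # p50k_base
```

The resulting name can then be passed to `count-tokens` via `-e`/`--encoding`.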
To use count-tokens with an encoding other than the default `cl100k_base`, pass the additional argument `-e` or `--encoding`:
$ count-tokens document.txt -e r50k_base
Approximate number of tokens
If you need the results a bit faster and don't need the exact number of tokens, you can use the `--approx` parameter with `w` for an approximation based on the number of words, or `c` for an approximation based on the number of characters.
$ count-tokens document.txt --approx w
The approximation assumes 4/3 (1 and 1/3) tokens per word and 4 characters per token.
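The two approximation modes can be sketched in a few lines of plain Python using the ratios stated above. These helper names are illustrative, not part of the package:

```python
def approx_tokens_by_words(text: str) -> int:
    # 4/3 tokens per word, per the assumption above.
    return round(len(text.split()) * 4 / 3)

def approx_tokens_by_chars(text: str) -> int:
    # 4 characters per token, per the assumption above.
    return round(len(text) / 4)

sample = "This is a short example sentence."
print(approx_tokens_by_words(sample))  # 6 words  -> 8
print(approx_tokens_by_chars(sample))  # 33 chars -> 8
```

Both modes avoid running the tokenizer at all, which is why they are faster than an exact count.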
## Programmatic usage
```python
from count_tokens.count import count_tokens_in_file

num_tokens = count_tokens_in_file("document.txt")
```

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.")
```

For both functions, you can use the `encoding` parameter to specify the encoding used by the model:

```python
from count_tokens.count import count_tokens_in_string

num_tokens = count_tokens_in_string("This is a string.", encoding="cl100k_base")
```

The default value for `encoding` is `cl100k_base`.
Related Projects
- tiktoken - tokenization library used by this package
Credits
Thanks to the authors of the tiktoken library for open sourcing their work.
License