Count the number of tokens in a text file using the tiktoken tokenizer from OpenAI.
Project description
Count tokens
A simple tool with one purpose: counting the tokens in a text file.
Requirements
This package uses the tiktoken library for tokenization.
Installation
To use it from the command line, install the package in an isolated environment with pipx:
$ pipx install count-tokens
or install it in your current environment with pip.
Usage
Open a terminal and run:
$ count-tokens document.txt
You should see something like this:
File: document.txt
Encoding: cl100k_base
Number of tokens: 67
If you want to see only the token count, run:
$ count-tokens document.txt --quiet
and the output will be:
67
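For readers who want the same count from Python, the idea can be sketched directly on top of tiktoken. This is a minimal sketch, not this package's actual implementation: the function name `count_file_tokens` and its `encoder` parameter are hypothetical, and the file is assumed to be UTF-8 text. Only `tiktoken.get_encoding` and the encoder's `encode` method are real tiktoken calls.

```python
from pathlib import Path


def count_file_tokens(path, encoding_name="cl100k_base", encoder=None):
    """Return the number of tokens in the text file at `path`.

    Hypothetical helper for illustration; not part of count-tokens.
    """
    if encoder is None:
        # Imported lazily so a stand-in encoder can be passed instead.
        import tiktoken

        encoder = tiktoken.get_encoding(encoding_name)
    # Assumption: the file is UTF-8 encoded text.
    text = Path(path).read_text(encoding="utf-8")
    # encode() returns a list of token ids; its length is the token count.
    return len(encoder.encode(text))
```

Calling `count_file_tokens("document.txt")` would then return an integer comparable to the number printed with `--quiet`, assuming the same encoding is used.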
NOTE: tiktoken supports three encodings used by OpenAI models:
Encoding name | OpenAI models |
---|---|
cl100k_base | gpt-4, gpt-3.5-turbo, text-embedding-ada-002 |
p50k_base | Codex models, text-davinci-002, text-davinci-003 |
r50k_base (or gpt2) | GPT-3 models like davinci |
To use count-tokens with an encoding other than the default cl100k_base, pass the additional argument -e or --encoding:
$ count-tokens document.txt -e r50k_base
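The command-line options described above could be wired up roughly as follows. This is a sketch under stated assumptions, not the package's real source: `build_parser` and `format_report` are hypothetical names, and restricting `--encoding` to the names from the table above is an illustrative choice the actual tool may not make.

```python
import argparse

# Encoding names taken from the table above.
ENCODINGS = ("cl100k_base", "p50k_base", "r50k_base", "gpt2")


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="count-tokens")
    parser.add_argument("file", help="path to the text file")
    parser.add_argument(
        "-e",
        "--encoding",
        default="cl100k_base",
        choices=ENCODINGS,  # assumption: the real tool may accept other names
        help="tiktoken encoding name",
    )
    parser.add_argument(
        "--quiet", action="store_true", help="print only the token count"
    )
    return parser


def format_report(file, encoding, n_tokens, quiet=False):
    """Format the output in the two shapes shown in the Usage section."""
    if quiet:
        return str(n_tokens)
    return f"File: {file}\nEncoding: {encoding}\nNumber of tokens: {n_tokens}"
```

Keeping parsing and formatting separate from the counting itself makes each piece easy to check without tiktoken installed.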
Related Projects
- tiktoken - tokenization library used by this package
Credits
Thanks to the authors of the tiktoken library for open sourcing their work.
License
Hashes for count_tokens-0.4.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 14d3db2356a7c39d6e3ba13f345ee05d5fe6907d1a44bd7a35999ca0f6ef7851 |
MD5 | ac7ca0714af473e9a8f38d5b751a6c10 |
BLAKE2b-256 | 6f9fc76de1e98a39e7795de10e6a40f569251453b830becd2437d77626f58471 |