# wtok: ttok for Whisper, one sentence at a time

Count and truncate text based on tokens, one sentence at a time.
## Background

Whisper models the conditional distribution of a token given a sequence of past tokens. This tool can count tokens using OpenAI's tiktoken library, and it can also truncate text to a specified number of tokens.
## Installation

Install this tool using `pip`:

```bash
pip install wtok
```
## Counting tokens

Provide text as arguments to this tool to count tokens:

```bash
wtok Hello world
```
```
2
```

You can also pipe text into the tool:

```bash
echo -n "Hello world" | wtok
```
```
2
```

Here the `echo -n` option prevents `echo` from adding a newline - without that you would get a token count of 3.

To pipe in text and then append extra tokens from arguments, use the `-i -` option:

```bash
echo -n "Hello world" | wtok more text -i -
```
```
6
```
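Under the hood, a token count is just the length of the encoded ID list. Here is a minimal Python sketch of equivalent counting with tiktoken, assuming only its public API (wtok's own internals may differ):

```python
import tiktoken

# cl100k_base is the encoding used for GPT-3.5 and GPT-4,
# matching this tool's default tokenizer.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    # The count is simply the length of the encoded token ID list.
    return len(encoding.encode(text))

print(count_tokens("Hello world"))    # 2
print(count_tokens("Hello world\n"))  # 3 - the trailing newline adds a token
```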
## Different models

By default, the tokenizer model for GPT-3.5 and GPT-4 is used.

To use the model for GPT-2 and GPT-3, add `--model gpt2` (or `-m gpt2` for short):

```bash
wtok boo Hello there this is -m gpt2
```
```
6
```

Compared to GPT-3.5:

```bash
wtok boo Hello there this is
```
```
5
```

Further model options are documented here.
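The model name selects which tiktoken encoding does the counting, and different encodings split the same text differently. A sketch of the comparison above, assuming wtok simply defers to tiktoken:

```python
import tiktoken

text = "boo Hello there this is"

# GPT-2/GPT-3 and GPT-3.5/GPT-4 use different encodings,
# so the same text produces different token counts.
gpt2 = tiktoken.get_encoding("gpt2")
cl100k = tiktoken.get_encoding("cl100k_base")

print(len(gpt2.encode(text)))    # 6
print(len(cl100k.encode(text)))  # 5
```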
## Truncating text

Use the `-t 10` or `--truncate 10` option to truncate text to a specified number of tokens:

```bash
wtok This is too many tokens -t 3
```
```
This is too
```
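Truncation happens at the token level rather than the character level: the text is encoded, the ID list is sliced, and the kept IDs are decoded back to text. A sketch of that pattern, assuming the standard tiktoken approach rather than wtok's exact implementation:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def truncate(text: str, max_tokens: int) -> str:
    # Keep only the first max_tokens token IDs, then decode back to text.
    tokens = encoding.encode(text)
    return encoding.decode(tokens[:max_tokens])

print(truncate("This is too many tokens", 3))  # This is too
```

Because the slice falls on token boundaries, the truncated text can end mid-word when a token splits inside a word.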
## Viewing tokens

The `--encode` option can be used to view the integer token IDs for the incoming text:

```bash
wtok Hello world --encode
```
```
9906 1917
```

The `--decode` option reverses this process:

```bash
wtok 9906 1917 --decode
```
```
Hello world
```

Add `--tokens` to either of these options to see a detailed breakdown of the tokens:

```bash
wtok Hello world --encode --tokens
```
```
[b'Hello', b' world']
```
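These three options map directly onto tiktoken's encode/decode calls. A sketch of the same round trip, using only calls from tiktoken's public API:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# --encode: text to integer token IDs.
ids = encoding.encode("Hello world")
print(ids)  # [9906, 1917]

# --decode: integer token IDs back to text.
print(encoding.decode([9906, 1917]))  # Hello world

# --tokens: the raw bytes behind each individual token ID.
print([encoding.decode_single_token_bytes(t) for t in ids])
# [b'Hello', b' world']
```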
## Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

```bash
cd wtok
python -m venv venv
source venv/bin/activate
```

Now install for editing:

```bash
pip install -e .
```