Count and truncate whisper text based on tokens

wtok: ttok for Whisper

Count and truncate text based on tokens

Background

Large language models such as GPT-3.5 and GPT-4, and speech models such as Whisper, work in terms of tokens.

This tool can count tokens, using OpenAI's tiktoken library.

It can also truncate text to a specified number of tokens.

Installation

Install this tool using pip:

pip install wtok

Counting tokens

Provide text as arguments to this tool to count tokens:

wtok Hello world
2

You can also pipe text into the tool:

echo -n "Hello world" | wtok
2

Here the -n option prevents echo from adding a trailing newline; without it, the token count would be 3.
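To see where the extra token comes from, the same count can be sketched in Python with tiktoken. This is a minimal sketch, not wtok's own code: `load_encoding`, `count_tokens` and `newline_cost` are hypothetical helpers, and `cl100k_base` is assumed to be the default encoding because tiktoken maps GPT-3.5 and GPT-4 to it.

```python
def load_encoding(name="cl100k_base"):
    """Load a tiktoken encoding by name.

    Assumption: cl100k_base is the encoding behind the tool's default.
    tiktoken is imported lazily so the helpers below work with any
    object that offers an encode(str) method.
    """
    import tiktoken  # third-party: pip install tiktoken

    return tiktoken.get_encoding(name)


def count_tokens(text, encoding):
    """Count tokens by encoding the text and measuring the result."""
    return len(encoding.encode(text))


def newline_cost(text, encoding):
    """Return how many extra tokens a trailing newline adds to `text`."""
    return count_tokens(text + "\n", encoding) - count_tokens(text, encoding)
```

With tiktoken installed, `newline_cost("Hello world", load_encoding())` should report the one-token difference described above, provided the encoding assumption holds.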

To pipe in text and then append extra tokens from arguments, use the -i - option:

echo -n "Hello world" | wtok more text -i -
6

Different models

By default, the tokenizer model for GPT-3.5 and GPT-4 is used.

To use the model for GPT-2 and GPT-3, add --model gpt2 (or its short form, -m gpt2):

wtok boo Hello there this is -m gpt2
6

Compare this with the default GPT-3.5 and GPT-4 tokenizer:

wtok boo Hello there this is
5

Further model options are documented here.
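The comparison above can be reproduced directly against tiktoken. The sketch below is illustrative only: `count_by_encoding` and `load_default_and_gpt2` are hypothetical helpers, and mapping the tool's default to `cl100k_base` is an assumption rather than something this README states.

```python
def count_by_encoding(text, encodings):
    """Map each encoding's label to the number of tokens it yields.

    `encodings` is a dict of label -> object with an encode(str) method.
    """
    return {label: len(enc.encode(text)) for label, enc in encodings.items()}


def load_default_and_gpt2():
    """Load the two tokenizers discussed above via tiktoken.

    Assumption: "cl100k_base" corresponds to the GPT-3.5/GPT-4 default.
    """
    import tiktoken  # third-party: pip install tiktoken

    return {
        "default": tiktoken.get_encoding("cl100k_base"),
        "gpt2": tiktoken.get_encoding("gpt2"),
    }
```

Calling `count_by_encoding("boo Hello there this is", load_default_and_gpt2())` should mirror the 5 versus 6 comparison shown above if that assumption is correct.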

Truncating text

Use the -t 10 or --truncate 10 option to truncate text to a specified number of tokens:

wtok This is too many tokens -t 3
This is too
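Truncation of this kind amounts to encoding the text, keeping the first N token IDs, and decoding them again. The function below is a minimal sketch under that assumption; `truncate_to_tokens` is a hypothetical name, and the tool's actual implementation may differ.

```python
def truncate_to_tokens(text, max_tokens, encoding):
    """Keep the first `max_tokens` tokens of `text` and decode them back.

    `encoding` needs encode(str) -> list and decode(list) -> str,
    which tiktoken's Encoding objects provide.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = encoding.encode(text)
    # Slicing past the end is harmless: short inputs come back unchanged.
    return encoding.decode(tokens[:max_tokens])
```

With tiktoken installed, passing `tiktoken.get_encoding("cl100k_base")` as `encoding` gives a way to try this on real text. Note that a cut can land in the middle of a multi-byte character, in which case tiktoken's `decode` substitutes a replacement character by default.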

Viewing tokens

The --encode option can be used to view the integer token IDs for the incoming text:

wtok Hello world --encode
9906 1917

The --decode option reverses this process:

wtok 9906 1917 --decode
Hello world

Add --tokens to either of these options to see a detailed breakdown of the tokens:

wtok Hello world --encode --tokens
[b'Hello', b' world']
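A similar breakdown can be built in Python by pairing each token ID with the bytes it stands for. This is a sketch, not the tool's code: `token_breakdown` is a hypothetical helper that relies on tiktoken's `encode` and `decode_single_token_bytes` methods.

```python
def token_breakdown(text, encoding):
    """Return (token_id, token_bytes) pairs for `text`.

    `encoding` needs encode(str) and decode_single_token_bytes(int),
    both of which tiktoken's Encoding objects provide.
    """
    return [
        (token_id, encoding.decode_single_token_bytes(token_id))
        for token_id in encoding.encode(text)
    ]
```

Assuming the `cl100k_base` encoding, the examples above suggest that "Hello world" would yield the IDs 9906 and 1917 paired with b'Hello' and b' world'.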

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd wtok
python -m venv venv
source venv/bin/activate

Now install for editing:

pip install -e .
