Tokker
A fast, simple CLI tool for tokenizing text using OpenAI's tiktoken library. Get accurate token counts for GPT models with a single command.
Features
- Simple Usage: Just tok 'your text' - that's it!
- Multiple Tokenizers: Support for o200k_base (GPT-4o) and cl100k_base (GPT-4) tokenizers
- Flexible Output: JSON, plain text, and summary output formats
- Configuration: Persistent configuration for default tokenizer settings
- Text Analysis: Token count, word count, character count, and token frequency analysis
- Cross-platform: Works on Windows, macOS, and Linux
Installation
Install from PyPI with pip:
pip install tokker
That's it! The tok command is now available in your terminal.
Main commands
Quick Tips:
- Use single quotes to avoid shell interpretation:
  tok 'Hello world!'
- Pipe text from other commands:
  echo "Hello world" | tok
- Process files:
  cat file.txt | tok --format summary
- Chain with other tools:
  curl -s https://example.com | tok
- Set your preferred tokenizer once:
  tok --set-default-tokenizer o200k_base
Full output
$ tok 'Hello world'
{
  "converted": "Hello⎮ world",
  "token_strings": ["Hello", " world"],
  "token_ids": [24912, 2375],
  "token_count": 2,
  "word_count": 2,
  "char_count": 11,
  "pivot": {
    "Hello": 1,
    " world": 1
  },
  "tokenizer": "o200k_base"
}
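The fields in this output appear to be related in a simple way. The sketch below parses the sample shown above (copied verbatim, not produced by running tok) and checks a few relationships. These relationships are assumptions inferred from this one sample, not documented guarantees of tokker.

```python
import json
from collections import Counter

# Sample copied from the `tok 'Hello world'` output above.
raw = """{
  "converted": "Hello⎮ world",
  "token_strings": ["Hello", " world"],
  "token_ids": [24912, 2375],
  "token_count": 2,
  "word_count": 2,
  "char_count": 11,
  "pivot": {"Hello": 1, " world": 1},
  "tokenizer": "o200k_base"
}"""

result = json.loads(raw)

# Joining the token strings gives back the original input text.
text = "".join(result["token_strings"])

# Assumed relationships between the fields (inferred from the sample only).
delimiter = "⎮"
checks = {
    "converted": result["converted"] == delimiter.join(result["token_strings"]),
    "token_count": result["token_count"] == len(result["token_ids"]),
    "word_count": result["word_count"] == len(text.split()),
    "char_count": result["char_count"] == len(text),
    "pivot": result["pivot"] == dict(Counter(result["token_strings"])),
}
print(checks)
```

For this sample every check holds, which suggests that pivot is a frequency count of token strings and that converted is the token strings joined with the delimiter.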
Plain Text Output
$ tok 'Hello world' --format plain
Hello⎮ world
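If you need the individual tokens back from the plain format, splitting on the delimiter is usually enough. This is a minimal sketch; split_plain_output is a hypothetical helper, not part of tokker, and it assumes the delimiter is the ⎮ character shown above.

```python
DELIMITER = "⎮"  # delimiter shown in the sample output above


def split_plain_output(line: str, delimiter: str = DELIMITER) -> list[str]:
    """Split one line of plain output back into token strings (hypothetical helper)."""
    return line.split(delimiter)


tokens = split_plain_output("Hello⎮ world")
print(tokens)       # ['Hello', ' world']
print(len(tokens))  # 2
```

Note that this round trip is ambiguous when the input text itself contains the delimiter character; in that case the JSON format is the safer choice.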
Summary Statistics
$ tok 'Hello world' --format summary
{
  "token_count": 2,
  "word_count": 2,
  "char_count": 11,
  "tokenizer": "o200k_base"
}
Other Commands
Using Different Tokenizers
$ tok 'Hello world' --tokenizer cl100k_base
Set Default Tokenizer
$ tok --set-default-tokenizer o200k_base
✓ Default tokenizer set to: o200k_base
Configuration saved to: ~/.config/tokker/tokenizer_config.json
Other
usage: tok [-h] [--tokenizer {o200k_base,cl100k_base}]
           [--format {json,plain,summary}]
           [--set-default-tokenizer {o200k_base,cl100k_base}]
           [text]

positional arguments:
  text                      Text to tokenize (or read from stdin if not provided)

options:
  --tokenizer               Tokenizer to use (o200k_base, cl100k_base)
  --format                  Output format (json, plain, summary)
  --set-default-tokenizer   Set default tokenizer
  -h, --help                Show help message
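When calling tok from a script, it can help to build the argument list from the documented options and validate the choices first. The sketch below does that with the standard library only. build_tok_args and run_tok are hypothetical helpers, not part of tokker; run_tok assumes the tok command is installed and on PATH.

```python
import subprocess

# Choices taken from the usage text above.
TOKENIZERS = ("o200k_base", "cl100k_base")
FORMATS = ("json", "plain", "summary")


def build_tok_args(text=None, tokenizer=None, fmt=None):
    """Build an argv list for tok from the documented options (hypothetical helper)."""
    args = ["tok"]
    if tokenizer is not None:
        if tokenizer not in TOKENIZERS:
            raise ValueError(f"unknown tokenizer: {tokenizer}")
        args += ["--tokenizer", tokenizer]
    if fmt is not None:
        if fmt not in FORMATS:
            raise ValueError(f"unknown format: {fmt}")
        args += ["--format", fmt]
    if text is not None:
        args.append(text)
    return args


def run_tok(text, tokenizer=None, fmt=None):
    """Run tok with the text on stdin and return its stdout (needs tokker installed)."""
    completed = subprocess.run(
        build_tok_args(tokenizer=tokenizer, fmt=fmt),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


print(build_tok_args("Hello world", tokenizer="cl100k_base", fmt="summary"))
```

Passing the text on stdin in run_tok, rather than as the positional argument, avoids problems with input that begins with a dash and could be mistaken for an option.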
Tokenizers
- o200k_base (Default): used by GPT-4o, GPT-4o-mini; 200K vocab size
- cl100k_base: used by GPT-4, GPT-3.5; 100K vocab size
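To pick a tokenizer by model name in a script, a small lookup table based on the list above may be enough. This is a sketch only: tokenizer_for is a hypothetical helper, and the lowercase model keys are an assumed spelling of the model names listed above.

```python
# Mapping taken from the tokenizer list above; key spelling is an assumption.
MODEL_TOKENIZERS = {
    "gpt-4o": "o200k_base",
    "gpt-4o-mini": "o200k_base",
    "gpt-4": "cl100k_base",
    "gpt-3.5": "cl100k_base",
}
DEFAULT_TOKENIZER = "o200k_base"  # documented default


def tokenizer_for(model: str) -> str:
    """Return the tokenizer name for a model, or the default if unknown."""
    return MODEL_TOKENIZERS.get(model.strip().lower(), DEFAULT_TOKENIZER)


print(tokenizer_for("GPT-4"))   # cl100k_base
print(tokenizer_for("gpt-4o"))  # o200k_base
```

The returned name can be passed to the --tokenizer option. Falling back to the default for unknown models is a design choice of this sketch; raising an error may be preferable when an exact count matters.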
Configuration
Tokker stores your preferences in ~/.config/tokker/tokenizer_config.json:
{
  "default_tokenizer": "o200k_base",
  "delimiter": "⎮"
}
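Other scripts can read the same file to stay consistent with your tok settings. The sketch below loads it with fallbacks for a missing or malformed file. load_config is a hypothetical helper and does not describe how tokker itself reads the file; the default values are taken from the example above.

```python
import json
from pathlib import Path

# Path and keys as shown above.
CONFIG_PATH = Path.home() / ".config" / "tokker" / "tokenizer_config.json"
DEFAULTS = {"default_tokenizer": "o200k_base", "delimiter": "⎮"}


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Return settings from path, using DEFAULTS for anything missing."""
    try:
        stored = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        stored = {}
    if not isinstance(stored, dict):
        stored = {}
    # Stored values override the defaults key by key.
    return {**DEFAULTS, **stored}


config = load_config()
print(config["default_tokenizer"])
```

Merging over the defaults means a config file that sets only one key still yields a complete settings dictionary.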
Programmatic Usage
You can also use tokker in your Python code:
import tokker
# Count tokens
count = tokker.count_tokens("Hello world")
print(f"Token count: {count}")
# Full tokenization
result = tokker.tokenize_text("Hello world", "o200k_base")
print(result["token_count"])
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Issues and pull requests are welcome! Visit the GitHub repository.
Acknowledgments
- OpenAI for the tiktoken library