A CLI tool for token counting and analysis using OpenAI's tiktoken
Tokker CLI
An open-source, locally run CLI tool that tokenizes text using OpenAI's tiktoken library. Support for Alfred / Aircast on macOS and for the HuggingFace transformers library is planned for later releases.
Features
- Token Counting: Accurate token count using OpenAI's tiktoken library
- Multiple Tokenizers: Support for cl100k_base (GPT-4) and o200k_base (GPT-4o) tokenizers
- Flexible Output: JSON, plain text, and summary output formats
- Configuration: Persistent configuration for default tokenizer and delimiter settings
- Text Analysis: Word count, character count, and token frequency analysis
- Cross-platform: Works on Windows, macOS, and Linux
Setup
TBD
Usage
Get full output
$ tokker --text 'hello world'
{
'converted': 'hello⎮ world',
'token_strings': ['hello', ' world'],
'token_ids': [15339, 1917],
'token_count': 2,
'word_count': 2,
'char_count': 11,
'pivot': {
'hello': 1,
' world': 1
},
'tokenizer': 'cl100k_base'
}
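The fields of the full output can all be derived from the text and its token list. The following is a minimal sketch of that derivation, not Tokker's actual implementation: `build_report` is a hypothetical helper, whitespace splitting for `word_count` is an assumption, and the token strings and IDs are passed in directly (copied from the example above) so the snippet runs without tiktoken installed. With tiktoken, the IDs would typically come from `tiktoken.get_encoding(name).encode(text)`.

```python
from collections import Counter


def build_report(text, token_strings, token_ids, tokenizer, delimiter="⎮"):
    """Assemble a report shaped like the full output above (hypothetical helper)."""
    return {
        "converted": delimiter.join(token_strings),
        "token_strings": list(token_strings),
        "token_ids": list(token_ids),
        "token_count": len(token_ids),
        "word_count": len(text.split()),  # assumption: words are whitespace-separated
        "char_count": len(text),
        "pivot": dict(Counter(token_strings)),  # token -> number of occurrences
        "tokenizer": tokenizer,
    }


# Token strings and IDs are taken from the example output, not computed here.
report = build_report(
    "hello world",
    token_strings=["hello", " world"],
    token_ids=[15339, 1917],
    tokenizer="cl100k_base",
)
print(report["converted"])  # hello⎮ world
print(report["pivot"])      # {'hello': 1, ' world': 1}
```

Note that the leading space belongs to the second token, which is why the delimiter appears before the space in the converted text.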
Get plain (delimited) text
$ tokker --text 'hello world' --format plain
hello⎮ world
Get summary
$ tokker --text 'hello world' --format summary
{
'token_count': 2,
'word_count': 2,
'char_count': 11,
'tokenizer': 'cl100k_base'
}
Run specific (non-default) tokenizer
$ tokker --text 'hello world' --tokenizer o200k_base
Set default tokenizer
# Set default tokenizer
$ tokker --set-default-tokenizer o200k_base
✓ Default tokenizer set to: o200k_base
Configuration saved to: /home/user/.config/tokker/tokenizer_config.json
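The flags used in the examples above could be declared with `argparse` roughly as follows. This is a sketch under assumptions rather than Tokker's real parser: `build_parser` is a hypothetical name, `json` as the name and default of the full output format is a guess, and only the flags that appear in this README are covered.

```python
import argparse

TOKENIZERS = ["cl100k_base", "o200k_base"]


def build_parser():
    """Hypothetical parser covering only the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="tokker")
    parser.add_argument("--text", help="text to tokenize")
    parser.add_argument(
        "--format",
        choices=["json", "plain", "summary"],
        default="json",  # assumption: the full output is the default format
    )
    parser.add_argument(
        "--tokenizer",
        choices=TOKENIZERS,
        default=None,  # None means: fall back to the configured default
    )
    parser.add_argument(
        "--set-default-tokenizer",
        choices=TOKENIZERS,
        metavar="NAME",
    )
    return parser


args = build_parser().parse_args(["--text", "hello world", "--format", "plain"])
print(args.text, args.format, args.tokenizer)  # hello world plain None
```

Leaving `--tokenizer` unset by default lets the caller distinguish "not given" from an explicit choice, so the stored default only applies in the first case.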
Configuration
Tokker stores configuration in ~/.config/tokker/tokenizer_config.json:
{
"default_tokenizer": "cl100k_base",
"delimiter": "⎮"
}
- default_tokenizer: Default tokenizer to use (cl100k_base or o200k_base)
- delimiter: Character used to separate tokens in plain text output
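A persistent configuration of this shape can be handled with the standard library alone. The sketch below is illustrative and not taken from Tokker's `config.py`: `load_config`, `save_config`, and `DEFAULTS` are hypothetical names, and falling back to defaults for a missing file or missing keys is an assumed behavior. The demo writes to a temporary directory instead of `~/.config/tokker/`.

```python
import json
import tempfile
from pathlib import Path

# Assumed defaults, matching the example configuration above.
DEFAULTS = {"default_tokenizer": "cl100k_base", "delimiter": "⎮"}


def load_config(path):
    """Return the stored settings, falling back to DEFAULTS for missing keys."""
    path = Path(path)
    if not path.exists():
        return dict(DEFAULTS)
    with path.open(encoding="utf-8") as fh:
        stored = json.load(fh)
    return {**DEFAULTS, **stored}


def save_config(path, config):
    """Write the settings as JSON, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps the delimiter readable in the file.
        json.dump(config, fh, ensure_ascii=False, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "tokker" / "tokenizer_config.json"
    config = load_config(demo_path)  # file does not exist yet: defaults
    config["default_tokenizer"] = "o200k_base"
    save_config(demo_path, config)
    reloaded = load_config(demo_path)

print(reloaded)  # {'default_tokenizer': 'o200k_base', 'delimiter': '⎮'}
```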
Tokenizers
cl100k_base
- Used by: GPT-4, GPT-3.5-turbo
- Description: OpenAI's standard tokenizer for GPT-4 models
- Vocabulary size: ~100,000 tokens
o200k_base
- Used by: GPT-4o, GPT-4o-mini
- Description: Newer tokenizer with improved efficiency
- Vocabulary size: ~200,000 tokens
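When only a model name is known, the "Used by" lists above can be turned into a small lookup. This is a hypothetical helper, not part of Tokker's documented interface: `tokenizer_for_model` and `MODEL_PREFIXES` are illustrative names, and matching by prefix is an assumption. tiktoken itself offers `tiktoken.encoding_for_model` for the same purpose.

```python
# Mapping taken from the "Used by" lists above. Order matters: the more
# specific prefix "gpt-4o" must be checked before "gpt-4".
MODEL_PREFIXES = [
    ("gpt-4o", "o200k_base"),  # also matches gpt-4o-mini
    ("gpt-4", "cl100k_base"),
    ("gpt-3.5-turbo", "cl100k_base"),
]


def tokenizer_for_model(model):
    """Return the tokenizer name for a model name (hypothetical helper)."""
    for prefix, tokenizer in MODEL_PREFIXES:
        if model.startswith(prefix):
            return tokenizer
    raise ValueError(f"no known tokenizer for model {model!r}")


print(tokenizer_for_model("gpt-4o-mini"))    # o200k_base
print(tokenizer_for_model("gpt-3.5-turbo"))  # cl100k_base
```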
Project Structure
tokker/
├── tokker/
│   ├── __main__.py      # Entry point for python -m tokker
│   └── cli/
│       ├── __init__.py
│       ├── config.py    # Configuration management
│       ├── tokenize.py  # Main CLI interface
│       └── utils.py     # Core tokenization utilities
├── tests/               # Test suite
├── README.md            # This file
├── LICENSE              # MIT License
├── pyproject.toml       # Project configuration
└── requirements.txt     # Dependencies
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
v0.1.0 (Initial Release)
- Basic tokenization functionality
- Support for cl100k_base and o200k_base tokenizers
- JSON, plain text, and summary output formats
- Configuration management
- Command-line interface
Acknowledgments
- OpenAI for the tiktoken library