
A CLI tool for token counting and analysis using OpenAI's tiktoken


Tokker CLI

An open-source, locally run command-line tool that tokenizes text using OpenAI’s tiktoken. Integration with Alfred / Aircast on macOS and support for the HuggingFace transformers library are planned for later releases.


Features

  • Token Counting: Accurate token count using OpenAI's tiktoken library
  • Multiple Tokenizers: Support for cl100k_base (GPT-4) and o200k_base (GPT-4o) tokenizers
  • Flexible Output: JSON, plain text, and summary output formats
  • Configuration: Persistent configuration for default tokenizer and delimiter settings
  • Text Analysis: Word count, character count, and token frequency analysis
  • Cross-platform: Works on Windows, macOS, and Linux

Setup

TBD


Usage

Get full output

$ tokker --text 'hello world'
{
  'converted': 'hello⎮ world',
  'token_strings': ['hello', ' world'],
  'token_ids': [15339, 1917],
  'token_count': 2,
  'word_count': 2,
  'char_count': 11,
  'pivot': {
    'hello': 1,
    ' world': 1
  },
  'tokenizer': 'cl100k_base'
}
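
As a rough illustration of how the fields in the full output relate to each other, the following sketch assembles the same dictionary from a list of token strings and token IDs. The function name `build_report` and the whitespace-based word count are assumptions made for this example, not Tokker's actual implementation.

```python
from collections import Counter

DELIMITER = "⎮"  # delimiter shown in the examples above


def build_report(text, token_strings, token_ids, tokenizer="cl100k_base"):
    """Assemble a report dict shaped like the full output shown above.

    Hypothetical helper: Tokker's real internals may differ.
    """
    return {
        "converted": DELIMITER.join(token_strings),
        "token_strings": list(token_strings),
        "token_ids": list(token_ids),
        "token_count": len(token_ids),
        "word_count": len(text.split()),  # assumption: whitespace-separated words
        "char_count": len(text),
        "pivot": dict(Counter(token_strings)),  # frequency of each token string
        "tokenizer": tokenizer,
    }


# With tiktoken installed, the inputs could be obtained roughly like this:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   token_ids = enc.encode(text)
#   token_strings = [enc.decode([t]) for t in token_ids]

report = build_report("hello world", ["hello", " world"], [15339, 1917])
print(report["converted"])
```

The summary format would then be the subset of this dictionary containing only the counts and the tokenizer name, and the plain format would be the `converted` string on its own.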

Get plain (delimited) text

$ tokker --text 'hello world' --format plain
hello⎮ world

Get summary

$ tokker --text 'hello world' --format summary
{
  'token_count': 2,
  'word_count': 2,
  'char_count': 11,
  'tokenizer': 'cl100k_base'
}

Run specific (non-default) tokenizer

$ tokker --text 'hello world' --tokenizer o200k_base

Set default tokenizer

# Set default tokenizer
$ tokker --set-default-tokenizer o200k_base
✓ Default tokenizer set to: o200k_base
Configuration saved to: /home/user/.config/tokker/tokenizer_config.json
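
The flags used in the examples above could be wired up with `argparse` along the following lines. This is a minimal sketch: the flag names come from the examples, but `build_parser`, the name `json` for the default format, and the restriction of choices to the two listed tokenizers are assumptions for illustration.

```python
import argparse

TOKENIZERS = ["cl100k_base", "o200k_base"]
# "plain" and "summary" appear in the examples; "json" as the name of the
# default format is an assumption.
FORMATS = ["json", "plain", "summary"]


def build_parser():
    """Build a parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="tokker")
    parser.add_argument("--text", help="text to tokenize")
    parser.add_argument(
        "--tokenizer",
        choices=TOKENIZERS,
        default=None,  # None means: fall back to the configured default
        help="tokenizer to use for this run",
    )
    parser.add_argument("--format", choices=FORMATS, default="json")
    parser.add_argument(
        "--set-default-tokenizer",
        choices=TOKENIZERS,
        metavar="NAME",
        help="persist NAME as the default tokenizer",
    )
    return parser


args = build_parser().parse_args(["--text", "hello world", "--format", "plain"])
print(args.text, args.format, args.tokenizer)
```

Leaving `--tokenizer` as `None` when it is not given lets the caller distinguish an explicit choice from the persisted default.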

Configuration

Tokker stores configuration in ~/.config/tokker/tokenizer_config.json:

{
  "default_tokenizer": "cl100k_base",
  "delimiter": "⎮"
}
  • default_tokenizer: Default tokenizer to use (cl100k_base or o200k_base)
  • delimiter: Character used to separate tokens in plain text output
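
Reading and writing a configuration file like the one above can be sketched as follows. The helper names `load_config` and `save_config`, and the choice to fall back silently to defaults when the file is missing or malformed, are assumptions for this example rather than Tokker's documented behaviour.

```python
import json
from pathlib import Path

DEFAULTS = {"default_tokenizer": "cl100k_base", "delimiter": "⎮"}
CONFIG_PATH = Path.home() / ".config" / "tokker" / "tokenizer_config.json"


def load_config(path=CONFIG_PATH):
    """Return the defaults overlaid with whatever the file contains."""
    config = dict(DEFAULTS)
    try:
        config.update(json.loads(Path(path).read_text(encoding="utf-8")))
    except (FileNotFoundError, json.JSONDecodeError):
        pass  # assumption: a missing or malformed file means "use defaults"
    return config


def save_config(config, path=CONFIG_PATH):
    """Write the config as JSON, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps the delimiter character readable in the file
    path.write_text(
        json.dumps(config, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

Setting a new default would then amount to loading the config, changing `default_tokenizer`, and saving it back.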

Tokenizers

cl100k_base

  • Used by: GPT-4, GPT-3.5-turbo
  • Description: OpenAI's standard tokenizer for GPT-4 and GPT-3.5-turbo models
  • Vocabulary size: ~100,000 tokens

o200k_base

  • Used by: GPT-4o, GPT-4o-mini
  • Description: Newer tokenizer with improved efficiency
  • Vocabulary size: ~200,000 tokens
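
The model-to-tokenizer pairs listed above can be captured in a small lookup table, as in the sketch below. The names `TOKENIZER_FOR_MODEL` and `tokenizer_for` are hypothetical and not part of Tokker; tiktoken itself offers `tiktoken.encoding_for_model` for a more complete mapping.

```python
# Mapping taken from the "Used by" entries above; not an exhaustive list.
TOKENIZER_FOR_MODEL = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-4o": "o200k_base",
    "gpt-4o-mini": "o200k_base",
}


def tokenizer_for(model, default="cl100k_base"):
    """Return the tokenizer name for a model, or the default if unknown."""
    return TOKENIZER_FOR_MODEL.get(model.lower(), default)


print(tokenizer_for("GPT-4o"))
```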

Project Structure

tokker/
├── tokker/
│   ├── __main__.py          # Entry point for python -m tokker
│   └── cli/
│       ├── __init__.py
│       ├── config.py        # Configuration management
│       ├── tokenize.py      # Main CLI interface
│       └── utils.py         # Core tokenization utilities
├── tests/                   # Test suite
├── README.md               # This file
├── LICENSE                 # MIT License
├── pyproject.toml          # Project configuration
└── requirements.txt        # Dependencies

License

This project is licensed under the MIT License - see the LICENSE file for details.


Changelog

v0.1.0 (Initial Release)

  • Basic tokenization functionality
  • Support for cl100k_base and o200k_base tokenizers
  • JSON, plain text, and summary output formats
  • Configuration management
  • Command-line interface

Acknowledgments

  • OpenAI for the tiktoken library
