Tokker

A fast, simple CLI tool for tokenizing text using OpenAI's tiktoken library and HuggingFace transformers. Get accurate token counts for GPT models, LLaMA, BERT and more with a single command.


Features

  • Simple Usage: Just tok 'your text' - that's it!
  • 26 Tokenizers: A curated set from OpenAI's tiktoken (tt) and HuggingFace transformers (hf) libraries, covering GPT, DeepSeek, LLaMA, Qwen, BERT and others, all in one place.
  • Flexible Output: JSON, plain text, and summary output formats
  • Configuration: Persistent configuration for default tokenizer and delimiter
  • Text Analysis: Token count, word count, character count, and token frequency
  • Cross-platform: Works on Windows, macOS, and Linux
  • 100% local: Works fully locally on device after installation

Installation

Install from PyPI with pip:

pip install tokker

That's it! The tok command is now available in your terminal.


Command Reference

usage: tok [-h] [--tokenizer TOKENIZER] [--format {json,plain,summary}]
           [--tokenizer-default TOKENIZER] [--tokenizer-list]
           [text]

positional arguments:
  text                  Text to tokenize (or read from stdin if not provided)

options:
  -h, --help           Show this help message and exit
  --tokenizer TOKENIZER
                       Tokenizer to use (overrides default). Use --tokenizer-list to see available options
  --format {json,plain,summary}
                       Output format (default: json)
  --tokenizer-default TOKENIZER
                       Set the default tokenizer in configuration. Use --tokenizer-list to see available options
  --tokenizer-list     List all available tokenizers with descriptions

Usage

Tip: In bash or zsh, wrap input text in single quotes ('like this'). Inside double quotes the shell still interprets some characters, such as ! (history expansion) and $ (variable expansion), so the text tok receives may differ from what you typed.
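
The same quoting concern applies when a script assembles the tok command line instead of a person typing it. The following is a minimal sketch using Python's standard-library shlex module; the helper name build_tok_command_line is hypothetical and not part of tokker.

```python
import shlex


def build_tok_command_line(text: str) -> str:
    """Return a POSIX shell command line that passes `text` to tok unchanged.

    Hypothetical helper, not part of tokker. shlex.quote wraps the text in
    single quotes whenever it contains characters the shell would interpret.
    """
    return "tok " + shlex.quote(text)


# The "!" and "$" stay literal because the text ends up single-quoted.
print(build_tok_command_line("Hello world!"))
print(build_tok_command_line("Price: $5"))
```

If the command is run through subprocess with an argument list rather than a shell string, no quoting is needed at all, which is usually the simpler choice.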

Tokenize

# Tokenize with default tokenizer
tok 'Hello world'

# Get a specific output format
tok 'Hello world' --format plain

# Use a specific tokenizer
tok 'Hello world' --tokenizer gpt2

# Pipe text from other commands
echo "Hello world" | tok
cat file.txt | tok --format summary

Tokenize (Pipeline)

# Process files
cat document.txt | tok --tokenizer gpt2 --format summary

# Chain with other tools
curl -s https://example.com | tok --tokenizer bert-base-uncased

# Compare tokenizers
echo "Machine learning is awesome" | tok --tokenizer gpt2
echo "Machine learning is awesome" | tok --tokenizer bert-base-uncased
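
The pipeline examples above can also be driven from Python. This is a sketch under stated assumptions: tok is installed and on PATH, and the chosen format prints JSON (json and summary do in the examples in this document, plain does not). The function names build_tok_argv and run_tok are hypothetical; only the --tokenizer and --format flags come from the command reference.

```python
import json
import subprocess


def build_tok_argv(tokenizer=None, output_format="summary"):
    """Build the argument list for tok from flags in the command reference."""
    argv = ["tok", "--format", output_format]
    if tokenizer is not None:
        argv += ["--tokenizer", tokenizer]
    return argv


def run_tok(text, tokenizer=None, output_format="summary"):
    """Pipe `text` to tok on stdin and parse the JSON it prints.

    Raises subprocess.CalledProcessError if tok exits with a non-zero status.
    """
    completed = subprocess.run(
        build_tok_argv(tokenizer, output_format),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


# Example (requires tok to be installed):
# for name in ("gpt2", "bert-base-uncased"):
#     print(name, run_tok("Machine learning is awesome", name)["token_count"])
```

Passing the text on stdin avoids shell quoting entirely, since no shell is involved.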

List Available Tokenizers

# See all available tokenizers
tok --tokenizer-list

Output:

DeepSeek Family:
================
  deepseek-ai/DeepSeek-Coder-V2-Base    (hf) — BPE, used by DeepSeek-Coder-V2
  deepseek-ai/DeepSeek-V2               (hf) — BPE, used by DeepSeek-V2

GPT Family:
===========
  cl100k_base                           (tt) — BPE, used by GPT-3.5, GPT-4
  gpt2                                  (hf) — BPE, used by GPT-2
  o200k_base                            (tt) — BPE, used by GPT-4o, o-family (o1, o3, o4)
  p50k_base                             (tt) — BPE, used by GPT-3.5
  p50k_edit                             (tt) — BPE, used by GPT-3 edit models for text and code
  r50k_base                             (tt) — BPE, used by GPT-3 base models

LLaMA Family:
=============
  meta-llama/Llama-2-70b-hf             (hf) — BPE, used by LLaMA-2
  meta-llama/Meta-Llama-3-70B           (hf) — BPE, used by LLaMA-3
  meta-llama/Meta-Llama-3.1-405B        (hf) — BPE, used by LLaMA-3.1

Qwen Family:
============
  Qwen/Qwen-72B                         (hf) — BPE, used by Qwen
  Qwen/Qwen1.5-110B                     (hf) — BPE, used by Qwen1.5
  Qwen/Qwen2-72B                        (hf) — BPE, used by Qwen2
  Qwen/Qwen2.5-72B                      (hf) — BPE, used by Qwen2.5

Other:
======
  allenai/longformer-base-4096          (hf) — BPE, used by Longformer
  bert-base-cased                       (hf) — WordPiece, used by BERT
  bert-base-uncased                     (hf) — WordPiece, used by BERT
  distilbert-base-cased                 (hf) — WordPiece, used by DistilBERT
  distilbert-base-uncased               (hf) — WordPiece, used by DistilBERT
  facebook/bart-base                    (hf) — BPE, used by BART
  google/electra-base-discriminator     (hf) — WordPiece, used by ELECTRA
  microsoft/deberta-base                (hf) — SentencePiece, used by DeBERTa
  roberta-base                          (hf) — BPE, used by RoBERTa
  t5-base                               (hf) — SentencePiece, used by T5
  xlnet-base-cased                      (hf) — SentencePiece, used by XLNet

Set Default Tokenizer

# Set your preferred tokenizer
tok --tokenizer-default o200k_base

Output:

✓ Default tokenizer set to: o200k_base (tt) — BPE, used by GPT-4o, o-family (o1, o3, o4)
Configuration saved to: ~/.config/tokker/tokenizer_config.json

Output Formats

Full JSON Output (Default)

$ tok 'Hello world'
{
  "converted": "Hello⎮ world",
  "token_strings": ["Hello", " world"],
  "token_ids": [24912, 2375],
  "token_count": 2,
  "word_count": 2,
  "char_count": 11,
  "pivot": {
    "Hello": 1,
    " world": 1
  },
  "tokenizer": "o200k_base",
  "library": "tt"
}
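
To make the relationship between the fields concrete, here is a small sketch that checks the sample above for internal consistency. It assumes, based only on this one sample, that "converted" is the token strings joined by the delimiter and that "pivot" holds the frequency of each token string; tokker's documentation does not state this explicitly here, and check_result is a hypothetical helper.

```python
import json
from collections import Counter

# Sample copied from the JSON output shown above.
sample = json.loads("""
{
  "converted": "Hello⎮ world",
  "token_strings": ["Hello", " world"],
  "token_ids": [24912, 2375],
  "token_count": 2,
  "word_count": 2,
  "char_count": 11,
  "pivot": {"Hello": 1, " world": 1},
  "tokenizer": "o200k_base",
  "library": "tt"
}
""")


def check_result(result, delimiter="⎮"):
    """Return a list of inconsistencies in a tok JSON result (empty if none)."""
    problems = []
    tokens = result["token_strings"]
    if len(tokens) != result["token_count"]:
        problems.append("token_count does not match token_strings")
    if len(result["token_ids"]) != result["token_count"]:
        problems.append("token_count does not match token_ids")
    # Assumption: "converted" is the token strings joined by the delimiter.
    if delimiter.join(tokens) != result["converted"]:
        problems.append("converted is not the delimiter-joined token strings")
    # Assumption: "pivot" maps each token string to how often it occurs.
    if dict(Counter(tokens)) != result["pivot"]:
        problems.append("pivot is not the token frequency table")
    return problems


print(check_result(sample))  # prints [] for the sample above
```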

Plain Text Output

$ tok 'Hello world' --format plain
Hello⎮ world

Summary Output

$ tok 'Hello world' --format summary
{
  "token_count": 2,
  "word_count": 2,
  "char_count": 11,
  "tokenizer": "o200k_base",
  "library": "tt"
}
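
The summary fields are enough to derive simple density figures. The sketch below uses the sample summary above; the density helper and the two ratio names are illustrative choices, not something tokker outputs.

```python
import json

# Sample copied from the summary output shown above.
summary = json.loads(
    '{"token_count": 2, "word_count": 2, "char_count": 11,'
    ' "tokenizer": "o200k_base", "library": "tt"}'
)


def density(result):
    """Derive tokens-per-word and characters-per-token from a summary dict."""
    tokens = result["token_count"]
    words = result["word_count"]
    return {
        # Guard against empty input, where either count may be zero.
        "tokens_per_word": tokens / words if words else 0.0,
        "chars_per_token": result["char_count"] / tokens if tokens else 0.0,
    }


print(density(summary))  # {'tokens_per_word': 1.0, 'chars_per_token': 5.5}
```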

Tokenizer List JSON

# Get tokenizer list as JSON
tok --tokenizer-list --format json

# Extract the token count from summary output (requires jq, installed separately)
tok 'Hello world' --format summary | jq '.token_count'

Configuration

Tokker stores your preferences in ~/.config/tokker/tokenizer_config.json:

{
  "default_tokenizer": "o200k_base",
  "delimiter": "⎮"
}
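
Other scripts can read the same file, for example to find out which tokenizer tok will use by default. This is a minimal sketch, not tokker's own loader: the load_config helper is hypothetical, the fallback values are copied from the example file above and may not match tokker's real defaults, and the path is the one quoted above (it may differ on Windows).

```python
import json
from pathlib import Path

# Path as documented above; this is an assumption on non-Unix platforms.
CONFIG_PATH = Path.home() / ".config" / "tokker" / "tokenizer_config.json"

# Fallback values copied from the example file, not from tokker's source.
DEFAULTS = {"default_tokenizer": "o200k_base", "delimiter": "⎮"}


def load_config(path=CONFIG_PATH):
    """Read the config file, falling back to DEFAULTS for a missing file or keys."""
    config = dict(DEFAULTS)
    try:
        with open(path, encoding="utf-8") as handle:
            loaded = json.load(handle)
    except FileNotFoundError:
        return config
    if isinstance(loaded, dict):
        config.update(loaded)
    return config
```

Calling load_config()["default_tokenizer"] then returns the configured tokenizer name, or the fallback if no configuration has been saved yet.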

Programmatic Usage

You can also use tokker in your Python code:

import tokker

# Count tokens
count = tokker.count_tokens("Hello world", "o200k_base")
print(f"Token count: {count}")

# Full tokenization
result = tokker.tokenize("Hello world", "gpt2")
print(result["token_count"])

# List available tokenizers
tokenizers = tokker.list_tokenizers()
for tokenizer in tokenizers:
    print(f"{tokenizer['name']} ({tokenizer['library']}) — {tokenizer['description']}")
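
Building on count_tokens, a small helper can compare how many tokens several tokenizers produce for the same text. This is a sketch: compare_tokenizers is a hypothetical name, it assumes tokker.count_tokens(text, tokenizer_name) returns an integer as the example above suggests, and the count_fn parameter exists only so the helper can be exercised without tokker installed.

```python
def compare_tokenizers(text, tokenizer_names, count_fn=None):
    """Return {tokenizer_name: token_count}, ordered by ascending count.

    count_fn defaults to tokker.count_tokens; any callable taking
    (text, tokenizer_name) and returning an int can be substituted.
    """
    if count_fn is None:
        import tokker  # imported lazily so the helper can be defined without it

        count_fn = tokker.count_tokens
    counts = {name: count_fn(text, name) for name in tokenizer_names}
    # sorted() is stable, so tokenizers with equal counts keep their input order.
    return dict(sorted(counts.items(), key=lambda item: item[1]))


# Example (requires tokker to be installed):
# print(compare_tokenizers("Machine learning is awesome", ["gpt2", "o200k_base"]))
```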

License

This project is licensed under the MIT License - see the LICENSE file for details.


Contributing

Issues and pull requests are welcome! Visit the GitHub repository.


Acknowledgments

  • OpenAI for the tiktoken library
  • HuggingFace for the transformers library
