A super simple CLI for text tokenization

🧩 Tokenize CLI

An easy-to-use command-line tool for tokenizing text files or folders using 🤗 Hugging Face Transformers. Ideal for quick inspection of token counts, model comparisons, or debugging chat templates.


🚀 Features

  • 🔤 Tokenize any text file with your favorite Hugging Face tokenizer
  • 🧮 Count special, non-special, and total tokens
  • 💬 Apply chat templates to structured JSON conversation files
  • ⚙️ Persist your tokenizer configuration in a simple local JSON file
  • ⚡ Uses fire for a fast CLI and Hugging Face transformers under the hood
  • 🪶 Minimal setup — zero boilerplate

📦 Installation

Install Tokenize CLI with uv, a fast Python package manager.

🔧 Quick Install

uv pip install tokenize-cli

🧩 Install from Local Directory

If you have cloned the repository locally:

uv pip install .

🌀 Install Directly from GitHub

uv pip install git+https://github.com/ZoneTwelve/tokenize-cli.git

🧑‍💻 Development Install

If you’re working on the project locally (editable mode):

uv pip install -e .

🧠 Usage

1️⃣ Configure Your Tokenizer

Before tokenizing any files, set your preferred model:

tokenize-cli --model google/gemma-3-270m-it

Inspect your current configuration:

tokenize-cli

Example output:

📄 Current config: {'tokenizer_model': 'google/gemma-3-270m-it'}
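The persistence step above can be pictured as a small read/update/write cycle over a JSON file. The following is a minimal sketch, not the project's actual implementation: the file name `tokenize_cli_config.json` and the helper names `load_config` and `save_config` are assumptions made for illustration. Only the `tokenizer_model` key comes from the example output.

```python
import json
from pathlib import Path

# Hypothetical location; the real tool may store its config elsewhere.
DEFAULT_CONFIG_PATH = Path("tokenize_cli_config.json")


def load_config(path: Path = DEFAULT_CONFIG_PATH) -> dict:
    """Return the saved config, or an empty dict if no file exists yet."""
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def save_config(model: str, path: Path = DEFAULT_CONFIG_PATH) -> dict:
    """Store the tokenizer model name and return the updated config."""
    config = load_config(path)
    config["tokenizer_model"] = model  # key name taken from the example output
    with path.open("w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config
```

Reading the existing file before writing keeps any other keys intact if the config grows beyond a single setting.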

2️⃣ Tokenize a Text File

Use the tokenize command to analyze a file:

tokenize file path/to/file.txt

Example output:

🔤 Using tokenizer: google/gemma-3-270m-it
📄 Tokenizing file: example.txt
✅ Tokenization complete.
🧩 Special tokens    : 3
🔡 Non-special tokens: 197
🔢 Total tokens      : 200
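The three counts in this output can be derived from a list of token IDs and the tokenizer's set of special token IDs. Here is a rough sketch of that idea, assuming the tool works along these lines; `count_tokens` and `count_file_tokens` are hypothetical helper names, not functions exported by this package. The `transformers` calls used (`AutoTokenizer.from_pretrained`, `encode`, `all_special_ids`) are standard Hugging Face tokenizer APIs.

```python
def count_tokens(token_ids, special_ids):
    """Split a token ID sequence into special and non-special counts."""
    token_ids = list(token_ids)
    special_set = set(special_ids)
    special = sum(1 for t in token_ids if t in special_set)
    total = len(token_ids)
    return {"special": special, "non_special": total - special, "total": total}


def count_file_tokens(path, model_name):
    """Tokenize a text file with a Hugging Face tokenizer and count its tokens.

    Hypothetical helper: downloads the tokenizer on first use.
    """
    # Imported lazily so count_tokens stays usable without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # encode() adds the model's special tokens (such as BOS) by default.
    token_ids = tokenizer.encode(text)
    return count_tokens(token_ids, tokenizer.all_special_ids)
```

Calling `count_file_tokens("example.txt", "google/gemma-3-270m-it")` would return a dictionary with the same three figures the CLI prints, though the exact numbers depend on the file and the tokenizer.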

3️⃣ Apply Chat Templates

For chat-style data (like messages.json):

tokenize chat examples/messages.json

To tokenize the resulting text and count total tokens:

tokenize chat examples/messages.json --tokenize True

Optionally, save the chat-rendered text to a file:

tokenize chat examples/messages.json --save chat_output.txt
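Under the hood, rendering a conversation typically means loading the JSON message list and passing it to the tokenizer's `apply_chat_template` method. The sketch below illustrates that flow under stated assumptions: `load_messages` and `render_chat` are hypothetical names, and the validation rules are a guess at what a well-formed messages file needs rather than the tool's documented behavior.

```python
import json


def load_messages(path):
    """Load a chat file and check that each entry has 'role' and 'content'."""
    with open(path, encoding="utf-8") as f:
        messages = json.load(f)
    if not isinstance(messages, list):
        raise ValueError("expected a JSON list of messages")
    for i, message in enumerate(messages):
        if not isinstance(message, dict) or "role" not in message or "content" not in message:
            raise ValueError(f"message {i} needs 'role' and 'content' keys")
    return messages


def render_chat(tokenizer, messages, save=None):
    """Render messages with the tokenizer's chat template, optionally saving the text."""
    # tokenize=False asks for the rendered string instead of token IDs.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    if save is not None:
        with open(save, "w", encoding="utf-8") as f:
            f.write(text)
    return text
```

Any object with an `apply_chat_template(messages, tokenize=False)` method works here, including a tokenizer loaded with `transformers.AutoTokenizer.from_pretrained`, provided the model ships a chat template.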

🧰 Command Reference

Command        Description
tokenize       Tokenize text files and count tokens
tokenize-cli   Configure or inspect tokenizer settings

Example commands:

# Configure model
tokenize-cli --model google/gemma-3-270m-it

# Tokenize plain text file
tokenize file my_text.txt

# Apply chat template
tokenize chat examples/messages.json

# Tokenize chat template output
tokenize chat examples/messages.json --tokenize True

🧩 Example Data

Example chat file: examples/messages.json

[
  {"role": "system", "content": "You are a tokenizer."},
  {"role": "user", "content": "This is an easy to use tokenize CLI"},
  {"role": "assistant", "content": "You are 100% correct, this is an easy to use tokenize CLI and I hope you like it."}
]

🧑‍💻 Development

Clone and install with uv:

git clone https://github.com/ZoneTwelve/tokenize-cli.git
cd tokenize-cli
uv pip install -e .

Run locally:

python src/main.py file examples/messages.json

Or configure the tokenizer:

python src/cli.py --model Qwen/Qwen3-0.6B

⚖️ License

This project is licensed under the GNU Affero General Public License v3.0. See the LICENSE file for details.

