A super simple CLI for text tokenization
🧩 Tokenize CLI
An easy-to-use command-line tool for tokenizing text files or folders using 🤗 Hugging Face Transformers. Ideal for quick inspection of token counts, model comparisons, or debugging chat templates.
🚀 Features
- 🔤 Tokenize any text file with your favorite Hugging Face tokenizer
- 🧮 Count special, non-special, and total tokens
- 💬 Apply chat templates to structured JSON conversation files
- ⚙️ Persist your tokenizer configuration in a simple local JSON file
- ⚡ Uses `fire` for a fast CLI and Hugging Face `transformers` under the hood
- 🪶 Minimal setup, zero boilerplate
📦 Installation
You can install Tokenize CLI using uv, a modern, lightning-fast Python package manager.
🔧 Quick Install
uv pip install tokenize-cli
🧩 Install from Local Directory
If you have cloned the repository locally:
uv pip install .
🌀 Install Directly from GitHub
uv pip install git+https://github.com/ZoneTwelve/tokenize-cli.git
🧑‍💻 Development Install
If you’re working on the project locally (editable mode):
uv pip install -e .
🧠 Usage
1️⃣ Configure Your Tokenizer
Before tokenizing any files, set your preferred model:
tokenize-cli --model google/gemma-3-270m-it
Inspect your current configuration:
tokenize-cli
Example output:
📄 Current config: {'tokenizer_model': 'google/gemma-3-270m-it'}
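The persisted configuration can be pictured as a small JSON read/write cycle. The sketch below is only an illustration of that idea, not the project's actual code: the helper names `load_config` and `set_model` and the caller-supplied config path are assumptions, and only the `tokenizer_model` key comes from the example output above.

```python
import json
from pathlib import Path


def load_config(path: Path) -> dict:
    """Return the stored config, or an empty dict if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def set_model(path: Path, model: str) -> dict:
    """Store the tokenizer model name under the `tokenizer_model` key."""
    config = load_config(path)
    config["tokenizer_model"] = model
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling `set_model(Path("config.json"), "google/gemma-3-270m-it")` and then `load_config(Path("config.json"))` would return a dict of the same shape as the one printed above. The file name `config.json` is hypothetical.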
2️⃣ Tokenize a Text File
Use the tokenize command to analyze a file:
tokenize file path/to/file.txt
Example output:
🔤 Using tokenizer: google/gemma-3-270m-it
📄 Tokenizing file: example.txt
✅ Tokenization complete.
🧩 Special tokens : 3
🔡 Non-special tokens: 197
🔢 Total tokens : 200
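One plausible way to arrive at these three numbers is to tokenize the text and check each token id against the tokenizer's set of special ids. The following is a minimal sketch under that assumption; `count_tokens` and `count_file_tokens` are hypothetical helpers, not functions exported by this package, and the tool itself may count differently.

```python
from collections.abc import Iterable


def count_tokens(input_ids: Iterable[int], special_ids: Iterable[int]) -> dict:
    """Split a sequence of token ids into special and non-special counts."""
    special = set(special_ids)
    ids = list(input_ids)
    n_special = sum(1 for token_id in ids if token_id in special)
    return {
        "special": n_special,
        "non_special": len(ids) - n_special,
        "total": len(ids),
    }


def count_file_tokens(path: str, model_name: str) -> dict:
    """Hypothetical helper: tokenize a text file with a Hugging Face tokenizer."""
    # Imported lazily so count_tokens stays usable without `transformers`.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    input_ids = tokenizer(text)["input_ids"]
    return count_tokens(input_ids, tokenizer.all_special_ids)
```

Keeping the counting logic separate from model loading means it can be checked with plain integer lists, without downloading a tokenizer.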
3️⃣ Apply Chat Templates
For chat-style data (like messages.json):
tokenize chat examples/messages.json
To tokenize the resulting text and count total tokens:
tokenize chat examples/messages.json --tokenize True
Optionally, save the chat-rendered text to a file:
tokenize chat examples/messages.json --save chat_output.txt
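For readers who want to reproduce the chat rendering in Python, the sketch below shows roughly what applying a chat template involves, using the `apply_chat_template` method of Hugging Face tokenizers. The helpers `load_messages` and `render_chat` are assumptions made for illustration and are not taken from this project's source.

```python
import json


def load_messages(path: str) -> list:
    """Load a conversation file and check that it is a list of role/content dicts."""
    with open(path, encoding="utf-8") as fh:
        messages = json.load(fh)
    if not isinstance(messages, list):
        raise ValueError("expected a JSON list of messages")
    for message in messages:
        if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
            raise ValueError("each message needs 'role' and 'content' keys")
    return messages


def render_chat(messages: list, model_name: str, tokenize: bool = False):
    """Hypothetical helper: render messages with the model's chat template.

    Returns the rendered text when tokenize is False, token ids otherwise.
    """
    # Imported lazily so load_messages works without `transformers` installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer.apply_chat_template(messages, tokenize=tokenize)
```

With `tokenize=False` the result is the templated string, which could then be written to a file in the spirit of the `--save` option.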
🧰 Command Reference
| Command | Description |
|---|---|
| `tokenize` | Tokenize text files and count tokens |
| `tokenize-cli` | Configure or inspect tokenizer settings |
Example commands:
# Configure model
tokenize-cli --model google/gemma-3-270m-it
# Tokenize plain text file
tokenize file my_text.txt
# Apply chat template
tokenize chat examples/messages.json
# Tokenize chat template output
tokenize chat examples/messages.json --tokenize True
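Since the tool is built on `fire`, subcommands such as `file` and `chat` and flags such as `--tokenize True` most likely map onto Python methods and keyword arguments. The sketch below illustrates that mapping only; the class `TokenizeCommands`, its method bodies, and `main` are hypothetical and do not reflect the project's real implementation.

```python
from typing import Optional


class TokenizeCommands:
    """Hypothetical command group: fire exposes each public method as a subcommand."""

    def file(self, path: str) -> str:
        # `tokenize file my_text.txt` would call this with path="my_text.txt".
        return f"would tokenize file: {path}"

    def chat(self, path: str, tokenize: bool = False, save: Optional[str] = None) -> str:
        # `--tokenize True` and `--save chat_output.txt` arrive as keyword arguments.
        action = "render and tokenize" if tokenize else "render"
        target = f", saving to {save}" if save else ""
        return f"would {action} chat file: {path}{target}"


def main() -> None:
    # Imported lazily; requires the `fire` package to be installed.
    import fire

    fire.Fire(TokenizeCommands)
```

Invoking `main()` from a console script entry point would let `fire` parse the command line and print whatever the selected method returns.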
🧩 Example Data
Example chat file: examples/messages.json
[
{"role": "system", "content": "You are a tokenizer."},
{"role": "user", "content": "This is an easy to use tokenize CLI"},
{"role": "assistant", "content": "You are 100% correct, this is an easy to use tokenize CLI and I hope you like it."}
]
🧑‍💻 Development
Clone and install with uv:
git clone https://github.com/ZoneTwelve/tokenize-cli.git
cd tokenize-cli
uv pip install -e .
Run locally:
python src/main.py file examples/messages.json
Or configure the tokenizer:
python src/cli.py --model Qwen/Qwen3-0.6B
⚖️ License
This project is licensed under the GNU Affero General Public License v3.0. See the LICENSE file for details.
File details
Details for the file tokenize_cli-0.1.1.tar.gz.
File metadata
- Download URL: tokenize_cli-0.1.1.tar.gz
- Upload date:
- Size: 14.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `9ad4266953306133037e1019ab4ebe937ba88d3b21f01372b53cab1a57fc30b1` |
| MD5 | `21435c7d29b83489e5d13bd4ecf1efb6` |
| BLAKE2b-256 | `22ac3d985986b2293ce4337940791fbdf6908890d7c3a1e350d5dfb036a14fc5` |
File details
Details for the file tokenize_cli-0.1.1-py3-none-any.whl.
File metadata
- Download URL: tokenize_cli-0.1.1-py3-none-any.whl
- Upload date:
- Size: 14.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `58509455b196449731f503796669a6165606ae747d6a787d781a25c372d993ec` |
| MD5 | `963fadb2d899f40cc8f6ead86d3ccbc7` |
| BLAKE2b-256 | `ccef29b6dcf763b153512aa7feff1c6464a8f84a8008a618adaa461e10a13f75` |