LLMVision
See the world through the eyes of language models
LLMVision is a Python library for visualizing how Large Language Models (LLMs) tokenize text. It reveals the hidden world of tokens that LLMs actually see, including the often surprising ways they split words, handle Unicode, and represent emojis.
Why LLMVision?
Ever wondered why your LLM usage costs more for emoji-heavy text? Or why certain prompts seem to work better than others? LLMVision helps you understand by showing exactly how text is tokenized.
from llmvision import tokenize_and_visualize, GPT4Tokenizer
text = "Hello world! 👋🌍"
print(tokenize_and_visualize(text, GPT4Tokenizer()))
# Output: Hello│ world│!│<bytes:20f09f>│<bytes:91>│<bytes:8b>│<bytes:f09f>│<bytes:8c>│<bytes:8d>
That friendly wave emoji? It's actually 3 tokens! The Earth emoji? Another 3 tokens. That's why emoji-rich text can be expensive.
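Those counts come straight from UTF-8: each emoji is a single user-perceived character but several bytes, and byte-level tokenizers work on the bytes. You can check this yourself with nothing but the standard library (a sketch independent of LLMVision's API):

```python
# Each emoji is one code point but several UTF-8 bytes, so a
# byte-level tokenizer needs multiple tokens to represent it.
for ch in ["a", "世", "👋", "🌍"]:
    raw = ch.encode("utf-8")
    print(f"{ch!r}: {len(raw)} bytes -> {raw.hex()}")
# 'a' is 1 byte, '世' is 3 bytes, each emoji is 4 bytes
```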
Features
- 🔍 Multiple Tokenizers: GPT-2, GPT-4, byte-level, character-level, and more
- 👁️ Visual Representation: See token boundaries with clear separators
- 📊 Token Statistics: Analyze token counts, categories, and patterns
- 🌍 Unicode Handling: Properly handles emojis, multi-language text, and special characters
- 🎯 LLM-Faithful: Shows actual tokenization used by real language models
Installation
pip install llmvision
Quick Start
Command Line
# Basic usage
llmvision "Hello world!"
# Show all tokenizers
llmvision "Hello world!" --all
# Show token indices
llmvision "Hello world!" --indices
# Show statistics
llmvision "Hello world!" --stats
Python API
from llmvision import tokenize_and_visualize, GPT4Tokenizer, SimpleTokenizer
# Quick visualization
text = "The tokenization process is fascinating!"
print(tokenize_and_visualize(text))
# Use specific tokenizer
gpt4 = GPT4Tokenizer()
print(tokenize_and_visualize(text, gpt4))
# Show token indices
print(tokenize_and_visualize(text, gpt4, show_indices=True))
Examples
See How LLMs Really See Text
from llmvision import show_tokenizer_comparison
# Compare different tokenizer views
show_tokenizer_comparison("Hello 世界! 🌍")
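To get a rough feel for how tokenizer granularity changes counts without the library, you can compare word-, character-, and byte-level views with the standard library alone (these three views are illustrative and not LLMVision's actual output):

```python
text = "Hello 世界! 🌍"

views = {
    "words": text.split(),                                 # whitespace split
    "chars": list(text),                                   # one token per code point
    "bytes": [f"{b:02x}" for b in text.encode("utf-8")],   # raw UTF-8 bytes
}
for name, tokens in views.items():
    print(f"{name:5}: {len(tokens):2} tokens")
# words:  3, chars: 11, bytes: 18
```

The same string costs six times more at the byte level than at the word level, which is the gap real subword tokenizers land somewhere inside.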
Understand Token Boundaries
from llmvision import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
tokens = tokenizer.tokenize("Hello world!")
print(tokens)
# ['Hello', ' world', '!']
# Note: the leading space is part of the ' world' token!
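That leading-space behavior comes from GPT-style pre-tokenization, which groups a space with the word that follows it before any byte-pair merges happen. A much-simplified version of that splitting rule (not the real cl100k_base regex) looks like:

```python
import re

# Simplified GPT-style pre-tokenizer: an optional leading space
# attaches to the following word, so " world" stays one unit.
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+|[^\sA-Za-z0-9]+|\s+")

print(PRETOKEN.findall("Hello world!"))
# ['Hello', ' world', '!']
```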
Analyze Token Usage
from llmvision import analyze_tokens, SimpleTokenizer
text = "The quick brown fox jumps over the lazy dog."
tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize(text)
stats = analyze_tokens(tokens)
print(f"Total tokens: {stats['total_tokens']}")
print(f"Unique tokens: {stats['unique_tokens']}")
print(f"Average token length: {stats['avg_token_length']:.2f}")
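If you only need the three statistics used above, they are straightforward to compute yourself; this hypothetical re-implementation (the keys match the example, but the real analyze_tokens may return more) uses only the standard library:

```python
from collections import Counter

def token_stats(tokens):
    # Hypothetical stand-in for analyze_tokens: count, distinct
    # count, and mean length of the token strings.
    counts = Counter(tokens)
    return {
        "total_tokens": len(tokens),
        "unique_tokens": len(counts),
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
    }

stats = token_stats("the quick brown fox jumps over the lazy dog".split())
print(stats)  # 9 total, 8 unique ('the' repeats)
```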
Available Tokenizers
- SimpleTokenizer: Basic word/punctuation/space tokenization
- WordTokenizer: Whitespace-based tokenization
- CharTokenizer: Character-level tokenization
- GraphemeTokenizer: Unicode grapheme clusters (user-perceived characters)
- ByteLevelTokenizer: Raw UTF-8 byte representation
- GPT2Tokenizer: Actual GPT-2 tokenization (via tiktoken)
- GPT4Tokenizer: Actual GPT-4 tokenization (via tiktoken)
- SubwordTokenizer: Simple subword tokenization
- LLMStyleTokenizer: Simulates common LLM tokenization patterns
- SentencePieceStyleTokenizer: Simulates SentencePiece-style tokenization
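The byte-level view in particular is easy to reproduce without the library. A minimal sketch that mimics the <bytes:..> rendering shown earlier (printable ASCII kept literal, everything else hex-escaped; real BPE tokenizers additionally merge frequent byte sequences, which this does not):

```python
def byte_tokens(text):
    # One token per UTF-8 byte: printable ASCII stays readable,
    # everything else is rendered as <bytes:XX>.
    return [
        chr(b) if 32 <= b < 127 else f"<bytes:{b:02x}>"
        for b in text.encode("utf-8")
    ]

print("│".join(byte_tokens("Hi 🌍")))
# H│i│ │<bytes:f0>│<bytes:9f>│<bytes:8c>│<bytes:8d>
```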
Understanding Token Costs
Different tokenizers can result in vastly different token counts:
from llmvision import GPT4Tokenizer, tokenize_and_visualize
examples = [
    "Hello world!",   # 3 tokens
    "Hello 世界!",     # 5 tokens (Chinese costs more!)
    "Hello 👋🌍!",     # 8 tokens (emojis are expensive!)
    "👨‍👩‍👧‍👦",            # 18 tokens (!!)
]
tokenizer = GPT4Tokenizer()
for text in examples:
    tokens = tokenizer.tokenize(text)
    print(f"{text:20} → {len(tokens)} tokens")
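A quick cost proxy that needs no tokenizer at all is the UTF-8 byte length: a byte-level tokenizer can never emit more tokens than there are bytes, and text that balloons in bytes tends to balloon in tokens too (a rule of thumb, not an exact count):

```python
for text in ["Hello world!", "Hello 世界!", "Hello 👋🌍!"]:
    n_chars = len(text)                  # code points
    n_bytes = len(text.encode("utf-8"))  # upper bound on byte-level tokens
    print(f"{text:15} {n_chars:2} chars, {n_bytes:2} bytes")
```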
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.
Acknowledgments
Built with tiktoken for accurate GPT tokenization.
Download files
File details
Details for the file llmvision-0.1.0.tar.gz.
File metadata
- Download URL: llmvision-0.1.0.tar.gz
- Upload date:
- Size: 56.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 14cb0b72f498755c8e7674a656ab0308e4cc118b0624862da8424180f6ece95b |
| MD5 | a90e458a7a8cae5f31dd60176fcac128 |
| BLAKE2b-256 | 7ce58eadf7ed3a0da7d013cb02b3943e460233bdda52925a39cce360adbad319 |
File details
Details for the file llmvision-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llmvision-0.1.0-py3-none-any.whl
- Upload date:
- Size: 11.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c19997f574f60c12ef6f0aeb21489d9b7ac261da1ccbdd614691fbdbbeb7fd00 |
| MD5 | 8c20eaa5d5d15d62c12b80917370ad89 |
| BLAKE2b-256 | d718ad31441cc64491b1e7c33895d50badc9fd6180f6f2866ec1ad12f79c3a4e |