
LLMVision

See the world through the eyes of language models

LLMVision is a Python library for visualizing how Large Language Models (LLMs) tokenize text. It reveals the hidden world of tokens that LLMs actually see, including the often surprising ways they split words, handle Unicode, and represent emojis.

Why LLMVision?

Ever wondered why your LLM usage costs more for emoji-heavy text? Or why certain prompts seem to work better than others? LLMVision helps you understand by showing exactly how text is tokenized.

from llmvision import tokenize_and_visualize, GPT4Tokenizer

text = "Hello world! 👋🌍"
print(tokenize_and_visualize(text, GPT4Tokenizer()))
# Output: Hello│ world│!│<bytes:20f09f>│<bytes:91>│<bytes:8b>│<bytes:f09f>│<bytes:8c>│<bytes:8d>

That friendly wave emoji? It's actually 3 tokens! The Earth emoji? Another 3 tokens. That's why emoji-rich text can be expensive.
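The cost comes straight from UTF-8: an emoji that looks like one character on screen is several bytes underneath, and byte-pair tokenizers operate on those bytes. A quick stdlib check (plain Python, no llmvision needed):

```python
# Each of these renders as one character, but occupies a different
# number of UTF-8 bytes -- the raw material a byte-pair tokenizer sees.
for ch in ["H", "é", "世", "👋", "🌍"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} UTF-8 byte(s) -> {encoded.hex()}")
```

ASCII letters are one byte each, accented Latin letters two, most CJK characters three, and emojis four, which is why the non-ASCII examples fragment into more tokens.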

Features

  • 🔍 Multiple Tokenizers: GPT-2, GPT-4, byte-level, character-level, and more
  • 👁️ Visual Representation: See token boundaries with clear separators
  • 📊 Token Statistics: Analyze token counts, categories, and patterns
  • 🌍 Unicode Handling: Properly handles emojis, multi-language text, and special characters
  • 🎯 LLM-Faithful: Shows actual tokenization used by real language models
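Grapheme clusters are where the gap between what you see and what the model sees is widest. The family emoji used in the cost examples further down is a single user-perceived character, yet plain Python already reveals the layers underneath:

```python
family = "👨‍👩‍👧‍👦"  # one grapheme cluster: 4 emoji joined by 3 zero-width joiners
print(len(family))                  # code points: 4 emoji + 3 ZWJs = 7
print(len(family.encode("utf-8")))  # UTF-8 bytes: 4*4 + 3*3 = 25
```

One on-screen character, seven code points, twenty-five bytes: each layer down multiplies the count the model has to pay for.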

Installation

pip install llmvision

Quick Start

Command Line

# Basic usage
llmvision "Hello world!"

# Show all tokenizers
llmvision "Hello world!" --all

# Show token indices
llmvision "Hello world!" --indices

# Show statistics
llmvision "Hello world!" --stats

Python API

from llmvision import tokenize_and_visualize, GPT4Tokenizer, SimpleTokenizer

# Quick visualization
text = "The tokenization process is fascinating!"
print(tokenize_and_visualize(text))

# Use specific tokenizer
gpt4 = GPT4Tokenizer()
print(tokenize_and_visualize(text, gpt4))

# Show token indices
print(tokenize_and_visualize(text, gpt4, show_indices=True))

Examples

See How LLMs Really See Text

from llmvision import show_tokenizer_comparison

# Compare different tokenizer views
show_tokenizer_comparison("Hello 世界! 🌍")

Understand Token Boundaries

from llmvision import GPT4Tokenizer

tokenizer = GPT4Tokenizer()
tokens = tokenizer.tokenize("Hello world!")
print(tokens)
# ['Hello', ' world', '!']
# Note: the space is part of 'world' token!
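That leading-space behavior comes from GPT-style pre-tokenization, which attaches a word's preceding space to the word itself. A minimal stdlib sketch of the idea (an illustration only, not llmvision's actual implementation):

```python
import re

# GPT-style pre-tokenizers keep the leading space with the following
# word, so ' world' (space included) becomes a single piece.
pieces = re.findall(r" ?\w+|[^\w\s]+", "Hello world!")
print(pieces)  # ['Hello', ' world', '!']
```

Keeping the space inside the token means the model never needs a separate "space" token between ordinary words, which shrinks sequence lengths considerably.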

Analyze Token Usage

from llmvision import analyze_tokens, SimpleTokenizer

text = "The quick brown fox jumps over the lazy dog."
tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize(text)
stats = analyze_tokens(tokens)

print(f"Total tokens: {stats['total_tokens']}")
print(f"Unique tokens: {stats['unique_tokens']}")
print(f"Average token length: {stats['avg_token_length']:.2f}")
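Those statistics are straightforward to derive from any token list. As a sketch of what such an analysis involves (`analyze_tokens_sketch` is a hypothetical stand-in, not the library's `analyze_tokens`):

```python
from collections import Counter

def analyze_tokens_sketch(tokens):
    # Tally duplicates, then derive the three summary numbers.
    counts = Counter(tokens)
    return {
        "total_tokens": len(tokens),
        "unique_tokens": len(counts),
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
    }

tokens = ["The", " quick", " brown", " fox"]
stats = analyze_tokens_sketch(tokens)
print(stats)
```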

Available Tokenizers

  • SimpleTokenizer: Basic word/punctuation/space tokenization
  • WordTokenizer: Whitespace-based tokenization
  • CharTokenizer: Character-level tokenization
  • GraphemeTokenizer: Unicode grapheme clusters (user-perceived characters)
  • ByteLevelTokenizer: Raw UTF-8 byte representation
  • GPT2Tokenizer: Actual GPT-2 tokenization (via tiktoken)
  • GPT4Tokenizer: Actual GPT-4 tokenization (via tiktoken)
  • SubwordTokenizer: Simple subword tokenization
  • LLMStyleTokenizer: Simulates common LLM tokenization patterns
  • SentencePieceStyleTokenizer: Simulates SentencePiece-style tokenization
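To make the byte-level view concrete, here is a minimal stand-alone sketch (an illustration of the concept, not the library's `ByteLevelTokenizer`):

```python
def byte_level_tokenize(text):
    # Represent each UTF-8 byte as its own token -- the raw view a
    # byte-level tokenizer has before any merge rules are applied.
    return [f"<0x{b:02x}>" for b in text.encode("utf-8")]

tokens = byte_level_tokenize("Hi🌍")
print(tokens)  # ['<0x48>', '<0x69>', '<0xf0>', '<0x9f>', '<0x8c>', '<0x8d>']
```

Two ASCII letters become two tokens, while the single Earth emoji becomes four, which is the same asymmetry the real tokenizers exhibit after merging.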

Understanding Token Costs

The same tokenizer can produce vastly different token counts for different kinds of text:

from llmvision import GPT4Tokenizer

examples = [
    "Hello world!",          # 3 tokens
    "Hello 世界!",           # 5 tokens (Chinese costs more!)
    "Hello 👋🌍!",           # 8 tokens (emojis are expensive!)
    "👨‍👩‍👧‍👦",                    # 18 tokens (!!)
]

tokenizer = GPT4Tokenizer()
for text in examples:
    tokens = tokenizer.tokenize(text)
    print(f"{text:20}{len(tokens)} tokens")
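A rough rule of thumb behind these numbers (an observation, not something the library computes): character count is a poor predictor of token count, while UTF-8 byte count tracks it much more closely for non-ASCII text. Plain Python makes the gap visible:

```python
# Same character count can hide very different byte counts,
# and token counts tend to follow the bytes.
for text in ["Hello world!", "Hello 世界!", "Hello 👋🌍!"]:
    print(f"{len(text)} chars, {len(text.encode('utf-8'))} bytes: {text}")
```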

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Acknowledgments

Built with tiktoken for accurate GPT tokenization.

