
LLMVision

See the world through the eyes of language models

LLMVision is a Python library for visualizing how Large Language Models (LLMs) tokenize text. It reveals the hidden world of tokens that LLMs actually see, including the often surprising ways they split words, handle Unicode, and represent emojis.

Why LLMVision?

Ever wondered why your LLM usage costs more for emoji-heavy text? Or why certain prompts seem to work better than others? LLMVision helps you understand by showing exactly how text is tokenized.

from llmvision import tokenize_and_visualize, GPT4Tokenizer

text = "Hello world! 👋🌍"
print(tokenize_and_visualize(text, GPT4Tokenizer()))
# Output: Hello│ world│!│<bytes:20f09f>│<bytes:91>│<bytes:8b>│<bytes:f09f>│<bytes:8c>│<bytes:8d>

That friendly wave emoji? It's actually 3 tokens! The Earth emoji? Another 3 tokens. That's why emoji-rich text can be expensive.
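The cost comes straight from UTF-8: an emoji that looks like one character on screen is several bytes underneath, and byte-pair tokenizers operate on those bytes. A quick stdlib check (plain Python, no llmvision needed):

```python
# Each of these renders as one character, but occupies a different
# number of UTF-8 bytes -- the raw material a byte-pair tokenizer sees.
for ch in ["H", "é", "世", "👋", "🌍"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} UTF-8 byte(s) -> {encoded.hex()}")
```

ASCII letters are one byte each, accented Latin letters two, most CJK characters three, and emojis four, which is why the non-ASCII examples fragment into more tokens.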

Features

  • 🔍 Multiple Tokenizers: GPT-2, GPT-4, byte-level, character-level, and more
  • 👁️ Visual Representation: See token boundaries with clear separators
  • 📊 Token Statistics: Analyze token counts, categories, and patterns
  • 🌍 Unicode Handling: Properly handles emojis, multi-language text, and special characters
  • 🎯 LLM-Faithful: Shows actual tokenization used by real language models
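Grapheme clusters are where the gap between what you see and what the model sees is widest. The family emoji used in the cost examples further down is a single user-perceived character, yet plain Python already reveals the layers underneath:

```python
family = "👨‍👩‍👧‍👦"  # one grapheme cluster: 4 emoji joined by 3 zero-width joiners
print(len(family))                  # code points: 4 emoji + 3 ZWJs = 7
print(len(family.encode("utf-8")))  # UTF-8 bytes: 4*4 + 3*3 = 25
```

One on-screen character, seven code points, twenty-five bytes: each layer down multiplies the count the model has to pay for.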

Installation

pip install llmvision

Quick Start

Command Line

# Basic usage
llmvision "Hello world!"

# Show all tokenizers
llmvision "Hello world!" --all

# Show token indices
llmvision "Hello world!" --indices

# Show statistics
llmvision "Hello world!" --stats

Python API

from llmvision import tokenize_and_visualize, GPT4Tokenizer, SimpleTokenizer

# Quick visualization
text = "The tokenization process is fascinating!"
print(tokenize_and_visualize(text))

# Use specific tokenizer
gpt4 = GPT4Tokenizer()
print(tokenize_and_visualize(text, gpt4))

# Show token indices
print(tokenize_and_visualize(text, gpt4, show_indices=True))

Examples

See How LLMs Really See Text

from llmvision import show_tokenizer_comparison

# Compare different tokenizer views
show_tokenizer_comparison("Hello 世界! 🌍")

Understand Token Boundaries

from llmvision import GPT4Tokenizer

tokenizer = GPT4Tokenizer()
tokens = tokenizer.tokenize("Hello world!")
print(tokens)
# ['Hello', ' world', '!']
# Note: the space is part of 'world' token!
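That leading-space behavior comes from GPT-style pre-tokenization, which attaches a word's preceding space to the word itself. A minimal stdlib sketch of the idea (an illustration only, not llmvision's actual implementation):

```python
import re

# GPT-style pre-tokenizers keep the leading space with the following
# word, so ' world' (space included) becomes a single piece.
pieces = re.findall(r" ?\w+|[^\w\s]+", "Hello world!")
print(pieces)  # ['Hello', ' world', '!']
```

Keeping the space inside the token means the model never needs a separate "space" token between ordinary words, which shrinks sequence lengths considerably.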

Analyze Token Usage

from llmvision import analyze_tokens, SimpleTokenizer

text = "The quick brown fox jumps over the lazy dog."
tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize(text)
stats = analyze_tokens(tokens)

print(f"Total tokens: {stats['total_tokens']}")
print(f"Unique tokens: {stats['unique_tokens']}")
print(f"Average token length: {stats['avg_token_length']:.2f}")
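Those statistics are straightforward to derive from any token list. As a sketch of what such an analysis involves (`analyze_tokens_sketch` is a hypothetical stand-in, not the library's `analyze_tokens`):

```python
from collections import Counter

def analyze_tokens_sketch(tokens):
    # Tally duplicates, then derive the three summary numbers.
    counts = Counter(tokens)
    return {
        "total_tokens": len(tokens),
        "unique_tokens": len(counts),
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
    }

tokens = ["The", " quick", " brown", " fox"]
stats = analyze_tokens_sketch(tokens)
print(stats)
```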

Available Tokenizers

  • SimpleTokenizer: Basic word/punctuation/space tokenization
  • WordTokenizer: Whitespace-based tokenization
  • CharTokenizer: Character-level tokenization
  • GraphemeTokenizer: Unicode grapheme clusters (user-perceived characters)
  • ByteLevelTokenizer: Raw UTF-8 byte representation
  • GPT2Tokenizer: Actual GPT-2 tokenization (via tiktoken)
  • GPT4Tokenizer: Actual GPT-4 tokenization (via tiktoken)
  • SubwordTokenizer: Simple subword tokenization
  • LLMStyleTokenizer: Simulates common LLM tokenization patterns
  • SentencePieceStyleTokenizer: Simulates SentencePiece-style tokenization
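To make the byte-level view concrete, here is a minimal stand-alone sketch (an illustration of the concept, not the library's `ByteLevelTokenizer`):

```python
def byte_level_tokenize(text):
    # Represent each UTF-8 byte as its own token -- the raw view a
    # byte-level tokenizer has before any merge rules are applied.
    return [f"<0x{b:02x}>" for b in text.encode("utf-8")]

tokens = byte_level_tokenize("Hi🌍")
print(tokens)  # ['<0x48>', '<0x69>', '<0xf0>', '<0x9f>', '<0x8c>', '<0x8d>']
```

Two ASCII letters become two tokens, while the single Earth emoji becomes four, which is the same asymmetry the real tokenizers exhibit after merging.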

Understanding Token Costs

The same tokenizer can produce vastly different token counts for different kinds of text:

from llmvision import GPT4Tokenizer

examples = [
    "Hello world!",          # 3 tokens
    "Hello 世界!",           # 5 tokens (Chinese costs more!)
    "Hello 👋🌍!",           # 8 tokens (emojis are expensive!)
    "👨‍👩‍👧‍👦",                    # 18 tokens (!!)
]

tokenizer = GPT4Tokenizer()
for text in examples:
    tokens = tokenizer.tokenize(text)
    print(f"{text:20}{len(tokens)} tokens")
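A rough rule of thumb behind these numbers (an observation, not something the library computes): character count is a poor predictor of token count, while UTF-8 byte count tracks it much more closely for non-ASCII text. Plain Python makes the gap visible:

```python
# Same character count can hide very different byte counts,
# and token counts tend to follow the bytes.
for text in ["Hello world!", "Hello 世界!", "Hello 👋🌍!"]:
    print(f"{len(text)} chars, {len(text.encode('utf-8'))} bytes: {text}")
```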

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Acknowledgments

Built with tiktoken for accurate GPT tokenization.

