Library for visualizing tokenization patterns across different language models

tokviz

File Structure

tokviz/
├── assets/
│   ├── example-deberta-v3-small.png
│   └── example-gpt2.png
├── tokviz/
│   ├── __init__.py
│   └── visualization.py
├── README.md
├── LICENSE
├── setup.py
└── pyproject.toml

tokviz is a Python library for visualizing tokenization patterns across different language models. It helps researchers, data scientists, and NLP enthusiasts see how different language models split the same text into tokens.

Key Features:

Model Comparison: Compare tokenization patterns across multiple language models, such as GPT-2, DistilGPT-2, and DeBERTa-v3-small. Color-coded tokens are displayed side by side, so differences and similarities in tokenization behavior are easy to spot.

Flexible Input: Pass in any text of your choice, whether a short sentence, a paragraph, or an entire document, and explore how its tokenization changes across inputs.

Color-Coded Visualization: Tokens are color-coded based on their properties and index, so individual tokens and recurring patterns in the text can be identified at a glance.
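To make the idea of coloring tokens by index concrete, here is a minimal sketch. It is not tokviz's implementation: the palette, the color_for and render_tokens_html helpers, and the toy token list are all hypothetical, chosen only to illustrate cycling colors by token position.

```python
from html import escape

# Hypothetical palette; tokviz's actual colors are not documented here.
PALETTE = ["#ffd6a5", "#caffbf", "#9bf6ff", "#bdb2ff"]


def color_for(index: int) -> str:
    """Pick a background color by cycling through the palette on token index."""
    return PALETTE[index % len(PALETTE)]


def render_tokens_html(tokens: list[str]) -> str:
    """Wrap each token in a <span> whose background depends on its position."""
    spans = [
        f'<span style="background:{color_for(i)}">{escape(tok)}</span>'
        for i, tok in enumerate(tokens)
    ]
    return "".join(spans)


# Toy tokens standing in for a real tokenizer's output.
tokens = ["Token", "ization", " is", " fun", "!"]
html = render_tokens_html(tokens)
print(html)
```

Because adjacent tokens always receive different colors, token boundaries stay visible even where the tokenizer splits a word in the middle.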

Installation

You can install tokviz via pip:

pip install tokviz

Usage

from tokviz import token_visualizer

# Define input text (adjacent string literals are concatenated into one string)
text = (
    "In this example, the get_color function would need to be adjusted based on the specific properties of your model's tokenizer. "
    "You might want to inspect the special tokens, check if a token is part of a special group, "
    "or use any other relevant information provided by the tokenizer. "
    "Keep in mind that the color logic may vary depending on the model, "
    "so you need to tailor it to your specific use case."
)

# Compare tokenization across different language models
token_visualizer(text, models=['microsoft/deberta-v3-small', 'openai-community/gpt2'])

This visualizes tokenization patterns for the input text using the specified language models. Pass a list of model names or identifiers to the models parameter. If models is not specified, the GPT-2 model is used by default.
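The comparison step can be pictured as running the same text through several tokenizers and collecting the results per model. The sketch below is an illustration under stated assumptions, not tokviz's code: compare_tokenizers and both toy tokenizers are hypothetical stand-ins that run offline. The model identifiers in the usage example are Hugging Face Hub names, so with the transformers package installed a real tokenizer could be swapped in as noted in the comments. Whether tokviz does this internally is not confirmed by this page.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    """Toy stand-in: split on whitespace."""
    return text.split()


def char_pair_tokenize(text: str) -> List[str]:
    """Toy stand-in: fixed two-character chunks."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def compare_tokenizers(text: str, tokenizers: Dict[str, Tokenizer]) -> Dict[str, List[str]]:
    """Run every tokenizer on the same text and return tokens keyed by name."""
    return {name: tokenize(text) for name, tokenize in tokenizers.items()}


# Assumption: with `transformers` installed, a real tokenizer could replace a toy one, e.g.
#   from transformers import AutoTokenizer
#   gpt2_tokenize = AutoTokenizer.from_pretrained("openai-community/gpt2").tokenize
text = "tokviz compares tokenizers"
results = compare_tokenizers(
    text,
    {"whitespace": whitespace_tokenize, "char_pairs": char_pair_tokenize},
)
for name, toks in results.items():
    print(f"{name}: {len(toks)} tokens")
```

Keeping each tokenizer behind a plain callable means the comparison logic does not depend on how any particular model produces its tokens.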

[Figures: example output for deberta-v3-small (assets/example-deberta-v3-small.png) and for GPT-2 (assets/example-gpt2.png)]

References

This library is based on the notebook "LLM Tokenizer Visualizer".

Project details


This version

0.1
