
tokviz

File Structure

tokviz/
├── assets/
│   ├── example-deberta-v3-small.png
│   └── example-gpt2.png
├── tokviz/
│   ├── __init__.py
│   └── visualization.py
├── README.md
├── LICENSE
├── setup.py
└── pyproject.toml

tokviz is a Python library for visualizing tokenization patterns across different language models. This library offers a comprehensive platform for researchers, data scientists, and NLP enthusiasts to gain insights into how different language models process and tokenize text.

Key Features:

Model Comparison: The visualizer allows users to compare tokenization patterns across multiple language models, including popular models like GPT-2, DistilGPT-2, and DeBERTa-v3-small. By displaying color-coded tokens side by side, users can easily identify differences and similarities in tokenization behavior.

Flexible Input: Users can input any text of their choice, allowing for dynamic exploration of tokenization patterns across diverse textual inputs. Whether analyzing short sentences, paragraphs, or entire documents, the visualizer adapts to the user's input for comprehensive analysis.

Color-Coded Visualization: Tokens are color-coded based on their properties and index, providing a visually intuitive representation of tokenization patterns. This enables users to quickly identify individual tokens and patterns within the text, facilitating deeper analysis and interpretation.
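The color assignment described above can be as simple as cycling through a fixed palette by token index. The following is a minimal, dependency-free sketch of that idea; the `colorize_tokens` function, the palette, and the whitespace tokenizer are illustrative stand-ins, not tokviz's actual implementation:

```python
# Illustrative sketch: color-code tokens by index using ANSI escape codes.
# NOT tokviz's actual implementation -- just the general color-cycling idea.

PALETTE = ["\033[41m", "\033[42m", "\033[43m", "\033[44m", "\033[45m"]  # ANSI background colors
RESET = "\033[0m"

def colorize_tokens(tokens):
    """Wrap each token in a background color that cycles with its index."""
    return "".join(f"{PALETTE[i % len(PALETTE)]}{tok}{RESET}" for i, tok in enumerate(tokens))

# Stand-in for a real model tokenizer: naive whitespace splitting.
tokens = "Tokenization patterns vary across models".split()
print(colorize_tokens(tokens))
```

In a terminal, adjacent tokens render in different background colors, so token boundaries are visible even when tokens are not separated by spaces.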

Installation

You can install tokviz via pip:

pip install tokviz

Usage

from tokviz import token_visualizer

# Define input text
text = "In this example, the get_color function would need to be adjusted based on the specific properties of your model's tokenizer. \
You might want to inspect the special tokens, check if a token is part of a special group, \
or use any other relevant information provided by the tokenizer. \
Keep in mind that the color logic may vary depending on the model, \
so you need to tailor it to your specific use case."

# Compare tokenization across different language models
token_visualizer(text, models=['microsoft/deberta-v3-small', 'openai-community/gpt2'])

This will visualize tokenization patterns for the input text using the specified language models. You can pass a list of model names or identifiers to the models parameter. By default, it compares tokenization with the GPT-2 model.
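Conceptually, the comparison boils down to tokenizing the same text with each model's tokenizer and lining up the results. Here is a rough sketch of that flow using toy tokenizers in place of real model tokenizers (which in practice would come from e.g. Hugging Face's `AutoTokenizer`); `compare_tokenizations` and both toy tokenizers are hypothetical, for illustration only:

```python
# Illustrative only: toy tokenizers stand in for real model tokenizers.

def whitespace_tokenizer(text):
    """Toy tokenizer: split on whitespace."""
    return text.split()

def char_bigram_tokenizer(text):
    """Toy tokenizer: split into two-character chunks (spaces shown as '_')."""
    cleaned = text.replace(" ", "_")
    return [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]

def compare_tokenizations(text, tokenizers):
    """Return {name: token list} so differences can be inspected side by side."""
    return {name: fn(text) for name, fn in tokenizers.items()}

result = compare_tokenizations(
    "token visualizer",
    {"whitespace": whitespace_tokenizer, "char-bigram": char_bigram_tokenizer},
)
for name, toks in result.items():
    print(f"{name:>12}: {len(toks):2d} tokens -> {toks}")
```

The same text yields different token counts and boundaries per tokenizer, which is exactly the difference the color-coded view makes visible at a glance.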

[Example visualizations for microsoft/deberta-v3-small and GPT-2: see assets/example-deberta-v3-small.png and assets/example-gpt2.png.]

References

This library is based on the notebook LLM Tokenizer Visualizer.

Version

0.1
