A package to visualize tokenization of text using HTML
Tokenizer Viz
Tokenizer Viz is a Python package that generates HTML to visualize the tokenization of text. It highlights tokens with different colors and customizable styles, making it easier to understand how a text is tokenized.
Project Layout
tokenizer-viz/
│
├── tokenizer_viz/
│ ├── __init__.py
│ └── viz_utils.py
│
├── .gitignore
├── LICENSE
├── README.md
└── setup.py
Installation
You can install the tokenizer-viz package using pip:
pip install tokenizer-viz
Usage
Here's a quick example of how to use the package:
Usage with a list of strings
from tokenizer_viz.viz_utils import get_visualization
from IPython.display import HTML
tokens = ['This', ' ', 'is', ' ', 'an', ' ', 'example', ' ', 'sentence']
html = get_visualization(tokens)
# Display the generated HTML
HTML(html)
Output: the tokens rendered as colored HTML spans.
Usage with an encoder and decoder
from tokenizer_viz.viz_utils import get_visualization
from IPython.display import HTML
ascii_encoder = lambda x: [ord(char) for char in x]
ascii_decoder = lambda x: ''.join([chr(int(char)) for char in x])
corpus = "This is an example sentence"
html = get_visualization(
tokens=ascii_encoder(corpus),
decoder=ascii_decoder,
font_weight='regular',
)
# Display the generated HTML in the notebook (or wherever you're running this)
HTML(html)
Output: the tokens rendered as colored HTML spans.
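Any callable that maps a list of token ids back to a string can serve as the decoder. As a sketch (the `word_encoder`/`word_decoder` names below are illustrative, not part of the package), a small vocabulary-based word-level tokenizer might look like:

```python
import re

corpus = "This is an example sentence"

# Build a vocabulary over words and whitespace runs.
pieces = re.findall(r'\S+|\s+', corpus)
vocab = sorted(set(pieces))
stoi = {piece: i for i, piece in enumerate(vocab)}

def word_encoder(text):
    # Map each word/whitespace piece to its vocabulary id.
    return [stoi[piece] for piece in re.findall(r'\S+|\s+', text)]

def word_decoder(ids):
    # Map ids back to their string pieces and rejoin them.
    return ''.join(vocab[i] for i in ids)

assert word_decoder(word_encoder(corpus)) == corpus

# With tokenizer-viz installed, the ids could then be rendered as:
# html = get_visualization(tokens=word_encoder(corpus), decoder=word_decoder)
```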
The get_visualization function accepts several optional parameters to customize the appearance and layout of the tokens:
- tokens
- decoder (default=None)
- cmap (default='Pastel1')
- font_family (default='Courier New')
- font_size (default='1.1em')
- unk_token (default='???')
- font_weight (default='bold')
- padding (default='2px')
- margin_right (default='1px')
- border_radius (default='3px')
- display_inline (default=False)
Please refer to the function docstrings for a detailed description of each parameter.
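Note that IPython's HTML() only renders inside a notebook. Outside one, the markup returned by get_visualization can simply be written to a file and opened in a browser. A minimal helper might look like this (save_visualization is our own name here, not part of the package):

```python
import pathlib
import tempfile
import webbrowser

def save_visualization(html: str, path: str) -> str:
    """Write generated markup to a standalone HTML file and return its path."""
    p = pathlib.Path(path)
    p.write_text(html, encoding='utf-8')
    return str(p)

# Placeholder markup standing in for get_visualization's real output:
sample = "<span style='background:#fbb4ae'>This</span><span>&nbsp;</span>"
out = save_visualization(sample, tempfile.gettempdir() + '/tokens.html')
# webbrowser.open('file://' + out)  # uncomment to open in the default browser
```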
License
This project is licensed under the MIT License.