Qtok: Quality Control Tool for Tokenizers
Qtok is a Python-based tool designed for quality control and analysis of tokenizers used in natural language processing (NLP) tasks.
Features
- Analyze tokenizer vocabularies
- Generate statistics on token distribution
- Produce visualizations of token characteristics
- Compare multiple tokenizers
- Analyze Unicode coverage
- Assess language-specific token distributions (Latin and Cyrillic scripts)
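To make the last feature concrete, here is a minimal sketch of one way to sort token strings into Latin and Cyrillic buckets using only the standard library. It illustrates the idea and is not Qtok's own implementation; the function names `token_script` and `script_distribution` are hypothetical, and the rule "classify by the Unicode character name prefix" is an assumption made for this example.

```python
import unicodedata
from collections import Counter


def token_script(token):
    """Return 'latin', 'cyrillic', 'mixed', or 'other' for one token string.

    Hypothetical helper: the script is guessed from the Unicode character
    name of each alphabetic character in the token.
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits, punctuation, and whitespace do not decide the script
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            scripts.add("latin")
        elif name.startswith("CYRILLIC"):
            scripts.add("cyrillic")
        else:
            scripts.add("other")
    if not scripts:
        return "other"  # no alphabetic characters at all
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed"


def script_distribution(tokens):
    """Count how many tokens fall into each script bucket."""
    return Counter(token_script(t) for t in tokens)
```

Note that byte-level tokenizers often use marker characters such as "Ġ", which is itself a Latin letter in Unicode, so a real analysis would probably need to strip such markers first.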
Installation
You can install Qtok using pip:
pip install qtok
Or clone the repository and install:
git clone https://github.com/nup-csai/Qtok.git
cd Qtok
pip install .
Usage
Qtok can be used as a command-line tool:
qtok -i /path/to/tokenizer.json -l tokenizer_label -o /path/to/output/folder
Arguments:
- -i: Path to the tokenizer JSON file (required)
- -l: Label for the tokenizer (optional, default is "default")
- -o: Output folder for results (required)
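When several tokenizers need to be analyzed, the command can also be driven from a script. The sketch below wraps the documented -i, -l, and -o flags with `subprocess`; it assumes the `qtok` executable is on your PATH after installation, and the helper names `build_qtok_command` and `run_qtok` are hypothetical, not part of Qtok.

```python
import subprocess


def build_qtok_command(tokenizer_path, output_dir, label=None):
    """Build the argument list for one qtok run.

    Only the flags documented above are used: -i, -l, and -o.
    """
    cmd = ["qtok", "-i", str(tokenizer_path)]
    if label is not None:
        cmd += ["-l", label]  # omitted label falls back to qtok's own default
    cmd += ["-o", str(output_dir)]
    return cmd


def run_qtok(tokenizer_path, output_dir, label=None):
    """Run qtok once; raises CalledProcessError if it exits with an error."""
    cmd = build_qtok_command(tokenizer_path, output_dir, label)
    return subprocess.run(cmd, check=True)
```

A call such as `run_qtok("/path/to/tokenizer.json", "/path/to/output/folder", label="tokenizer_label")` is then equivalent to the command line shown above.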
Output
Qtok generates several output files:
- basic_stats.tsv and basic_stats.png: Basic statistics of the tokenizer
- unicode_stats.tsv and unicode_stats.png: Unicode coverage statistics
- latin_stats.tsv and latin_stats.png: Statistics for Latin script tokens
- cyrillic_stats.tsv and cyrillic_stats.png: Statistics for Cyrillic script tokens
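After a run, it can be handy to check that all eight files are present and to peek at a table. The following is a minimal sketch that assumes the files are written directly into the output folder and are UTF-8 encoded; the column layout of the TSV files is not described here, so the reader simply returns raw rows. The helper names are hypothetical.

```python
import csv
from pathlib import Path

# Base names of the reports listed above; each has a .tsv and a .png file.
REPORTS = ("basic_stats", "unicode_stats", "latin_stats", "cyrillic_stats")


def expected_outputs(output_dir):
    """Return the eight file paths Qtok is documented to produce."""
    out = Path(output_dir)
    return [out / f"{name}{ext}" for name in REPORTS for ext in (".tsv", ".png")]


def missing_outputs(output_dir):
    """Return the expected files that do not exist in output_dir."""
    return [p for p in expected_outputs(output_dir) if not p.is_file()]


def read_tsv(path):
    """Read a tab-separated file into a list of rows (lists of strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```

An empty list from `missing_outputs("/path/to/output/folder")` suggests the run completed and wrote every report.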
Requirements
- Python 3.6+
- matplotlib
- numpy
- pandas
Contributing
Contributions to Qtok are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors
- Aleksey Komissarov
- Iaroslav Chelombitko
- Egor Safronov
Contact
For any queries, please contact ad3002@gmail.com.
Acknowledgments
- Thanks to all contributors and users of Qtok
- Special thanks to the NLP community for inspiration and support