Qtok: quality control tool for tokenization
Qtok: Quality Control Tool for Tokenizers
Qtok is a Python-based tool designed for quality control and analysis of tokenizers used in natural language processing (NLP) tasks.
Features
- Analyze multiple tokenizer vocabularies simultaneously
- Generate statistics on token distribution
- Produce visualizations of token characteristics
- Compare multiple tokenizers
- Analyze Unicode coverage
- Assess language-specific token distributions (Latin and Cyrillic scripts)
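To make the script-specific analysis concrete, here is a minimal sketch of one way to bucket vocabulary tokens into Latin, Cyrillic, mixed, and other. It is an illustration only and is not taken from Qtok's source: the function names `token_script` and `count_scripts` are hypothetical, and the rule of matching on Unicode character names is an assumption.

```python
import unicodedata
from collections import Counter
from typing import Iterable


def token_script(token: str) -> str:
    """Classify a token by the scripts of its alphabetic characters.

    Hypothetical helper: the script is inferred from the Unicode character
    name, which is a simplification (for example, fullwidth Latin letters
    end up in "other").
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits, punctuation, and whitespace carry no script
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            scripts.add("latin")
        elif name.startswith("CYRILLIC"):
            scripts.add("cyrillic")
        else:
            scripts.add("other")
    if not scripts:
        return "other"  # token contains no letters at all
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed"


def count_scripts(vocab: Iterable[str]) -> Counter:
    """Count how many tokens fall into each script bucket."""
    return Counter(token_script(token) for token in vocab)
```

Calling `count_scripts` on a tokenizer's vocabulary keys gives a per-script tally that can be compared across tokenizers.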
Installation
You can install Qtok using pip:
pip install qtok
Or clone the repository and install:
git clone https://github.com/nup-csai/Qtok.git
cd Qtok
pip install .
Usage
Qtok can be used as a command-line tool:
qtok -i /path/to/tokenizer1.json /path/to/tokenizer2.json ... -l label1 label2 ... -o /path/to/output/folder
Arguments:
- -i: Paths to the tokenizer JSON files or URLs (required, multiple inputs accepted)
- -l: Labels for the tokenizers (required, must match the number of input files)
- -o: Output folder for results (required)
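When Qtok is driven from a script, the requirement that labels match inputs one-to-one can be checked before the command is launched. The sketch below only assembles the argument list from the three flags documented above; the helper name `build_qtok_command` and the example paths and labels are hypothetical.

```python
import shlex
from typing import List, Sequence


def build_qtok_command(
    inputs: Sequence[str], labels: Sequence[str], output_dir: str
) -> List[str]:
    """Assemble a qtok invocation using the documented -i, -l and -o flags."""
    if not inputs:
        raise ValueError("at least one tokenizer file or URL is required")
    if len(inputs) != len(labels):
        raise ValueError(
            f"got {len(inputs)} inputs but {len(labels)} labels; counts must match"
        )
    return ["qtok", "-i", *inputs, "-l", *labels, "-o", output_dir]


# Hypothetical paths and labels, shown only to illustrate the shape of the call.
command = build_qtok_command(
    ["tokenizers/model_a.json", "tokenizers/model_b.json"],
    ["model_a", "model_b"],
    "qtok_output",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once Qtok is installed.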
Output
Qtok generates several output files:
- basic_stats.tsv and basic_stats.png: Basic statistics of the tokenizers
- unicode_stats.tsv and unicode_stats.png: Unicode coverage statistics
- latin_stats.tsv and latin_stats.png: Statistics for Latin script tokens
- cyrillic_stats.tsv and cyrillic_stats.png: Statistics for Cyrillic script tokens
- report.html: An HTML report summarizing all analyses
- report.tex and report.pdf: LaTeX and PDF versions of the report (if pdflatex is installed)
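The TSV files can be picked up for further analysis with the standard library alone. The following is a rough sketch: the file names come from the list above, but the helper `load_qtok_tables` is hypothetical, and it assumes each TSV starts with a header row, which this README does not specify.

```python
import csv
from pathlib import Path
from typing import Dict, List

# File stems taken from the output list above.
TABLE_NAMES = ("basic_stats", "unicode_stats", "latin_stats", "cyrillic_stats")


def load_qtok_tables(output_dir: str) -> Dict[str, List[Dict[str, str]]]:
    """Read whichever Qtok TSV tables exist in output_dir.

    Assumes (not confirmed by the README) that every file has a header row.
    All values are returned as strings; convert them as needed.
    """
    tables: Dict[str, List[Dict[str, str]]] = {}
    for name in TABLE_NAMES:
        path = Path(output_dir) / f"{name}.tsv"
        if not path.is_file():
            continue  # skip tables that were not generated
        with path.open(newline="", encoding="utf-8") as handle:
            tables[name] = list(csv.DictReader(handle, delimiter="\t"))
    return tables
```

Since pandas is already a requirement, `pandas.read_csv(path, sep="\t")` is an equally reasonable choice for loading a single table.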
Requirements
- Python 3.6+
- matplotlib
- numpy
- pandas
- requests
Reproducibility
For full tables and data, please refer to the Jupyter notebook available at:
Contributing
Contributions to Qtok are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors
- Aleksey Komissarov
- Iaroslav Chelombitko
- Egor Safronov
Contact
For any queries, please contact ad3002@gmail.com.
Acknowledgments
- Thanks to all contributors and users of Qtok
- Special thanks to the NLP community for inspiration and support
Download files
Source Distribution
File details
Details for the file Qtok-0.10.6.tar.gz.
File metadata
- Download URL: Qtok-0.10.6.tar.gz
- Upload date:
- Size: 18.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 44683fb0bbfdeba3ed3574169e5557429e36ad753f72d90dc9d3155f138e666c |
| MD5 | d9aa451e211b6a701c7c8106e810beda |
| BLAKE2b-256 | 377ddc91925554f70ed42f9a55b03ff991bce0f702cbad31fb9bac222097b0ff |
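A downloaded source distribution can be checked against the SHA256 digest listed above. This is a minimal sketch using Python's `hashlib`; the helper names are hypothetical, and the usage comment assumes the archive was saved to the current directory under its original file name.

```python
import hashlib

# SHA256 digest of Qtok-0.10.6.tar.gz as listed in the table above.
EXPECTED_SHA256 = "44683fb0bbfdeba3ed3574169e5557429e36ad753f72d90dc9d3155f138e666c"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example (assumes the archive is in the current directory):
# matches_sha256("Qtok-0.10.6.tar.gz", EXPECTED_SHA256)
```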