A comparison tool of Japanese tokenizers
toiro
Toiro is a comparison tool of Japanese tokenizers.
- Compare the processing speed of tokenizers
- Compare how each tokenizer segments the same text into words
- Compare the performance of tokenizers by benchmarking application tasks (e.g., text classification)
It also provides useful functions for natural language processing in Japanese.
- Data downloader for Japanese text corpora
- Preprocessor of these corpora
- Text classifier for Japanese text (e.g., SVM, BERT)
Installation
Python 3.10+ is required. You can install toiro with the following command. Janome is included in the default installation.
pip install toiro
Adding a tokenizer to toiro
If you want to add a tokenizer to toiro, please install it individually. This is an example of adding SudachiPy and nagisa to toiro.
pip install sudachipy sudachidict_core
pip install nagisa
How to install other tokenizers
pip install mecab-python3
pip install spacy ginza
pip install spacy[ja]
KyTea itself must be installed before the Python wrapper. Please refer to the KyTea installation instructions.
pip install kytea
Juman++ v2 itself must be installed before pyknp can use it. Please refer to the Juman++ installation instructions.
pip install pyknp
pip install sentencepiece
pip install fugashi ipadic
pip install fugashi unidic-lite
pip install tinysegmenter3
pip install tiktoken
If you want to install all the tokenizers at once, use the following command.
pip install toiro[all_tokenizers]
Getting started
You can check which tokenizers are available in your Python environment.
from toiro import tokenizers
available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers)
Toiro supports 12 different Japanese tokenizers and 1 BPE tokenizer. The output below comes from an environment where SudachiPy and nagisa have been added to the default installation.
{'nagisa': {'is_available': True, 'version': '0.2.7'},
'janome': {'is_available': True, 'version': '0.3.10'},
'mecab-python3': {'is_available': False, 'version': False},
'sudachipy': {'is_available': True, 'version': '0.4.9'},
'spacy': {'is_available': False, 'version': False},
'ginza': {'is_available': False, 'version': False},
'kytea': {'is_available': False, 'version': False},
'jumanpp': {'is_available': False, 'version': False},
'sentencepiece': {'is_available': False, 'version': False},
'fugashi-ipadic': {'is_available': False, 'version': False},
'fugashi-unidic': {'is_available': False, 'version': False},
'tinysegmenter': {'is_available': False, 'version': False},
'tiktoken': {'is_available': False, 'version': False}}
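The dictionary above can be filtered with plain Python to see which tokenizers are ready to use. The following is a minimal sketch that assumes only the dictionary shape shown in the output; the `availability` literal is a shortened copy of that output, and the variable names are illustrative rather than part of toiro.

```python
# Shortened copy of the dictionary printed above (illustrative data).
availability = {
    "nagisa": {"is_available": True, "version": "0.2.7"},
    "janome": {"is_available": True, "version": "0.3.10"},
    "mecab-python3": {"is_available": False, "version": False},
    "sudachipy": {"is_available": True, "version": "0.4.9"},
}

# Map each installed tokenizer to its version.
installed = {
    name: info["version"]
    for name, info in availability.items()
    if info["is_available"]
}

# Collect the tokenizers that still need to be installed.
missing = [name for name, info in availability.items() if not info["is_available"]]

print("installed:", installed)
print("missing:", missing)
```

In a real session, `availability` would be the return value of `tokenizers.available_tokenizers()`.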
Download the livedoor news corpus and compare the processing speed of tokenizers.
from toiro import tokenizers
from toiro import datadownloader
# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
#=> ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']
# Download the livedoor news corpus and load it as pandas.DataFrame
corpus = corpora[0]
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
texts = train_df[1]
# Compare the processing speed of tokenizers
report = tokenizers.compare(texts)
#=> [1/3] Tokenizer: janome
#=> 100%|███████████████████| 5900/5900 [00:07<00:00, 746.21it/s]
#=> [2/3] Tokenizer: nagisa
#=> 100%|███████████████████| 5900/5900 [00:15<00:00, 370.83it/s]
#=> [3/3] Tokenizer: sudachipy
#=> 100%|███████████████████| 5900/5900 [00:08<00:00, 696.68it/s]
print(report)
{'execution_environment': {'python_version': '3.7.8.final.0 (64 bit)',
'arch': 'X86_64',
'brand_raw': 'Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz',
'count': 8},
'data': {'number_of_sentences': 5900, 'average_length': 37.69593220338983},
'janome': {'elapsed_time': 9.114670515060425},
'nagisa': {'elapsed_time': 15.873093605041504},
'sudachipy': {'elapsed_time': 9.05256724357605}}
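Elapsed times are easier to compare once they are converted to throughput. The sketch below assumes only the report layout shown above (a `data` entry plus one `elapsed_time` entry per tokenizer) and uses a copy of those numbers; `throughput` and `ranking` are illustrative names, not toiro functions.

```python
# Copy of the relevant parts of the report printed above (illustrative data).
report = {
    "data": {"number_of_sentences": 5900, "average_length": 37.69593220338983},
    "janome": {"elapsed_time": 9.114670515060425},
    "nagisa": {"elapsed_time": 15.873093605041504},
    "sudachipy": {"elapsed_time": 9.05256724357605},
}

sentence_count = report["data"]["number_of_sentences"]

# Only tokenizer entries carry an "elapsed_time" key.
throughput = {
    name: sentence_count / entry["elapsed_time"]
    for name, entry in report.items()
    if "elapsed_time" in entry
}

# Fastest tokenizer first.
ranking = sorted(throughput, key=throughput.get, reverse=True)

for name in ranking:
    print(f"{name}: {throughput[name]:.1f} sentences/s")
```

Note that these figures reflect a single run on one machine, so small gaps such as the one between janome and sudachipy may not be significant.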
# Compare how each tokenizer segments the same text
text = "都庁所在地は新宿区。"
tokenizers.print_words(text, delimiter="|")
#=> janome: 都庁|所在地|は|新宿|区|。
#=> nagisa: 都庁|所在|地|は|新宿|区|。
#=> sudachipy: 都庁|所在地|は|新宿区|。
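To pin down exactly where two segmentations differ, each result can be turned into a set of token-end offsets and compared against a reference. This is a minimal sketch, not a toiro feature: the `boundaries` helper is hypothetical, and the strings are copied from the output above.

```python
def boundaries(segmented, delimiter="|"):
    """Return the character offsets at which a token ends."""
    offsets = set()
    position = 0
    for token in segmented.split(delimiter):
        position += len(token)
        offsets.add(position)
    return offsets


# Segmentations copied from the print_words output above.
outputs = {
    "janome": "都庁|所在地|は|新宿|区|。",
    "nagisa": "都庁|所在|地|は|新宿|区|。",
    "sudachipy": "都庁|所在地|は|新宿区|。",
}

# Use janome as the reference; any tokenizer could play this role.
reference = boundaries(outputs["janome"])

for name, segmented in outputs.items():
    found = boundaries(segmented)
    extra = sorted(found - reference)    # splits the reference does not make
    merged = sorted(reference - found)   # reference splits this tokenizer skips
    print(f"{name}: extra splits at {extra}, merged at {merged}")
```

Here nagisa adds a split after the fourth character (所在|地), while sudachipy drops the split after the eighth (keeping 新宿区 as one word).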
Run toiro in Docker
You can use all tokenizers by running a Docker container from the image published on Docker Hub.
docker run --rm -it taishii/toiro /bin/bash
How to run the Python interpreter in the Docker container
Run the Python interpreter.
root@cdd2ad2d7092:/workspace# python3
Compare how each tokenizer segments the same text
>>> from toiro import tokenizers
>>> text = "都庁所在地は新宿区。"
>>> tokenizers.print_words(text, delimiter="|")
mecab-python3: 都庁|所在地|は|新宿|区|。
janome: 都庁|所在地|は|新宿|区|。
nagisa: 都庁|所在|地|は|新宿|区|。
sudachipy: 都庁|所在地|は|新宿区|。
spacy: 都庁|所在|地|は|新宿|区|。
ginza: 都庁|所在地|は|新宿区|。
kytea: 都庁|所在|地|は|新宿|区|。
jumanpp: 都庁|所在|地|は|新宿|区|。
sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
fugashi-ipadic: 都庁|所在地|は|新宿|区|。
fugashi-unidic: 都庁|所在|地|は|新宿|区|。
tinysegmenter: 都庁所|在地|は|新宿|区|。
tiktoken_gpt4o: 都|�|�|所在地|は|新|宿|区|。
tiktoken_gpt5: 都|�|�|所在地|は|新|宿|区|。
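With this many tokenizers, it helps to group the ones that produce identical segmentations. The sketch below parses a subset of the lines printed above (copied verbatim, tiktoken lines omitted) and is only an illustration; `printed` and `groups` are hypothetical names, and toiro itself prints the lines rather than returning them in this form.

```python
from collections import defaultdict

# Subset of the print_words output above, one "name: segmentation" per line.
printed = """\
mecab-python3: 都庁|所在地|は|新宿|区|。
janome: 都庁|所在地|は|新宿|区|。
nagisa: 都庁|所在|地|は|新宿|区|。
sudachipy: 都庁|所在地|は|新宿区|。
spacy: 都庁|所在|地|は|新宿|区|。
ginza: 都庁|所在地|は|新宿区|。
kytea: 都庁|所在|地|は|新宿|区|。
jumanpp: 都庁|所在|地|は|新宿|区|。
sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
fugashi-ipadic: 都庁|所在地|は|新宿|区|。
fugashi-unidic: 都庁|所在|地|は|新宿|区|。
tinysegmenter: 都庁所|在地|は|新宿|区|。
"""

# Map each distinct segmentation to the tokenizers that produced it.
groups = defaultdict(list)
for line in printed.splitlines():
    name, _, segmented = line.partition(": ")
    groups[segmented].append(name)

for segmented, names in groups.items():
    print(f"{segmented}  <-  {', '.join(names)}")
```

For this sentence the twelve tokenizers fall into five groups, which largely follow the dictionary each one relies on.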
Get more information about toiro
The slides at PyCon JP 2020
Tutorials in Japanese