A comparison tool of Japanese tokenizers

toiro

Toiro is a tool for comparing Japanese tokenizers.

  • Compare the processing speed of tokenizers
  • Compare how each tokenizer segments the same text into words
  • Compare the performance of tokenizers by benchmarking application tasks (e.g., text classification)

It also provides useful functions for natural language processing in Japanese.

  • Data downloader for Japanese text corpora
  • Preprocessor of these corpora
  • Text classifier for Japanese text (e.g., SVM, BERT)

Installation

Python 3.6+ is required. You can install toiro with the following command. Janome is included in the default installation.

pip install toiro

Adding a tokenizer to toiro

If you want to add a tokenizer to toiro, please install it individually. This is an example of adding SudachiPy and nagisa to toiro.

pip install sudachipy sudachidict_core
pip install nagisa
How to install other tokenizers

mecab-python3

pip install mecab-python3==0.996.5

GiNZA

pip install spacy ginza

spaCy

pip install spacy[ja]

KyTea

KyTea itself must be installed before the Python wrapper. Please refer to the KyTea project's installation instructions.

pip install kytea

Juman++ v2

Juman++ v2 itself must be installed before the Python wrapper. Please refer to the Juman++ project's installation instructions.

pip install pyknp

SentencePiece

pip install sentencepiece

fugashi-ipadic

pip install fugashi ipadic

fugashi-unidic

pip install fugashi unidic-lite

tinysegmenter

pip install tinysegmenter3

If you want to install all the tokenizers at once, use the following command.

pip install toiro[all_tokenizers]

Getting started

You can check which tokenizers are available in your Python environment.

from toiro import tokenizers

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers)

Toiro supports 12 Japanese tokenizers. The output below comes from an environment where SudachiPy and nagisa have been added to the default installation.

{'nagisa': {'is_available': True, 'version': '0.2.7'},
 'janome': {'is_available': True, 'version': '0.3.10'},
 'mecab-python3': {'is_available': False, 'version': False},
 'sudachipy': {'is_available': True, 'version': '0.4.9'},
 'spacy': {'is_available': False, 'version': False},
 'ginza': {'is_available': False, 'version': False},
 'kytea': {'is_available': False, 'version': False},
 'jumanpp': {'is_available': False, 'version': False},
 'sentencepiece': {'is_available': False, 'version': False},
 'fugashi-ipadic': {'is_available': False, 'version': False},
 'fugashi-unidic': {'is_available': False, 'version': False},
 'tinysegmenter': {'is_available': False, 'version': False}}
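
The returned dictionary can be filtered to get just the names of the usable tokenizers. The following is a minimal sketch, assuming the return value has the shape printed above. `usable_tokenizers` is a hypothetical helper, not part of toiro, and `sample_status` is a shortened copy of that output rather than a live call.

```python
def usable_tokenizers(status):
    """Return the sorted names of tokenizers flagged as available."""
    return sorted(name for name, info in status.items() if info["is_available"])


# Shortened copy of the output shown above (assumption: same structure).
sample_status = {
    "nagisa": {"is_available": True, "version": "0.2.7"},
    "janome": {"is_available": True, "version": "0.3.10"},
    "mecab-python3": {"is_available": False, "version": False},
    "sudachipy": {"is_available": True, "version": "0.4.9"},
}

print(usable_tokenizers(sample_status))
#=> ['janome', 'nagisa', 'sudachipy']
```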

Download the livedoor news corpus and compare the processing speed of tokenizers.

from toiro import tokenizers
from toiro import datadownloader

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
#=> ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']

# Download the livedoor news corpus and load it as pandas.DataFrame
corpus = corpora[0]
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
texts = train_df[1]

# Compare the processing speed of tokenizers
report = tokenizers.compare(texts)
#=> [1/3] Tokenizer: janome
#=> 100%|███████████████████| 5900/5900 [00:07<00:00, 746.21it/s]
#=> [2/3] Tokenizer: nagisa
#=> 100%|███████████████████| 5900/5900 [00:15<00:00, 370.83it/s]
#=> [3/3] Tokenizer: sudachipy
#=> 100%|███████████████████| 5900/5900 [00:08<00:00, 696.68it/s]
print(report)
{'execution_environment': {'python_version': '3.7.8.final.0 (64 bit)',
  'arch': 'X86_64',
  'brand_raw': 'Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz',
  'count': 8},
 'data': {'number_of_sentences': 5900, 'average_length': 37.69593220338983},
 'janome': {'elapsed_time': 9.114670515060425},
 'nagisa': {'elapsed_time': 15.873093605041504},
 'sudachipy': {'elapsed_time': 9.05256724357605}}

# Compare the words segmented in tokenizers
text = "都庁所在地は新宿区。"
tokenizers.print_words(text, delimiter="|")
#=>        janome: 都庁|所在地|は|新宿|区|。
#=>        nagisa: 都庁|所在|地|は|新宿|区|。
#=>     sudachipy: 都庁|所在地|は|新宿区|。
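
The `elapsed_time` values in the report can be turned into a throughput figure, which is easier to compare across corpora of different sizes. The following is a minimal sketch, assuming the report has the structure printed above. `sentences_per_second` is a hypothetical helper, not part of toiro, and `sample_report` copies only the fields it needs from that output.

```python
def sentences_per_second(report):
    """Map each tokenizer in a report to its sentences-per-second rate."""
    n_sentences = report["data"]["number_of_sentences"]
    # These two keys describe the run itself, not a tokenizer.
    metadata_keys = {"execution_environment", "data"}
    return {
        name: n_sentences / entry["elapsed_time"]
        for name, entry in report.items()
        if name not in metadata_keys
    }


# Values copied from the report shown above (environment details shortened).
sample_report = {
    "execution_environment": {"python_version": "3.7.8.final.0 (64 bit)"},
    "data": {"number_of_sentences": 5900, "average_length": 37.69593220338983},
    "janome": {"elapsed_time": 9.114670515060425},
    "nagisa": {"elapsed_time": 15.873093605041504},
    "sudachipy": {"elapsed_time": 9.05256724357605},
}

rates = sentences_per_second(sample_report)
# Print the fastest tokenizer first.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.0f} sentences/s")
```

Timings depend heavily on the machine and the corpus, so these rates are only meaningful relative to each other within a single run.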

Run toiro in Docker

You can use all the tokenizers by running the Docker image published on Docker Hub.

docker run --rm -it taishii/toiro /bin/bash
How to run the Python interpreter in the Docker container

Run the Python interpreter.

root@cdd2ad2d7092:/workspace# python3

Compare the words segmented in tokenizers

>>> from toiro import tokenizers
>>> text = "都庁所在地は新宿区。"
>>> tokenizers.print_words(text, delimiter="|")
 mecab-python3: 都庁|所在地|は|新宿|区|。
        janome: 都庁|所在地|は|新宿|区|。
        nagisa: 都庁|所在|地|は|新宿|区|。
     sudachipy: 都庁|所在地|は|新宿区|。
         spacy: 都庁|所在|地|は|新宿|区|。
         ginza: 都庁|所在地|は|新宿区|。
         kytea: 都庁|所在|地|は|新宿|区|。
       jumanpp: 都庁|所在|地|は|新宿|区|。
 sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
fugashi-ipadic: 都庁|所在地|は|新宿|区|。
fugashi-unidic: 都庁|所在|地|は|新宿|区|。
 tinysegmenter: 都庁所|在地|は|新宿|区|。
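
One way to summarize where tokenizers disagree is to compare the character offsets at which each one splits the text. The following is a minimal sketch that uses the janome, nagisa, and sudachipy results from the Getting started section as literal strings. `boundaries` is a hypothetical helper, not part of toiro, and it assumes the delimiter does not occur in the text itself.

```python
def boundaries(segmented, delimiter="|"):
    """Return the character offsets at which a segmentation splits the text."""
    offsets = set()
    position = 0
    # The end of the last token is the end of the text, not a split point.
    for token in segmented.split(delimiter)[:-1]:
        position += len(token)
        offsets.add(position)
    return offsets


# Segmentations copied from the Getting started output.
sample = {
    "janome": "都庁|所在地|は|新宿|区|。",
    "nagisa": "都庁|所在|地|は|新宿|区|。",
    "sudachipy": "都庁|所在地|は|新宿区|。",
}

cuts = {name: boundaries(text) for name, text in sample.items()}
shared = set.intersection(*cuts.values())
disputed = set.union(*cuts.values()) - shared

print("agreed splits:  ", sorted(shared))
#=> agreed splits:   [2, 5, 6, 9]
print("disputed splits:", sorted(disputed))
#=> disputed splits: [4, 8]
```

Here offset 4 is the split inside 所在地 that only nagisa makes, and offset 8 is the split between 新宿 and 区 that sudachipy does not make.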

Get more information about toiro

The slides at PyCon JP 2020

Tutorials in Japanese

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Files for toiro, version 0.0.8

Filename                      Size      File type  Python version
toiro-0.0.8-py3-none-any.whl  628.0 kB  Wheel      py3
toiro-0.0.8.tar.gz            616.3 kB  Source     None