A comparison tool of Japanese tokenizers

toiro

Toiro is a comparison tool of Japanese tokenizers.

  • Compare the processing speed of tokenizers
  • Compare how each tokenizer segments text into words
  • Compare the performance of tokenizers by benchmarking application tasks (e.g., text classification)

It also provides useful functions for natural language processing in Japanese.

  • Data downloader for Japanese text corpora
  • Preprocessor of these corpora
  • Text classifier for Japanese text (e.g., SVM, BERT)

Installation

Python 3.6+ is required. You can install toiro with the following command. Janome is included in the default installation.

pip install toiro

Adding a tokenizer to toiro

If you want to add a tokenizer to toiro, please install it individually. This is an example of adding SudachiPy and nagisa to toiro.

pip install sudachipy sudachidict_core
pip install nagisa

If you want to install all the tokenizers at once, please use the following command.

pip install toiro[all_tokenizers]

Getting started

You can check which tokenizers are available in your Python environment.

from toiro import tokenizers

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers)

Toiro supports 9 different Japanese tokenizers. The output below is from an environment where SudachiPy and nagisa were installed in addition to the default Janome.

{'nagisa': {'is_available': True, 'version': '0.2.7'},
 'janome': {'is_available': True, 'version': '0.3.10'},
 'mecab-python3': {'is_available': False, 'version': False},
 'sudachipy': {'is_available': True, 'version': '0.4.9'},
 'spacy': {'is_available': False, 'version': False},
 'ginza': {'is_available': False, 'version': False},
 'kytea': {'is_available': False, 'version': False},
 'jumanpp': {'is_available': False, 'version': False},
 'sentencepiece': {'is_available': False, 'version': False}}
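Because the result is a plain dictionary keyed by tokenizer name, it can be filtered with ordinary Python. The sketch below copies part of the output above into a literal so that it runs without toiro installed. The helper name `installed_tokenizers` is hypothetical and not part of toiro.

```python
# Sample data copied from the output above; in practice, pass the
# result of tokenizers.available_tokenizers() instead.
status = {
    "nagisa": {"is_available": True, "version": "0.2.7"},
    "janome": {"is_available": True, "version": "0.3.10"},
    "mecab-python3": {"is_available": False, "version": False},
    "sudachipy": {"is_available": True, "version": "0.4.9"},
}


def installed_tokenizers(status):
    """Return {name: version} for tokenizers flagged as available.

    Hypothetical helper for illustration; not part of toiro.
    """
    return {
        name: info["version"]
        for name, info in status.items()
        if info["is_available"]
    }


print(installed_tokenizers(status))
#=> {'nagisa': '0.2.7', 'janome': '0.3.10', 'sudachipy': '0.4.9'}
```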

Download the livedoor news corpus and compare the processing speed of tokenizers.

from toiro import tokenizers
from toiro import datadownloader

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
#=> ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']

# Download the livedoor news corpus and load it as pandas.DataFrame
corpus = corpora[0]
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
texts = train_df[1]

# Compare the processing speed of tokenizers
report = tokenizers.compare(texts)
#=> [1/3] Tokenizer: janome
#=> 100%|███████████████████| 5900/5900 [00:07<00:00, 746.21it/s]
#=> [2/3] Tokenizer: nagisa
#=> 100%|███████████████████| 5900/5900 [00:15<00:00, 370.83it/s]
#=> [3/3] Tokenizer: sudachipy
#=> 100%|███████████████████| 5900/5900 [00:08<00:00, 696.68it/s]
print(report)
{'execution_environment': {'python_version': '3.7.8.final.0 (64 bit)',
  'arch': 'X86_64',
  'brand_raw': 'Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz',
  'count': 8},
 'data': {'number_of_sentences': 5900, 'average_length': 37.69593220338983},
 'janome': {'elapsed_time': 9.114670515060425},
 'nagisa': {'elapsed_time': 15.873093605041504},
 'sudachipy': {'elapsed_time': 9.05256724357605}}

# Compare the words segmented in tokenizers
text = "都庁所在地は新宿区。"
tokenizers.print_words(text, delimiter="|")
#=>        janome: 都庁|所在地|は|新宿|区|。
#=>        nagisa: 都庁|所在|地|は|新宿|区|。
#=>     sudachipy: 都庁|所在地|は|新宿区|。
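
The elapsed times in the report can be turned into throughput figures for easier comparison. This is a minimal sketch that assumes the report keeps the layout printed above (a `data` entry with `number_of_sentences`, plus one entry per tokenizer with `elapsed_time`). The helper `rank_by_speed` is hypothetical and not part of toiro, and the literal is copied from the report above so the sketch runs without toiro installed.

```python
# Values copied from the report printed above; in practice, pass the
# dictionary returned by tokenizers.compare(texts).
report = {
    "execution_environment": {"python_version": "3.7.8.final.0 (64 bit)"},
    "data": {"number_of_sentences": 5900, "average_length": 37.69593220338983},
    "janome": {"elapsed_time": 9.114670515060425},
    "nagisa": {"elapsed_time": 15.873093605041504},
    "sudachipy": {"elapsed_time": 9.05256724357605},
}


def rank_by_speed(report):
    """Return (name, sentences_per_second) pairs, fastest first.

    Hypothetical helper; it treats every entry that has an
    'elapsed_time' key as a tokenizer result.
    """
    n = report["data"]["number_of_sentences"]
    rates = [
        (name, n / entry["elapsed_time"])
        for name, entry in report.items()
        if isinstance(entry, dict) and "elapsed_time" in entry
    ]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


for name, rate in rank_by_speed(report):
    print(f"{name:>10}: {rate:.0f} sentences/s")
```

With the values above, sudachipy and janome come out at roughly 650 sentences per second and nagisa at roughly 370.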

Run toiro in Docker

You can use all tokenizers by running the Docker image published on Docker Hub.

docker run --rm -it taishii/toiro /bin/bash

How to run the Python interpreter in the Docker container

Run the Python interpreter.

root@cdd2ad2d7092:/workspace# python3

Compare the words segmented in tokenizers

>>> from toiro import tokenizers
>>> text = "都庁所在地は新宿区。"
>>> tokenizers.print_words(text, delimiter="|")
mecab-python3: 都庁|所在地|は|新宿|区|。
       janome: 都庁|所在地|は|新宿|区|。
       nagisa: 都庁|所在|地|は|新宿|区|。
    sudachipy: 都庁|所在地|は|新宿区|。
        spacy: 都庁|所在|地|は|新宿|区|。
        ginza: 都庁|所在地|は|新宿区|。
        kytea: 都庁|所在|地|は|新宿|区|。
      jumanpp: 都庁|所在|地|は|新宿|区|。
sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
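
With nine tokenizers, it can help to group the ones that agree on a segmentation. The following is a minimal sketch, assuming the segmentations have been collected into a dictionary of delimiter-joined strings. toiro only prints them, so both the dictionary and the helper `group_by_segmentation` are hypothetical. The sample data is the three-tokenizer output from Getting started.

```python
from collections import defaultdict

# Segmentations copied from the Getting started output; the "|"
# delimiter matches the one passed to tokenizers.print_words.
segmentations = {
    "janome": "都庁|所在地|は|新宿|区|。",
    "nagisa": "都庁|所在|地|は|新宿|区|。",
    "sudachipy": "都庁|所在地|は|新宿区|。",
}


def group_by_segmentation(segmentations, delimiter="|"):
    """Map each distinct word sequence to the tokenizers producing it."""
    groups = defaultdict(list)
    for name, line in segmentations.items():
        groups[tuple(line.split(delimiter))].append(name)
    return dict(groups)


for words, names in group_by_segmentation(segmentations).items():
    print(f"{len(words)} words {list(words)}: {', '.join(names)}")
```

Tokenizers that end up under the same key split the sentence identically, so any differences between groups point to the word boundaries worth inspecting.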

Get more information about toiro

Tutorials

Tutorials in Japanese
