toiro

A comparison tool of Japanese tokenizers

Maintainers

taishi-i

Project description

Toiro is a comparison tool of Japanese tokenizers.

- Compare the processing speed of tokenizers
- Compare the words segmented in tokenizers
- Compare the performance of tokenizers by benchmarking application tasks (e.g., text classification)

It also provides useful functions for natural language processing in Japanese.

- Data downloader for Japanese text corpora
- Preprocessor of these corpora
- Text classifier for Japanese text (e.g., SVM, BERT)

Installation

Python 3.10+ is required. You can install toiro with the following command. Janome is included in the default installation.

pip install toiro

Adding a tokenizer to toiro

If you want to add a tokenizer to toiro, please install it individually. This is an example of adding SudachiPy and nagisa to toiro.

pip install sudachipy sudachidict_core
pip install nagisa

How to install other tokenizers

mecab-python3

pip install mecab-python3

GiNZA

pip install spacy ginza

spaCy

pip install spacy[ja]

KyTea

KyTea itself must be installed before the Python wrapper. Please refer to the KyTea installation instructions.

pip install kytea

Juman++ v2

Juman++ v2 itself must be installed before pyknp can use it. Please refer to the Juman++ v2 installation instructions.

pip install pyknp

SentencePiece

pip install sentencepiece

fugashi-ipadic

pip install fugashi ipadic

fugashi-unidic

pip install fugashi unidic-lite

tinysegmenter

pip install tinysegmenter3

tiktoken

pip install tiktoken

If you want to install all the tokenizers at once, please use the following command.

pip install toiro[all_tokenizers]
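After installing optional tokenizers, it can help to confirm that the packages are importable before running a comparison. The following is a minimal sketch using only the standard library. The helper name `is_importable` and the mapping from pip package names to import names are assumptions of this example and are not part of toiro.

```python
from importlib.util import find_spec

# Assumed mapping from pip package name to top-level import name.
# mecab-python3 is imported as "MeCab"; the others share their package name.
IMPORT_NAMES = {
    "janome": "janome",
    "nagisa": "nagisa",
    "sudachipy": "sudachipy",
    "mecab-python3": "MeCab",
    "sentencepiece": "sentencepiece",
    "fugashi": "fugashi",
    "tiktoken": "tiktoken",
}


def is_importable(module_name: str) -> bool:
    """Return True if the top-level module can be located without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    for package, module in IMPORT_NAMES.items():
        status = "installed" if is_importable(module) else "missing"
        print(f"{package:>15}: {status}")
```

This only checks that a module can be found. It does not verify that external binaries such as KyTea or Juman++ are present.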

Getting started

You can check the available tokenizers in your Python environment.

from toiro import tokenizers

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers)

Toiro supports 12 Japanese tokenizers and 1 BPE tokenizer. The output below is from an environment where SudachiPy and nagisa have been added to the default installation.

{'nagisa': {'is_available': True, 'version': '0.2.7'},
 'janome': {'is_available': True, 'version': '0.3.10'},
 'mecab-python3': {'is_available': False, 'version': False},
 'sudachipy': {'is_available': True, 'version': '0.4.9'},
 'spacy': {'is_available': False, 'version': False},
 'ginza': {'is_available': False, 'version': False},
 'kytea': {'is_available': False, 'version': False},
 'jumanpp': {'is_available': False, 'version': False},
 'sentencepiece': {'is_available': False, 'version': False},
 'fugashi-ipadic': {'is_available': False, 'version': False},
 'fugashi-unidic': {'is_available': False, 'version': False},
 'tinysegmenter': {'is_available': False, 'version': False},
 'tiktoken': {'is_available': False, 'version': False}}
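The dictionary above can be filtered to keep only the tokenizers that are ready to use. The sketch below works on a literal copy of part of that output, so it runs without toiro installed. The function name `installed_tokenizers` is hypothetical.

```python
# Abridged copy of the dictionary shape shown above (four of the entries).
available = {
    "nagisa": {"is_available": True, "version": "0.2.7"},
    "janome": {"is_available": True, "version": "0.3.10"},
    "mecab-python3": {"is_available": False, "version": False},
    "sudachipy": {"is_available": True, "version": "0.4.9"},
}


def installed_tokenizers(info: dict) -> dict:
    """Map each available tokenizer name to its version string."""
    return {
        name: entry["version"]
        for name, entry in info.items()
        if entry["is_available"]
    }


installed = installed_tokenizers(available)
print(installed)
```

In a real session, the value returned by `tokenizers.available_tokenizers()` would presumably be passed in place of the literal dictionary.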

Download the livedoor news corpus and compare the processing speed of tokenizers.

from toiro import tokenizers
from toiro import datadownloader

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
#=> ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']

# Download the livedoor news corpus and load it as pandas.DataFrame
corpus = corpora[0]
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
texts = train_df[1]

# Compare the processing speed of tokenizers
report = tokenizers.compare(texts)
#=> [1/3] Tokenizer: janome
#=> 100%|███████████████████| 5900/5900 [00:07<00:00, 746.21it/s]
#=> [2/3] Tokenizer: nagisa
#=> 100%|███████████████████| 5900/5900 [00:15<00:00, 370.83it/s]
#=> [3/3] Tokenizer: sudachipy
#=> 100%|███████████████████| 5900/5900 [00:08<00:00, 696.68it/s]
print(report)
{'execution_environment': {'python_version': '3.7.8.final.0 (64 bit)',
  'arch': 'X86_64',
  'brand_raw': 'Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz',
  'count': 8},
 'data': {'number_of_sentences': 5900, 'average_length': 37.69593220338983},
 'janome': {'elapsed_time': 9.114670515060425},
 'nagisa': {'elapsed_time': 15.873093605041504},
 'sudachipy': {'elapsed_time': 9.05256724357605}}
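Elapsed times are easier to compare as throughput. The sketch below derives sentences per second from a literal copy of the report above and ranks the tokenizers. The function name `sentences_per_second` is an assumption of this example, and the execution environment entry is omitted for brevity.

```python
# Copy of the relevant parts of the report printed above.
report = {
    "data": {"number_of_sentences": 5900, "average_length": 37.69593220338983},
    "janome": {"elapsed_time": 9.114670515060425},
    "nagisa": {"elapsed_time": 15.873093605041504},
    "sudachipy": {"elapsed_time": 9.05256724357605},
}


def sentences_per_second(report: dict) -> dict:
    """Compute throughput for every entry that carries an elapsed_time."""
    n_sentences = report["data"]["number_of_sentences"]
    return {
        name: n_sentences / entry["elapsed_time"]
        for name, entry in report.items()
        if isinstance(entry, dict) and "elapsed_time" in entry
    }


throughput = sentences_per_second(report)
# Fastest tokenizer first.
ranking = sorted(throughput, key=throughput.get, reverse=True)
for name in ranking:
    print(f"{name:>10}: {throughput[name]:.1f} sentences/s")
```

Throughput measured this way depends on the machine and on tokenizer versions, so the ranking should be read as specific to the environment recorded in the report.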

# Compare the words segmented in tokenizers
text = "都庁所在地は新宿区。"
tokenizers.print_words(text, delimiter="|")
#=>        janome: 都庁|所在地|は|新宿|区|。
#=>        nagisa: 都庁|所在|地|は|新宿|区|。
#=>     sudachipy: 都庁|所在地|は|新宿区|。
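To see exactly where the tokenizers disagree, the delimited strings can be converted to sets of boundary offsets. The following sketch uses the three output lines above as literal input. The helper name `boundaries` is hypothetical and not part of toiro.

```python
# Segmentations copied from the print_words output above.
outputs = {
    "janome": "都庁|所在地|は|新宿|区|。",
    "nagisa": "都庁|所在|地|は|新宿|区|。",
    "sudachipy": "都庁|所在地|は|新宿区|。",
}


def boundaries(segmented: str, delimiter: str = "|") -> set:
    """Return the character offsets at which a token ends."""
    offsets = set()
    position = 0
    for token in segmented.split(delimiter):
        position += len(token)
        offsets.add(position)
    return offsets


cuts = {name: boundaries(line) for name, line in outputs.items()}
shared = set.intersection(*cuts.values())      # boundaries all tokenizers agree on
disputed = set.union(*cuts.values()) - shared  # boundaries only some produce
print(sorted(shared), sorted(disputed))
```

For this sentence the disputed offsets fall inside 所在地 and 新宿区, which matches the visible differences between the three lines.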

Run toiro in Docker

You can use all tokenizers by running the Docker image published on Docker Hub.

docker run --rm -it taishii/toiro /bin/bash

How to run the Python interpreter in the Docker container

Run the Python interpreter.

root@cdd2ad2d7092:/workspace# python3

Compare the words segmented in tokenizers

>>> from toiro import tokenizers
>>> text = "都庁所在地は新宿区。"
>>> tokenizers.print_words(text, delimiter="|")
 mecab-python3: 都庁|所在地|は|新宿|区|。
        janome: 都庁|所在地|は|新宿|区|。
        nagisa: 都庁|所在|地|は|新宿|区|。
     sudachipy: 都庁|所在地|は|新宿区|。
         spacy: 都庁|所在|地|は|新宿|区|。
         ginza: 都庁|所在地|は|新宿区|。
         kytea: 都庁|所在|地|は|新宿|区|。
       jumanpp: 都庁|所在|地|は|新宿|区|。
 sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
fugashi-ipadic: 都庁|所在地|は|新宿|区|。
fugashi-unidic: 都庁|所在|地|は|新宿|区|。
 tinysegmenter: 都庁所|在地|は|新宿|区|。
 tiktoken_gpt4o: 都|�|�|所在地|は|新|宿|区|。
 tiktoken_gpt5: 都|�|�|所在地|は|新|宿|区|。
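The `�` characters in the tiktoken rows are probably not data corruption. tiktoken works on UTF-8 bytes, so a single Japanese character can be split across tokens, and a fragment decoded on its own is not valid UTF-8. The sketch below reproduces the effect with the standard library only. The split point is hypothetical, since the actual token boundaries are not shown in the output above.

```python
char = "庁"
raw = char.encode("utf-8")  # this character occupies three bytes in UTF-8

# Hypothetical split: a byte-level BPE vocabulary may cut the bytes elsewhere.
fragments = [raw[:2], raw[2:]]

# Decoding each fragment alone replaces the invalid bytes with U+FFFD.
decoded = [fragment.decode("utf-8", errors="replace") for fragment in fragments]
print("|".join(decoded))

# Joining the fragments before decoding restores the original character.
restored = b"".join(fragments).decode("utf-8")
print(restored)
```

When comparing byte-level tokenizers with word-level ones, token counts are therefore more meaningful than the printed token strings.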

Get more information about toiro

The slides at PyCon JP 2020

Tutorials in Japanese

Release history

This version

0.0.11

Nov 2, 2025

0.0.10

Oct 27, 2025

0.0.9

Jul 31, 2023

0.0.8

Nov 2, 2020

0.0.7

Sep 8, 2020

0.0.6

Aug 23, 2020

0.0.4

Aug 16, 2020

0.0.3

Aug 14, 2020

0.0.2

Aug 13, 2020

0.0.1

Aug 13, 2020

Download files

Source Distribution

toiro-0.0.11.tar.gz (623.1 kB view details)

Uploaded Nov 2, 2025 Source

Built Distribution

toiro-0.0.11-py3-none-any.whl (628.9 kB view details)

Uploaded Nov 2, 2025 Python 3

File details

Details for the file toiro-0.0.11.tar.gz.

File metadata

Download URL: toiro-0.0.11.tar.gz
Upload date: Nov 2, 2025
Size: 623.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for toiro-0.0.11.tar.gz
Algorithm	Hash digest
SHA256	`ab2c135fc6998799bc163119170e2c7dc879adaf1bfd9474b26921aaf2f359ee`
MD5	`7a97a01c254fa503cada25499cb90f78`
BLAKE2b-256	`6fb0259ef113b4c88286f09cf7449f970d81a03c1cc94c9daf936f05713a2bcc`

Provenance

The following attestation bundles were made for toiro-0.0.11.tar.gz:

Publisher: python-publish.yml on taishi-i/toiro

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: toiro-0.0.11.tar.gz
- Subject digest: ab2c135fc6998799bc163119170e2c7dc879adaf1bfd9474b26921aaf2f359ee
- Sigstore transparency entry: 661575991
- Sigstore integration time: Nov 2, 2025
Source repository:
- Permalink: taishi-i/toiro@1bcacd889a874933d5ed8f810ec352c12ffbc64a
- Branch / Tag: refs/tags/0.0.11
- Owner: https://github.com/taishi-i
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@1bcacd889a874933d5ed8f810ec352c12ffbc64a
- Trigger Event: release

File details

Details for the file toiro-0.0.11-py3-none-any.whl.

File metadata

Download URL: toiro-0.0.11-py3-none-any.whl
Upload date: Nov 2, 2025
Size: 628.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for toiro-0.0.11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0b637e981cf6da7427624c032e8557bc1aec72db7aede9cd477a3c897addea50`
MD5	`65be92ef0acdc871b18ef2365fd1b5dd`
BLAKE2b-256	`14ff28acb2ad1462d4593bc353f34dfdeda810fb40ba7f4fd44fbc3ddb2e9e3d`

Provenance

The following attestation bundles were made for toiro-0.0.11-py3-none-any.whl:

Publisher: python-publish.yml on taishi-i/toiro

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: toiro-0.0.11-py3-none-any.whl
- Subject digest: 0b637e981cf6da7427624c032e8557bc1aec72db7aede9cd477a3c897addea50
- Sigstore transparency entry: 661575997
- Sigstore integration time: Nov 2, 2025
Source repository:
- Permalink: taishi-i/toiro@1bcacd889a874933d5ed8f810ec352c12ffbc64a
- Branch / Tag: refs/tags/0.0.11
- Owner: https://github.com/taishi-i
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@1bcacd889a874933d5ed8f810ec352c12ffbc64a
- Trigger Event: release
