
Japanese kanji disambiguation


yomikata


Yomikata uses context to resolve ambiguous words in Japanese. Check out the interactive demo!

Yomikata supports 130 ambiguous forms and reaches a global accuracy of 94%. See the demo page for detailed performance information.

Yomikata follows the approach of Sato et al. 2022 by fine-tuning the Tohoku group's Japanese BERT transformer to classify words into different readings based on the sentence context. A similar approach was used in English by Nicolis et al. 2021.

Yomikata recognizes ~50% more heteronyms than Sato et al. by adding support for words that are not in the original BERT vocabulary. It also expands the original training data (Aozora Bunko and NDL titles) to include the core BCCWJ corpus and the KWDLC corpus.

Usage

from yomikata.dbert import dBert
reader = dBert()
reader.furigana('そして、畳の表は、すでに幾年前に換えられたのか分らなかった。')
# => そして、畳の{表/おもて}は、すでに幾年前に換えられたのか分らなかった。

This example sentence, from the short story When I Was Looking for a Room to Let (1923) by Mimei Ogawa, contains the very common heteronym 表, which admits the readings omote (surface) and hyō (table). Yomikata's dBert (disambiguation BERT) correctly determines that in this sentence it refers to the surface of a tatami mat and should be read omote.
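The `{surface/reading}` markers in the output above are plain text, so they can be post-processed with a regular expression. The following is a minimal sketch, not part of Yomikata: the helper names `extract_readings`, `to_ruby_html`, and `strip_annotations` are hypothetical, and the pattern assumes that annotations are never nested and that the input contains no HTML special characters.

```python
import re

# Matches one annotation of the form {surface/reading}, e.g. {表/おもて}.
# Assumption: annotations are not nested and contain exactly one slash.
ANNOTATION = re.compile(r"\{([^{}/]+)/([^{}/]+)\}")


def extract_readings(annotated: str) -> list[tuple[str, str]]:
    """Return (surface, reading) pairs in order of appearance."""
    return ANNOTATION.findall(annotated)


def to_ruby_html(annotated: str) -> str:
    """Replace each {surface/reading} marker with an HTML <ruby> element."""
    return ANNOTATION.sub(r"<ruby>\1<rt>\2</rt></ruby>", annotated)


def strip_annotations(annotated: str) -> str:
    """Drop the readings and recover the unannotated sentence."""
    return ANNOTATION.sub(r"\1", annotated)
```

For example, `extract_readings("畳の{表/おもて}は")` yields the single pair `("表", "おもて")`, and `strip_annotations` returns the original text `畳の表は`.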

The furigana function outputs the sentence with the heteronym annotated. Readings for the other words can be obtained with a simple dictionary lookup.

from yomikata.dictionary import Dictionary
dictreader = Dictionary() # defaults to unidic.
dictreader.furigana("そして、畳の{表/おもて}は、すでに幾年前に換えられたのか分らなかった。")
# => そして、{畳/たたみ}の{表/おもて}は、すでに{幾年/いくねん}{前/まえ}に{換/か}えられたのか{分/わ}らなかった。

Without Yomikata, the dictionary outputs the wrong reading for the heteronym.
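The two steps shown above can be chained into one helper. This is a sketch under stated assumptions: `annotate`, `annotate_with_yomikata`, and the `FuriganaReader` protocol are hypothetical names introduced here, and the only library calls used are the `furigana` methods of `dBert` and `Dictionary` from the examples above. It also assumes, as in the dictionary example, that the second reader leaves existing `{surface/reading}` markers untouched.

```python
from typing import Protocol


class FuriganaReader(Protocol):
    """Anything exposing furigana(text) -> annotated text (hypothetical protocol)."""

    def furigana(self, text: str) -> str: ...


def annotate(
    text: str,
    heteronym_reader: FuriganaReader,
    dictionary_reader: FuriganaReader,
) -> str:
    """Resolve heteronyms first, then fill in the remaining readings."""
    # Step 1: the context-aware reader annotates only the ambiguous words.
    partially_annotated = heteronym_reader.furigana(text)
    # Step 2: the dictionary reader annotates the remaining words.
    return dictionary_reader.furigana(partially_annotated)


def annotate_with_yomikata(text: str) -> str:
    """Convenience wrapper; needs yomikata installed and its weights downloaded."""
    # Imported lazily so the module loads even where yomikata is not installed.
    from yomikata.dbert import dBert
    from yomikata.dictionary import Dictionary

    return annotate(text, dBert(), Dictionary())
```

Passing the readers in as arguments lets a caller build them once and reuse them across many sentences, rather than constructing a new reader for every call.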

Installation

pip install yomikata
python -m yomikata download

The second command is necessary to download the model weights, which are too large to host on PyPI.

Inference should work fine on CPU.

For details on data processing and training, see the main notebook.
