dango

dango is an easy-to-use tokenizer for Japanese text, aimed at language learners and non-linguists.

$ echo "私は昨日映画を見ました" | dango
私  昨日 映画  見ました

When used as a library, it can also provide additional information, such as:

  • Dictionary form: For inflected words it can tell you the dictionary form for easier lookup.
  • Part-of-speech tagging: It can tell you if a word is a verb, noun, adjective, etc.
  • Reading: For words containing kanji it can give you the reading in hiragana.

Installation

$ pip install dango

One of the dependencies, SudachiDict-core, is about 70 MB in size, so it might take a while to download.

Usage

A simple CLI for tokenizing text is provided. Input is read from stdin or from a file.

$ echo "私は昨日映画を見ました" | tee input.txt | dango
私  昨日 映画  見ました

$ dango input.txt
私  昨日 映画  見ました

Usage as a library:

import dango

words = dango.tokenize('私は昨日映画を見ました')

print([w.surface for w in words])
# => ['私', 'は', '昨日', '映画', 'を', '見ました']

print(words[-1].part_of_speech)
# => VERB
print(words[-1].surface)
# => 見ました
print(words[-1].surface_reading)
# => みました
print(words[-1].dictionary_form)
# => 見る
print(words[-1].dictionary_form_reading)
# => みる
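
Since dango was written for bulk vocabulary extraction, the attributes above can be combined into a small deduplicated word list. The following is only a sketch: `vocabulary` is a hypothetical helper, not part of dango, and the `words` list is built from hand-written stand-in objects that mimic the attributes shown above, so the snippet runs without dango installed. The second stand-in entry (plain 見る) is an assumed value for illustration.

```python
from types import SimpleNamespace


def vocabulary(words):
    """Return unique (dictionary_form, reading) pairs in first-seen order.

    Hypothetical helper: it only relies on the `dictionary_form` and
    `dictionary_form_reading` attributes shown in the example above.
    """
    seen = {}
    for w in words:
        # setdefault keeps the first reading seen for each dictionary form
        seen.setdefault(w.dictionary_form, w.dictionary_form_reading)
    return list(seen.items())


# Hand-written stand-ins for dango word objects (assumed values, not real output).
words = [
    SimpleNamespace(surface="見ました", dictionary_form="見る", dictionary_form_reading="みる"),
    SimpleNamespace(surface="見る", dictionary_form="見る", dictionary_form_reading="みる"),
]

for form, reading in vocabulary(words):
    print(f"{form}\t{reading}")
```

With dango installed, the result of `dango.tokenize(text)` could presumably be passed to `vocabulary` in place of the stand-in list.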

Motivation & Acknowledgements

dango was created out of a need to extract vocabulary in bulk from Japanese texts to serve as learning materials.

While you can get quite far by using a morphological analyzer like MeCab directly, it segments text into much smaller units than you would want as a language learner. For example, 見た would be separated into 見 and た, which is a bit like separating watched into watch and ed.

dango uses SudachiPy for tokenization/analysis and adds some processing to aggregate the individual tokens into words and make the part-of-speech information a bit easier to digest.
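
To give a rough idea of what aggregating tokens into words can look like, here is a toy sketch. It is not dango's actual algorithm: the `Morpheme` class, the simplified `pos` tags, and the single merging rule (glue auxiliaries onto the preceding word) are all assumptions made for illustration, and the input list is hand-written rather than real analyzer output.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    surface: str
    pos: str  # hypothetical simplified tag: "noun", "particle", "verb", "aux"


def aggregate(morphemes):
    """Merge each auxiliary into the word before it (toy rule, assumed)."""
    words = []
    for m in morphemes:
        if m.pos == "aux" and words:
            # Extend the previous word's surface; keep its part of speech.
            prev = words[-1]
            words[-1] = Morpheme(prev.surface + m.surface, prev.pos)
        else:
            words.append(Morpheme(m.surface, m.pos))
    return words


# Hand-written fine-grained segmentation of 私は昨日映画を見ました (assumed).
morphemes = [
    Morpheme("私", "noun"),
    Morpheme("は", "particle"),
    Morpheme("昨日", "noun"),
    Morpheme("映画", "noun"),
    Morpheme("を", "particle"),
    Morpheme("見", "verb"),
    Morpheme("まし", "aux"),
    Morpheme("た", "aux"),
]

print([w.surface for w in aggregate(morphemes)])
# => ['私', 'は', '昨日', '映画', 'を', '見ました']
```

A real implementation would need many more rules than this one, for example for compound verbs and suffixes.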

dango takes some inspiration from Ve, which provides the text parsing of jisho.org.

License

Released under the BSD-3-Clause License.
