dango

dango is an easy-to-use tokenizer for Japanese text, aimed at language learners and non-linguists.

$ echo "私は昨日映画を見ました" | dango
私 は 昨日 映画 を 見ました

When used as a library, it can also provide additional information such as:

  • Dictionary form: For inflected words it can tell you the dictionary form for easier lookup.
  • Part-of-speech tagging: It can tell you if a word is a verb, noun, adjective, etc.
  • Reading in hiragana for words containing kanji

Installation

$ pip install dango

One of the dependencies is SudachiDict-core, which might take a while to download due to its size of ~70MB.

Usage

A simple CLI for tokenizing text is provided. Input is read from stdin or from a file.

$ echo "私は昨日映画を見ました" | tee input.txt | dango
私 は 昨日 映画 を 見ました

$ dango input.txt
私 は 昨日 映画 を 見ました

Usage as a library:

import dango

words = dango.tokenize('私は昨日映画を見ました')

print([w.surface for w in words])
# => ['私', 'は', '昨日', '映画', 'を', '見ました']

print(words[-1].part_of_speech)
# => VERB
print(words[-1].surface)
# => 見ました
print(words[-1].surface_reading)
# => みました
print(words[-1].dictionary_form)
# => 見る
print(words[-1].dictionary_form_reading)
# => みる
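
The attributes shown above are enough to build a small vocabulary list. The following is a minimal sketch, not part of dango itself: `extract_vocabulary` and its `keep_pos` parameter are hypothetical names, and the only part-of-speech tag confirmed by the output above is `VERB` (the `NOUN` tag in the stand-in data is an assumption). The stand-in objects let the sketch run without dango installed; with dango available you would pass the result of `dango.tokenize(text)` instead.

```python
from collections import Counter
from types import SimpleNamespace


def extract_vocabulary(words, keep_pos=("VERB",)):
    """Count dictionary forms of words whose part of speech is in keep_pos.

    `words` is any iterable of objects exposing part_of_speech,
    dictionary_form and dictionary_form_reading, as in the example above.
    """
    counts = Counter()
    for w in words:
        # str() matches what print() shows for the tag (e.g. "VERB").
        if str(w.part_of_speech) in keep_pos:
            counts[(w.dictionary_form, w.dictionary_form_reading)] += 1
    return counts


# Stand-in data mirroring the documented attributes.
# The "NOUN" tag is an assumption, not a confirmed dango tag name.
sample = [
    SimpleNamespace(surface="映画", part_of_speech="NOUN",
                    dictionary_form="映画", dictionary_form_reading="えいが"),
    SimpleNamespace(surface="見ました", part_of_speech="VERB",
                    dictionary_form="見る", dictionary_form_reading="みる"),
    SimpleNamespace(surface="見た", part_of_speech="VERB",
                    dictionary_form="見る", dictionary_form_reading="みる"),
]

vocab = extract_vocabulary(sample)
for (form, reading), n in vocab.most_common():
    print(form, reading, n)
# => 見る みる 2
```

Counting by dictionary form rather than surface form means that 見ました and 見た end up in the same entry, which is usually what you want when looking words up.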

Motivation & Acknowledgements

dango was created out of a need to extract vocabulary in bulk from Japanese texts to serve as learning materials.

While you can get quite far by using a morphological analyzer like MeCab directly, it segments text into much smaller units than is useful when you are trying to learn the language. For example, 見た would be separated into 見 and た, which is a bit like separating watched into watch and ed.

dango uses SudachiPy for tokenization/analysis and adds some processing to aggregate the individual tokens into words and make the part-of-speech information a bit easier to digest.
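
To illustrate the idea of aggregating small tokens into words, here is a toy sketch. It is not dango's actual implementation and does not call SudachiPy: the `aggregate` function, the `(surface, tag)` tuples, and the tag names (`VERB`, `AUX`, and so on) are all hypothetical, and the single merge rule shown is far simpler than what real text needs.

```python
from typing import List, Tuple

# (surface, coarse tag); the tag names are hypothetical placeholders.
Morpheme = Tuple[str, str]


def aggregate(morphemes: List[Morpheme]) -> List[Morpheme]:
    """Glue auxiliary morphemes onto the verb they directly follow."""
    words: List[Morpheme] = []
    for surface, tag in morphemes:
        if tag == "AUX" and words and words[-1][1] == "VERB":
            # Extend the previous verb instead of starting a new word.
            prev_surface, prev_tag = words[-1]
            words[-1] = (prev_surface + surface, prev_tag)
        else:
            words.append((surface, tag))
    return words


# Hand-written fine-grained segmentation, as an analyzer might produce it.
fine_grained = [
    ("私", "PRON"), ("は", "PARTICLE"), ("昨日", "NOUN"), ("映画", "NOUN"),
    ("を", "PARTICLE"), ("見", "VERB"), ("まし", "AUX"), ("た", "AUX"),
]

merged = aggregate(fine_grained)
print([surface for surface, _ in merged])
# => ['私', 'は', '昨日', '映画', 'を', '見ました']
```

Because the merged word keeps the `VERB` tag, consecutive auxiliaries (まし, then た) are absorbed one after another into the same word.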

dango takes some inspiration from Ve, which provides the text parsing of jisho.org.

License

Released under the BSD-3-Clause License
