An easy to use tokenizer for Japanese text, aimed at language learners and non-linguists
dango
dango is an easy-to-use tokenizer for Japanese text, aimed at language learners and non-linguists.
$ echo "私は昨日映画を見ました" | dango
私 は 昨日 映画 を 見ました
If used as a library it can also provide you with additional information such as:
- Dictionary form: For inflected words it can tell you the dictionary form for easier lookup.
- Part-of-speech tagging: It can tell you if a word is a verb, noun, adjective, etc.
- Reading in hiragana for words containing kanji
Installation
$ pip install dango
One of the dependencies is SudachiDict-core, which might take a while to download due to its size of ~70MB.
Usage
A simple CLI for tokenizing text is provided. Input is read from stdin
or from a file.
$ echo "私は昨日映画を見ました" | tee input.txt | dango
私 は 昨日 映画 を 見ました
$ dango input.txt
私 は 昨日 映画 を 見ました
Usage as a library:
import dango
words = dango.tokenize('私は昨日映画を見ました')
print([w.surface for w in words])
# => ['私', 'は', '昨日', '映画', 'を', '見ました']
print(words[-1].part_of_speech)
# => VERB
print(words[-1].surface)
# => 見ました
print(words[-1].surface_reading)
# => みました
print(words[-1].dictionary_form)
# => 見る
print(words[-1].dictionary_form_reading)
# => みる
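The attributes shown above are enough to build a simple vocabulary list. The following is a minimal sketch, not part of dango itself: the `vocabulary` helper is a hypothetical name, and the fallback to `surface` when `dictionary_form` is empty is an assumption about how uninflected words behave. Stand-in objects are used so the sketch runs without dango installed; in real use you would pass the result of `dango.tokenize(text)`.

```python
from collections import Counter
from types import SimpleNamespace


def vocabulary(words):
    """Count dictionary forms in an iterable of word-like objects.

    Each object needs `surface` and `dictionary_form` attributes, like the
    words in the example above. Falling back to `surface` when
    `dictionary_form` is empty is an assumption made for this sketch.
    """
    return Counter(w.dictionary_form or w.surface for w in words)


# Hypothetical stand-ins for tokenized words; replace with
# dango.tokenize(text) when dango is installed.
sample = [
    SimpleNamespace(surface="見ました", dictionary_form="見る"),
    SimpleNamespace(surface="見た", dictionary_form="見る"),
    SimpleNamespace(surface="映画", dictionary_form="映画"),
]

counts = vocabulary(sample)
print(counts.most_common())
# => [('見る', 2), ('映画', 1)]
```

Counting by dictionary form rather than surface form groups inflected variants such as 見ました and 見た under the single entry 見る, which is usually what you want for dictionary lookup.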
Motivation & Acknowledgements
dango was created out of a need to extract vocabulary in bulk from Japanese
texts to serve as learning materials.
While you can get quite far by using a morphological analyzer like MeCab
directly, it segments text into much smaller units than a language learner
would want. For example, 見た would be separated into 見 and た, which is a
bit like separating watched into watch and ed.
dango uses SudachiPy for tokenization/analysis and adds some processing
to aggregate the individual tokens into words and make the part-of-speech
information a bit easier to digest.
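To illustrate what aggregating tokens into words can look like, here is a rough sketch. It is not dango's actual rule set: the `aggregate` function, the `(surface, tag)` tuples, and the coarse tags such as `AUX` are all hypothetical, and the only rule shown is attaching auxiliaries to a preceding verb or adjective.

```python
def aggregate(tokens):
    """Merge auxiliary tokens into the preceding verb or adjective.

    `tokens` is a list of (surface, tag) tuples. The tag names are
    hypothetical and chosen only for this illustration.
    """
    words = []
    for surface, tag in tokens:
        if tag == "AUX" and words and words[-1][1] in ("VERB", "ADJ"):
            # Glue the auxiliary onto the previous word, keeping its tag.
            prev_surface, prev_tag = words[-1]
            words[-1] = (prev_surface + surface, prev_tag)
        else:
            words.append((surface, tag))
    return words


# Fine-grained segmentation of the kind a morphological analyzer might
# produce (hand-written here, not real analyzer output).
tokens = [
    ("私", "PRON"), ("は", "PARTICLE"), ("昨日", "NOUN"),
    ("映画", "NOUN"), ("を", "PARTICLE"),
    ("見", "VERB"), ("まし", "AUX"), ("た", "AUX"),
]

print(" ".join(surface for surface, _ in aggregate(tokens)))
# => 私 は 昨日 映画 を 見ました
```

Because the merged word keeps the tag of its head, a chain of auxiliaries (まし followed by た) collapses into a single verb in one pass.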
dango takes some inspiration from Ve, which provides the text parsing of
jisho.org.
License
Released under the BSD-3-Clause License