Skip to main content

Sentence boundary disambiguation tool for Japanese texts

Project description

Bunkai

PyPI version Python Versions License

CircleCI Maintainability Test Coverage markdownlint jsonlint yamllint

Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.

Quick Start

$ pip install -U bunkai
$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。

Feed a document as one line by using (U+2581) for line breaks. The output shows sentence boundaries with (U+2502).

If you want to disambiguate sentence boundaries for line breaks, please add a --model option with the path to the model. First time, please setup a model.

$ bunkai --model bunkai-model-directory --setup

Then, please designate the directory.

$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。

For more information, see examples or documents.

References

  • Yuta Hayashibe and Kensuke Mitsuzawa. Sentence Boundary Detection on Line Breaks in Japanese. Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp.71-75. November 2020. [PDF] [bib]

License

Apache License 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bunkai-1.1.1.tar.gz (42.0 kB view hashes)

Uploaded Source

Built Distribution

bunkai-1.1.1-py3-none-any.whl (60.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page