Sentence boundary disambiguation tool for Japanese texts

Project description

Bunkai

Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.

Quick Start

$ pip install -U bunkai
$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。

Feed a document as one line by using ▁ (U+2581) for line breaks. The output shows sentence boundaries with │ (U+2502).
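The input and output conventions above can be handled with a few lines of Python. This is a minimal sketch that does not call Bunkai itself. The helper names `to_bunkai_line` and `split_output` are hypothetical and chosen for illustration; only the two marker characters come from the description above.

```python
from typing import List

LINE_BREAK = "\u2581"         # ▁ stands in for a line break in the input
SENTENCE_BOUNDARY = "\u2502"  # │ marks a sentence boundary in the output


def to_bunkai_line(document: str) -> str:
    """Flatten a multi-line document into one line, using U+2581 for line breaks."""
    return document.replace("\r\n", "\n").replace("\n", LINE_BREAK)


def split_output(line: str, restore_line_breaks: bool = False) -> List[str]:
    """Split one output line into sentences at each U+2502 marker."""
    sentences = [s for s in line.rstrip("\n").split(SENTENCE_BOUNDARY) if s]
    if restore_line_breaks:
        # Turn the U+2581 placeholders back into real line breaks.
        sentences = [s.replace(LINE_BREAK, "\n") for s in sentences]
    return sentences


if __name__ == "__main__":
    print(to_bunkai_line("2文書目の先頭行です。\n改行はU+2581で表現します。"))
    for sentence in split_output("2文書目の先頭行です。▁│改行はU+2581で表現します。"):
        print(sentence)
```

Keeping the placeholder in the split sentences by default preserves the information about where the original line breaks were; pass `restore_line_breaks=True` to get the text back in its original layout.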

To disambiguate sentence boundaries at line breaks, add the --model option with the path to the model directory. Before the first use, set up the model:

$ bunkai --model bunkai-model-directory --setup

Then pass the same directory with --model:

$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。
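The same command line can be driven from a script. The following is a rough sketch that assumes the `bunkai` executable is on `PATH` and that the model directory has already been set up as shown above. The function names `build_command` and `run_bunkai` are hypothetical; the only option used is the `--model` option from the commands above.

```python
import subprocess
from typing import List, Optional


def build_command(model_dir: Optional[str] = None) -> List[str]:
    """Build the argument list; --model is added only when a directory is given."""
    cmd = ["bunkai"]
    if model_dir is not None:
        cmd += ["--model", model_dir]
    return cmd


def run_bunkai(lines: List[str], model_dir: Optional[str] = None) -> List[str]:
    """Pipe one document per line into the CLI and return its output lines.

    Each input line must already use U+2581 in place of line breaks.
    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(
        build_command(model_dir),
        input="\n".join(lines) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.splitlines()


# Example call (requires bunkai and a prepared model directory):
# run_bunkai(["文の途中で改行を▁入れる文章ってありますよね▁それも対象です。"],
#            model_dir="bunkai-model-directory")
```

Passing the text on standard input mirrors the `echo ... | bunkai` usage, so no behavior beyond what the commands above show is assumed.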

For more information, see examples or documents.

References

Yuta Hayashibe and Kensuke Mitsuzawa. Sentence Boundary Detection on Line Breaks in Japanese. Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp. 71-75. November 2020.

License

Apache License 2.0

Release history

Version	Release date
1.5.7	Feb 9, 2023
1.5.5	Sep 15, 2022
1.5.4	Jul 19, 2022
1.5.2	Apr 14, 2022
1.5.1	Apr 14, 2022
1.5.0	Apr 13, 2022
1.4.5	Feb 2, 2022
1.4.4	Feb 2, 2022
1.4.3	Jul 28, 2021
1.4.2	Jul 27, 2021
1.4.1	Jul 15, 2021
1.4.0	Jul 9, 2021
1.3.0	Jun 1, 2021
1.2.0	May 31, 2021
1.1.2	Apr 26, 2021
1.1.1	Apr 26, 2021 (this version)
1.0.1	Apr 21, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bunkai-1.1.1.tar.gz (42.0 kB)

Uploaded Apr 26, 2021 Source

Built Distribution

bunkai-1.1.1-py3-none-any.whl (60.1 kB)

Uploaded Apr 26, 2021 Python 3

Hashes for bunkai-1.1.1.tar.gz
Algorithm	Hash digest
SHA256	`4ca4331236986e0d2b85ce9aa038e64c71e185cf74a4daf92d3eb89c20acab88`
MD5	`63593a8e6ab25cf972e0c21224115694`
BLAKE2b-256	`550cea60299bd621ead1727b2fab586c4b795d1edc353ad7381d5115e490a9f6`

Hashes for bunkai-1.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a1a73d4750d1a80c39490e277ebcd2388c9351529fdf7ff1c4967c386fae7939`
MD5	`8f36471f6fd00becfbdfd446715ee62e`
BLAKE2b-256	`472ae27c87947d521bf451b6b98f7ed1185dfcf777fbe5883f9da1aeaf98efe4`
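
A downloaded file can be checked against the digests listed above with Python's standard `hashlib` module. This is a minimal sketch: the function names `digests_of` and `matches_published` are hypothetical, and it assumes BLAKE2b-256 means BLAKE2b with a 32-byte digest. MD5 is left out because some Python builds disable it.

```python
import hashlib
from typing import Dict


def digests_of(data: bytes) -> Dict[str, str]:
    """Return hex digests keyed by the algorithm names used in the tables above."""
    return {
        "SHA256": hashlib.sha256(data).hexdigest(),
        # Assumption: BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        "BLAKE2b-256": hashlib.blake2b(data, digest_size=32).hexdigest(),
    }


def matches_published(path: str, published: Dict[str, str]) -> bool:
    """Compare a local file with published digests, e.g. {"SHA256": "..."}.

    Only the algorithms returned by digests_of are supported as keys.
    """
    with open(path, "rb") as f:
        actual = digests_of(f.read())
    return all(actual[name] == value.lower() for name, value in published.items())


# Example call (the file must have been downloaded first):
# matches_published(
#     "bunkai-1.1.1.tar.gz",
#     {"SHA256": "4ca4331236986e0d2b85ce9aa038e64c71e185cf74a4daf92d3eb89c20acab88"},
# )
```

The files listed here are small, so reading them fully into memory is fine; for large files, feeding the hash objects chunk by chunk would be the usual alternative.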