
Sentence segmenter for Japanese text


ja_sentence_segmenter

Performs rule-based sentence segmentation on Japanese text.

Getting Started

Prerequisites

  • Python 3.6+

Installing

pip install ja_sentence_segmenter

Usage

import functools

from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation

split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
concat_tail_no = functools.partial(concatenate_matching, former_matching_rule=r"^(?P<result>.+)(の)$", remove_former_matched=False)
segmenter = make_pipeline(normalize, split_newline, concat_tail_no, split_punc2)

# Golden Rule: Simple period to end sentence #001 (from https://github.com/diasks2/pragmatic_segmenter/blob/master/spec/pragmatic_segmenter/languages/japanese_spec.rb#L6)
text1 = "これはペンです。それはマーカーです。"
print(list(segmenter(text1)))
> ["これはペンです。", "それはマーカーです。"]
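
The pipeline idea used above can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation: every `sketch_*` name is hypothetical, the normalization stage is omitted, and the concatenation stage hard-codes the "ends with の" rule instead of taking a regular expression. Each stage consumes an iterable of text fragments and yields new fragments, so stages can be chained in any order.

```python
import re
from functools import reduce
from typing import Callable, Iterable, Iterator

# A stage maps a stream of text fragments to a new stream of fragments.
Stage = Callable[[Iterable[str]], Iterator[str]]


def sketch_pipeline(*stages: Stage) -> Callable[[str], Iterator[str]]:
    """Chain stages left to right (hypothetical stand-in for make_pipeline)."""
    def run(text: str) -> Iterator[str]:
        return reduce(lambda frags, stage: stage(frags), stages, iter([text]))
    return run


def sketch_split_newline(fragments: Iterable[str]) -> Iterator[str]:
    """Split each fragment on line breaks and drop empty lines."""
    for fragment in fragments:
        for line in fragment.splitlines():
            if line.strip():
                yield line.strip()


def sketch_concat_tail_no(fragments: Iterable[str]) -> Iterator[str]:
    """Join a fragment that ends with の to the fragment that follows it."""
    pending = ""
    for fragment in fragments:
        pending += fragment
        if not pending.endswith("の"):
            yield pending
            pending = ""
    if pending:  # input ended on a fragment that ends with の
        yield pending


def sketch_split_punctuation(
    fragments: Iterable[str], punctuations: str = "。!?"
) -> Iterator[str]:
    """Cut after each run of sentence-final punctuation, keeping the marks."""
    chars = re.escape(punctuations)
    pattern = re.compile(f"[^{chars}]*[{chars}]+|[^{chars}]+")
    for fragment in fragments:
        yield from pattern.findall(fragment)


segment = sketch_pipeline(
    sketch_split_newline, sketch_concat_tail_no, sketch_split_punctuation
)
print(list(segment("これはペンです。それはマーカーです。")))
# ['これはペンです。', 'それはマーカーです。']
```

Running concatenation before punctuation splitting means a line break that falls right after の does not end a sentence: in this sketch, "私の\n本です。" comes out as the single sentence "私の本です。".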

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Contributing

TODO

License

MIT

Acknowledgments

Text normalization

The text normalization code is based on the mecab-ipadic-NEologd wiki page below, with some modifications.

Thanks to hideaki-t and overlast for providing the sample code.

https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
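
As a rough illustration of what this kind of normalization does, the sketch below applies Unicode NFKC normalization from the standard library. It is an assumption-laden simplification, not the normalizer shipped with this package: `sketch_normalize` is a hypothetical name, and the NEologd-style rules cover more cases (hyphens, tildes, whitespace) than plain NFKC does.

```python
import unicodedata


def sketch_normalize(text: str) -> str:
    """Fold full-width ASCII and half-width katakana to their usual forms."""
    return unicodedata.normalize("NFKC", text.strip())


print(sketch_normalize("ＡＢＣ１２３！？"))
# ABC123!?
```

Because full-width `！` and `？` become half-width `!` and `?` after a step like this, a punctuation splitter that runs afterwards only needs to list the half-width forms, while `。` is left unchanged.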

Sentence segmentation rules

The segmentation rules are based on the Japanese rules of Pragmatic Segmenter.

https://github.com/diasks2/pragmatic_segmenter#golden-rules-japanese

In addition, the test data used in the following test code is reused in this project's test code.

https://github.com/diasks2/pragmatic_segmenter/blob/master/spec/pragmatic_segmenter/languages/japanese_spec.rb

Thanks to the author, Kevin S. Dias, and the contributors.

