Sentence segmenter for Japanese text
ja_sentence_segmenter
Performs rule-based sentence segmentation on Japanese text.
Getting Started
Prerequisites
- Python 3.6+
Installing
pip install ja_sentence_segmenter
Usage
import functools
from ja_sentence_segmenter.common.pipeline import make_pipeline
from ja_sentence_segmenter.concatenate.simple_concatenator import concatenate_matching
from ja_sentence_segmenter.normalize.neologd_normalizer import normalize
from ja_sentence_segmenter.split.simple_splitter import split_newline, split_punctuation
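# Split after the sentence-ending marks 。, ! and ?.
split_punc2 = functools.partial(split_punctuation, punctuations=r"。!?")
# Rejoin a line that ends with "の" with the line that follows it, keeping the "の".
concat_tail_no = functools.partial(concatenate_matching, former_matching_rule=r"^(?P<result>.+)(の)$", remove_former_matched=False)
# Steps run in order: normalize, split on newlines, rejoin, then split on punctuation.
segmenter = make_pipeline(normalize, split_newline, concat_tail_no, split_punc2)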
# Golden Rule: Simple period to end sentence #001 (from https://github.com/diasks2/pragmatic_segmenter/blob/master/spec/pragmatic_segmenter/languages/japanese_spec.rb#L6)
text1 = "これはペンです。それはマーカーです。"
print(list(segmenter(text1)))
> ["これはペンです。", "それはマーカーです。"]
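To make the pipeline idea concrete, here is a minimal standard-library sketch of composing splitting steps. It does not use ja_sentence_segmenter and is not the library's implementation: the names `split_on_newline`, `split_on_punctuation`, and `compose` are hypothetical, and the sketch leaves out normalization and concatenation.

```python
import re
from typing import Callable, Iterable, Iterator

# Hypothetical type for one pipeline step: a stream of strings in, a stream out.
Step = Callable[[Iterable[str]], Iterable[str]]


def split_on_newline(lines: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty physical line of every input string."""
    for line in lines:
        for part in line.splitlines():
            part = part.strip()
            if part:
                yield part


def split_on_punctuation(lines: Iterable[str], punctuations: str = "。!?") -> Iterator[str]:
    """Yield sentences, cutting after each run of the given punctuation marks."""
    chars = re.escape(punctuations)
    # Either text followed by one or more end marks, or trailing text with no end mark.
    pattern = re.compile(r"[^{0}]*[{0}]+|[^{0}]+".format(chars))
    for line in lines:
        for match in pattern.finditer(line):
            sentence = match.group().strip()
            if sentence:
                yield sentence


def compose(*steps: Step) -> Callable[[str], Iterator[str]]:
    """Chain the steps so that each one consumes the previous step's output."""
    def run(text: str) -> Iterator[str]:
        stream: Iterable[str] = [text]
        for step in steps:
            stream = step(stream)
        return iter(stream)
    return run


sketch_segmenter = compose(split_on_newline, split_on_punctuation)
result = list(sketch_segmenter("これはペンです。それはマーカーです。\n本当ですか?はい。"))
print(result)
```

Each step is a generator over strings, so steps can be reordered or swapped without changing the others.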
Versioning
We use SemVer for versioning. For the versions available, see the tags on this repository.
Contributing
TODO
License
MIT
Acknowledgments
Text normalization
The text normalization code is based on the following mecab-ipadic-NEologd wiki page, with some modifications.
Thanks to hideaki-t and overlast, who provided the sample code.
https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
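As a rough illustration of what this kind of normalization does, the sketch below applies Unicode NFKC folding and collapses whitespace. The function name `normalize_sketch` is hypothetical, and this is a simplification: the NEologd-style rules linked above are more selective about which characters they fold, so this is not the package's `normalize`.

```python
import re
import unicodedata


def normalize_sketch(text: str) -> str:
    """Hypothetical simplification: NFKC folding plus whitespace cleanup."""
    # NFKC maps full-width ASCII to half-width and half-width katakana to full-width.
    folded = unicodedata.normalize("NFKC", text)
    # The ideographic space (U+3000) is already U+0020 after NFKC; collapse runs.
    return re.sub(r"[ \t]+", " ", folded).strip()


print(normalize_sketch("  Ｐｙｔｈｏｎ　３  "))
```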
Sentence segmentation rules
The sentence segmentation rules are based on the Japanese rules of Pragmatic Segmenter.
https://github.com/diasks2/pragmatic_segmenter#golden-rules-japanese
The test data used in Pragmatic Segmenter's test code is also used in this project's tests.
Thanks to the author, Kevin S. Dias, and the contributors.