Sentence segmentation for Japanese text

Project description

Hasami

Hasami is a tool for performing sentence segmentation on Japanese text.

  • Beyond simply splitting on sentence-ending markers such as !?。, it treats a run of consecutive sentence-ending characters as a single sentence ending.
  • It will not split enclosed sentences, i.e. those in quotes or parentheses.
  • It can be configured with custom sentence-ending markers and enclosures in case the defaults don't cover your needs.
  • You can define exceptions for when not to split sentences.
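
The first two behaviours listed above can be illustrated with a small, self-contained sketch. This is not Hasami's implementation or API: the function name `sketch_segment`, the default marker string, and the enclosure table are all assumptions made for illustration, and the exception rules from the last bullet are not modelled.

```python
# Illustrative sketch only; this is NOT how hasami is implemented.
# The defaults below are assumptions, not hasami's actual defaults.
ASSUMED_ENDERS = "。！？!?"
ASSUMED_ENCLOSURES = {"「": "」", "『": "』", "（": "）", "(": ")"}


def sketch_segment(text, enders=ASSUMED_ENDERS, enclosures=ASSUMED_ENCLOSURES):
    """Split after each run of sentence enders, except inside enclosures."""
    ender_set = set(enders)
    sentences = []
    current = []
    stack = []  # closing characters we are still waiting for

    for i, ch in enumerate(text):
        current.append(ch)
        if stack and ch == stack[-1]:
            # The innermost open enclosure has just been closed.
            stack.pop()
        elif ch in enclosures:
            # An enclosure opens; remember which character closes it.
            stack.append(enclosures[ch])
        elif ch in ender_set and not stack:
            next_ch = text[i + 1] if i + 1 < len(text) else None
            # Split only when the run of enders is over, so "！？" ends one sentence.
            if next_ch not in ender_set:
                sentences.append("".join(current))
                current = []

    if current:
        # Keep any trailing text that has no sentence ender.
        sentences.append("".join(current))
    return sentences
```

With these assumed defaults, `sketch_segment("本当ですか！？はい。")` keeps `！？` together at the end of the first sentence, and a quoted passage such as `「行こう。急いで！」` is not split internally.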

Installation

pip install hasami

Usage

A simple command line interface is provided to use the functionality without having to write your own script. Input is read from stdin or from a file.

$ echo "これが最初の文。これは二番目の文。これが最後の文。" | tee input.txt | hasami
これが最初の文。
これは二番目の文。
これが最後の文。

$ hasami input.txt
これが最初の文。
これは二番目の文。
これが最後の文。

To use in your code:

import hasami

hasami.segment_sentences('これが最初の文。これは二番目の文。これが最後の文。')
# => ['これが最初の文。', 'これは二番目の文。', 'これが最後の文。']
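
For input that spans several lines, a thin wrapper can mimic the one-sentence-per-line output of the command line tool. The sketch below is an assumption about how one might do this, not part of Hasami: `segment_lines` is a hypothetical helper, and the segmenter is passed in as a parameter so the helper itself only relies on the standard library.

```python
from typing import Callable, Iterable, Iterator, List


def segment_lines(
    lines: Iterable[str],
    segment: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield sentences one by one from an iterable of text lines.

    `segment` is any function mapping a string to a list of sentences,
    for example hasami.segment_sentences.
    """
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines instead of passing them to the segmenter.
            continue
        yield from segment(line)


# Possible usage (assumes hasami is installed and input.txt exists):
#
#   import hasami
#   with open("input.txt", encoding="utf-8") as f:
#       for sentence in segment_lines(f, hasami.segment_sentences):
#           print(sentence)
```

Passing the segmenter as an argument keeps the helper easy to test with a stand-in function and avoids a hard dependency on any particular library.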

More complex examples will follow soon. In the meantime, please refer to the test cases.

License

Licensed under the BSD-3-Clause License

Download files

Source Distribution

hasami-0.0.1.tar.gz (4.3 kB)

Uploaded Source

Built Distribution

hasami-0.0.1-py3-none-any.whl (6.3 kB)

Uploaded Python 3
