
sentsplit

A flexible sentence segmentation library using CRF model and regex rules

This library splits text paragraphs into sentences. It is built with the following desiderata:

  • Be able to extend to new languages or "types" of sentences from data alone by learning a conditional random field (CRF) model.
  • Also provide functionality to segment (or not to segment) lines based on regular expression rules (referred to as segment_regexes and prevent_regexes, respectively).
  • Be able to reconstruct the exact original text paragraphs by joining the segmented sentences.

All in all, the library aims to benefit from the best of both worlds: data-driven and rule-based approaches.

Installation

Supports Python 3.6+

# stable
pip install sentsplit

# bleeding-edge
pip install git+https://github.com/zaemyung/sentsplit

Uses python-crfsuite, which, in turn, is built upon CRFsuite.

Segmentation

CLI

$ sentsplit segment -l lang_code -i /path/to/input_file  # outputs to /path/to/input_file.segment
$ sentsplit segment -l lang_code -i /path/to/input_file -o /path/to/output_file

$ sentsplit segment -h  # prints out the detailed usage

Python Library

from sentsplit.segment import SentSplit

# use default setting
sent_splitter = SentSplit(lang_code)

# override default setting - see "Features" for detail
sent_splitter = SentSplit(lang_code, **overriding_kwargs)

# segment a single line
sentences = sent_splitter.segment(line)

# a list of lines can also be segmented at once
sentences = sent_splitter.segment(list_of_lines)
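
Per the third desideratum above, joining the returned sentences reconstructs the original input exactly (as long as strip_spaces is off). A minimal round-trip sketch; the 'en' lang_code, the sample text, and the shown output are illustrative assumptions:

from sentsplit.segment import SentSplit

sent_splitter = SentSplit('en')  # assumes an English model is available
line = 'Hello world. This is sentsplit.'
sentences = sent_splitter.segment(line)
# e.g. ['Hello world.', ' This is sentsplit.']  (illustrative output)
assert ''.join(sentences) == line  # exact reconstruction when strip_spaces is False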

Features

The behavior of segmentation can be adjusted by the following arguments:

  • mincut: a line is not segmented if its character-level length is smaller than mincut, preventing overly short sentences.
  • maxcut: a line is segmented if its character-level length is greater than or equal to maxcut, preventing overly long sentences.
  • strip_spaces: strip leading and trailing whitespace from each sentence; when enabled, exact reconstruction of the original passages is no longer guaranteed.
  • handle_multiple_spaces: substitute runs of multiple spaces with a single space, perform segmentation, then restore the original spaces.
  • segment_regexes: segment at either the start or the end index of the group matched by the regex patterns.
  • prevent_regexes: a line is not segmented at characters that fall within the group(s) matched by the regex patterns.
  • prevent_word_split: a line is not segmented at characters inside a word, where word boundaries are marked by surrounding white spaces or punctuation; may not be suitable for languages (e.g. Chinese, Japanese, Thai) that do not use spaces to delimit words.
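
These arguments can be overridden when constructing SentSplit (see "Python Library" above). A minimal sketch, assuming an 'en' model and illustrative values:

from sentsplit.segment import SentSplit

sent_splitter = SentSplit(
    'en',
    mincut=10,           # suppress sentences shorter than 10 characters
    maxcut=500,          # force a cut once a line reaches 500 characters
    strip_spaces=False,  # keep surrounding spaces so the original text can be reconstructed
)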

Segmentation is performed by first applying the trained CRF model to a line, labelling each character as either O or EOS; an EOS label marks a position for segmentation. The regex rules above are then applied on top of the model's predictions.

Note that prevent_regexes is applied after segment_regexes, meaning that the segmentation positions captured by segment_regexes can be overridden by prevent_regexes.
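
To illustrate the interplay, here is a hedged sketch; the exact shape of a rule entry (a dict with name, regex, and at keys) is an assumption, so consult base_config in config.py for the actual format:

from sentsplit.segment import SentSplit

# hypothetical rule format: each rule names a regex; 'at' picks the start or
# end index of the match as the segmentation position
segment_regexes = [
    {'name': 'newline', 'regex': r'\n+', 'at': 'end'},  # cut after runs of newlines
]
prevent_regexes = [
    {'name': 'ellipsis', 'regex': r'\.\.\.'},  # never cut inside an ellipsis
]

# prevent_regexes runs after segment_regexes, so a cut placed by the former
# can be suppressed by the latter
sent_splitter = SentSplit('en', segment_regexes=segment_regexes, prevent_regexes=prevent_regexes)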

Creating a New SentSplit Model

Creating a new model involves first training a CRF model on a dataset of clean sentences, followed by (optionally) adding or modifying the feature arguments for better performance.

Training a CRF Model

First, prepare a corpus file in which each line is a single clean sentence. Then, a CRF model can be trained by running:

sentsplit train -l lang_code -c corpus_file_path  # outputs to {corpus_file_path}.{lang_code}-{ngram}-gram-{YearMonthDate}.model

sentsplit train -h  # prints out the detailed usage

The following arguments control the training:

  • ngram: maximum n-gram length used as CRF features; default is 5.
  • crf_max_iteration: maximum number of CRF training iterations; default is 50.
  • sample_min_length: when preparing an input sample for the CRF model, gold sentences are concatenated to form a longer sample whose length is greater than sample_min_length; default is 450.
  • add_depunctuated_samples: when set to True, the punctuation of the current sentence is randomly (30% chance) removed before concatenation; default is False. This may only be suitable for languages (e.g. Korean, Japanese) that have distinctive sentence endings.
  • add_despaced_samples: when set to True, the current sentence is concatenated to the input sample without a preceding white space with a 35% chance; default is False.
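
A hypothetical invocation combining these arguments; the flag spellings below are an assumption, so run sentsplit train -h for the actual names:

# flag names are illustrative, not verified against the CLI
sentsplit train -l ko -c corpus.ko.txt --ngram 5 --crf_max_iteration 50 --add_depunctuated_samples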

Setting Configuration

Refer to base_config in config.py. Append a new config to the file, adjusting the arguments as needed.
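
For illustration, a new entry might copy base_config and swap in the new model path; this shape (and the import path) is an assumption, so check config.py for the actual structure:

from copy import deepcopy

from sentsplit.config import base_config  # assumed import path

# hypothetical config entry, assuming base_config is a plain dict
my_lang_config = deepcopy(base_config)
my_lang_config['model'] = './path/to/new_model'  # path produced by `sentsplit train`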

A newly created model can also be used directly in code by passing the kwargs accordingly:

from sentsplit.segment import SentSplit

sent_splitter = SentSplit(lang_code, model='path/to/model', ...)

