
Sentence segmenter that supports ~300 languages


A sentence segmentation library with wide language support, optimized for speed and utility.

Approach

The basic rules are simple:

  • A period ends a sentence.
  • If the token preceding the period is in a hand-compiled list of abbreviations, it does not end a sentence.

However, the sentence-ending character is not a period in many languages. The library therefore uses a list of known punctuation marks that can cause a sentence break, covering as many languages as possible.

We also collect a list of known, popular abbreviations in as many languages as possible.
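
As a rough illustration only (not the library's actual code), the combination of a terminator list and an abbreviation list can be sketched as follows; the TERMINATORS and ABBREVIATIONS sets below are hypothetical samples:

# Illustrative sketch of the rule-based idea above, not the library's code.
# Hypothetical sample data: a few terminators and abbreviations.
TERMINATORS = {".", "!", "?", "。", "।"}  # includes CJK full stop and Devanagari danda
ABBREVIATIONS = {"Dr.", "Mr.", "U.S.", "etc."}

def naive_segment(text):
    """Yield sentences, breaking after a terminator unless the token
    ending at that terminator is a known abbreviation."""
    start = 0
    for i, ch in enumerate(text):
        if ch in TERMINATORS:
            tokens = text[start:i + 1].split()
            token = tokens[-1] if tokens else ""
            if token in ABBREVIATIONS:
                continue  # e.g. "Dr." does not end a sentence
            yield text[start:i + 1]
            start = i + 1
    if start < len(text):
        yield text[start:]  # trailing text without a terminator

print(list(naive_segment("Dr. Smith arrived. He was late!")))
# ['Dr. Smith arrived.', ' He was late!']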

Sometimes it is very hard to get the segmentation correct. In such cases this library is opinionated: it prefers not segmenting over segmenting incorrectly. If two sentences accidentally stay together, that is acceptable; it is better than a sentence being split in the middle. The library avoids over-engineering in pursuit of 100% linguistic accuracy.

This approach is suitable for applications such as text-to-speech and machine translation.

Consider this example: "We make a good team, you and I. Did you see Albert I. Jones yesterday?"

The accurate segmentation of this text is ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"].

However, achieving this level of precision requires adding complex rules, which can create side effects. Instead, simply not segmenting between "I." and "Did" is acceptable for most downstream applications.

The sentence segmentation in this library is non-destructive: if the segmented sentences are concatenated, the original text can be reconstructed. Line breaks, punctuation, and whitespace are preserved in the output.
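
For example, the non-destructive property implies a round-trip check like the one below, using the segment function shown in the Usage section that follows. This is a usage sketch based on the claim above; if concatenation reconstructs the input as described, it prints True:

from sentencex import segment

text = "We make a good team, you and I. Did you see Albert I. Jones yesterday?"
sentences = list(segment("en", text))

# If whitespace and punctuation are preserved as described, concatenating
# the sentences should reconstruct the input exactly.
print("".join(sentences) == text)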

Usage

Install the library using:

pip install sentencex

Then, any text can be segmented as follows.

from sentencex import segment

text = """
The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.
"""
print(list(segment("en", text)))

The first argument is the language code and the second is the text to segment. The segment function returns an iterator over the identified sentences.
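
Because an iterator is returned, sentences can also be consumed lazily, for example:

from sentencex import segment

# Sentences are produced one at a time; no list needs to be materialized.
for sentence in segment("en", "Hello world. How are you?"):
    print(sentence)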

Language support

The aim is to support all languages that have a Wikipedia. Instead of falling back on English for languages not defined in the library, a fallback chain is used: the closest related language that is defined in the library will be used. Fallbacks for ~244 languages are defined.
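
As a rough illustration of how such a fallback chain could be resolved (a sketch with hypothetical FALLBACKS and DEFINED tables, not the library's internal data):

# Illustrative sketch of fallback-chain resolution, not the library's code.
FALLBACKS = {
    "gsw": ["de"],  # e.g. Swiss German falls back to German (hypothetical entry)
    "frp": ["fr"],  # e.g. Franco-Provençal falls back to French (hypothetical entry)
}
DEFINED = {"en", "de", "fr"}  # languages with their own rules (hypothetical)

def resolve(language, final_fallback="en"):
    """Walk the fallback chain until a defined language is found."""
    seen = set()
    queue = [language]
    while queue:
        lang = queue.pop(0)
        if lang in DEFINED:
            return lang
        if lang not in seen:
            seen.add(lang)
            queue.extend(FALLBACKS.get(lang, []))
    return final_fallback

print(resolve("gsw"))  # -> 'de'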

Performance

Measured on the Golden Rule Set (GRS) for English. List items (e.g. "1. sentence 2. another sentence") are exempted.

The following libraries are used for benchmarking:

Tokenizer Library        English Golden Rule Set score    Speed in seconds (avg over 100 runs)
sentencex_segment        74.36                              0.93
mwtokenizer_tokenize     30.77                              1.54
blingfire_tokenize       89.74                              0.27
nltk_tokenize            66.67                              1.86
pysbd_tokenize           97.44                             10.57
spacy_tokenize           61.54                              2.45
spacy_dep_tokenize       74.36                            138.93
stanza_tokenize          87.18                            107.51
syntok_tokenize          79.49                              4.72
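
A minimal sketch of how a comparable timing could be reproduced for sentencex (illustrative only; the benchmark corpus and harness used for the table above are not shown here, and the sample text is arbitrary):

import timeit

from sentencex import segment

text = "The quick brown fox ran. It jumped over the lazy dog! Was it fast?"

# Average wall time per run over 100 runs, mirroring the table's setup.
avg = timeit.timeit(lambda: list(segment("en", text)), number=100) / 100
print(f"{avg:.6f} seconds per run")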


License

MIT license. See License.txt

