
Rule-based sentence tokenizer for the Russian language

Project description

ru_sent_tokenize

A simple and fast rule-based sentence segmenter for Russian. Tested on the OpenCorpora and SynTagRus datasets.

Installation

pip install rusenttokenize

Running

>>> from rusenttokenize import ru_sent_tokenize
>>> ru_sent_tokenize('Эта шоколадка за 400р. ничего из себя не представляла. Артём решил больше не ходить в этот магазин')
['Эта шоколадка за 400р. ничего из себя не представляла.', 'Артём решил больше не ходить в этот магазин']

Metrics

The tokenizer has been tested on OpenCorpora and SynTagRus. Two metrics are reported.

Precision: single sentences were taken from the datasets, and we measured how often the tokenizer left them unsplit.

Recall: pairs of consecutive sentences were taken from the datasets and joined with a space character, and we measured how often the tokenizer correctly split the joined text back into the two original sentences. A rough sketch of this evaluation scheme follows.
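The following is a minimal sketch of how these two metrics could be computed, not the project's actual benchmark code (that lives in the notebook referenced below). It assumes `sentences` is a list of gold-standard sentences from a corpus and that a correct split must reproduce the two original sentences exactly.

>>> from rusenttokenize import ru_sent_tokenize
>>>
>>> def precision(sentences, tokenize=ru_sent_tokenize):
...     # Share of single gold sentences the tokenizer leaves unsplit.
...     kept = sum(1 for s in sentences if len(tokenize(s)) == 1)
...     return kept / len(sentences)
...
>>> def recall(sentences, tokenize=ru_sent_tokenize):
...     # Share of adjacent sentence pairs, joined by a space, that the
...     # tokenizer splits back into exactly the two original sentences.
...     pairs = list(zip(sentences, sentences[1:]))
...     correct = sum(1 for a, b in pairs if tokenize(a + ' ' + b) == [a, b])
...     return correct / len(pairs)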

tokenizer                                    OpenCorpora                      SynTagRus
                                             Precision  Recall  Time (sec)    Precision  Recall  Time (sec)
nltk.sent_tokenize                           94.30      86.06   8.67          98.15      94.95   5.07
nltk.sent_tokenize(x, language='russian')    95.53      88.37   8.54          98.44      95.45   5.68
bureaucratic-labs.segmentator.split          97.16      88.62   359           96.79      92.55   210
ru_sent_tokenize                             98.73      93.45   4.92          99.81      98.59   2.87

The notebook shows how the table above was calculated.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rusenttokenize-0.0.5.tar.gz (5.9 kB)

Uploaded Source

Built Distribution

rusenttokenize-0.0.5-py3-none-any.whl (10.5 kB)

Uploaded Python 3

File details

Details for the file rusenttokenize-0.0.5.tar.gz.

File metadata

File hashes

Hashes for rusenttokenize-0.0.5.tar.gz
Algorithm Hash digest
SHA256 b061b0ea40e880558dfe35a0040010c021007e1779517b25c8d47ba145c028c3
MD5 9058f7d375e4c18278c3733e8dd10100
BLAKE2b-256 6d761226e1ddc11ad492a191664a4926c607bcbf1e5b352134ca6f83c4af8205

See more details on using hashes here.
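As a quick illustration of checking a downloaded file against the SHA256 digest listed above (assuming the source distribution has been downloaded to the current directory), Python's standard hashlib can be used:

>>> import hashlib
>>>
>>> EXPECTED = "b061b0ea40e880558dfe35a0040010c021007e1779517b25c8d47ba145c028c3"
>>> with open("rusenttokenize-0.0.5.tar.gz", "rb") as f:
...     digest = hashlib.sha256(f.read()).hexdigest()
...
>>> print("OK" if digest == EXPECTED else "hash mismatch")
OK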

File details

Details for the file rusenttokenize-0.0.5-py3-none-any.whl.

File metadata

File hashes

Hashes for rusenttokenize-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 fcd604d6bc26334d46f87be1b0cd68022650c0a5dc613a39acf9d9da074d9f6b
MD5 0af470fc385d8a444f3dcae5dfb01561
BLAKE2b-256 254ca2f00be5def774a3df2e5387145f1cb54e324607ec4a7e23f573645946e7

See more details on using hashes here.
