Rule-based sentence tokenizer for the Russian language
Project description
ru_sent_tokenize
A simple and fast rule-based sentence segmenter, tested on the OpenCorpora and SynTagRus datasets.
Installation
pip install rusenttokenize
Running
>>> from rusenttokenize import ru_sent_tokenize
>>> ru_sent_tokenize('Эта шоколадка за 400р. ничего из себя не представляла. Артём решил больше не ходить в этот магазин')
['Эта шоколадка за 400р. ничего из себя не представляла.', 'Артём решил больше не ходить в этот магазин']
Metrics
The tokenizer has been tested on OpenCorpora and SynTagRus using two metrics.
Precision: we took single sentences from the datasets and measured how often the tokenizer did not split them.
Recall: we took pairs of consecutive sentences from the datasets, joined each pair with a space character, and measured how often the tokenizer correctly split the result back into the two original sentences.
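The two metrics can be sketched in a few lines. Below is a hypothetical outline of the protocol, not the actual evaluation code: the function names are illustrative, `tokenize` stands for any sentence tokenizer (e.g. `ru_sent_tokenize`), and `sentences` for a list of gold single sentences from OpenCorpora or SynTagRus.

```python
def precision(sentences, tokenize):
    # Share of single gold sentences that the tokenizer leaves unsplit.
    kept = sum(1 for s in sentences if len(tokenize(s)) == 1)
    return 100.0 * kept / len(sentences)

def recall(sentences, tokenize):
    # Join each pair of consecutive gold sentences with a space and count
    # how often the tokenizer splits the result back into the original two.
    pairs = list(zip(sentences, sentences[1:]))
    ok = sum(1 for a, b in pairs if tokenize(a + " " + b) == [a, b])
    return 100.0 * ok / len(pairs)
```

A perfect tokenizer scores 100 on both; a tokenizer that never splits gets perfect precision but zero recall, which is why both numbers are reported together.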
| tokenizer | OpenCorpora Precision | OpenCorpora Recall | OpenCorpora Time (sec) | SynTagRus Precision | SynTagRus Recall | SynTagRus Time (sec) |
|---|---|---|---|---|---|---|
| nltk.sent_tokenize | 94.30 | 86.06 | 8.67 | 98.15 | 94.95 | 5.07 |
| nltk.sent_tokenize(x, language='russian') | 95.53 | 88.37 | 8.54 | 98.44 | 95.45 | 5.68 |
| bureaucratic-labs.segmentator.split | 97.16 | 88.62 | 359 | 96.79 | 92.55 | 210 |
| ru_sent_tokenize | 98.73 | 93.45 | 4.92 | 99.81 | 98.59 | 2.87 |
The notebook shows how the table above was calculated.
Download files
Source Distribution
Built Distribution
File details
Details for the file rusenttokenize-0.0.5.tar.gz.
File metadata
- Download URL: rusenttokenize-0.0.5.tar.gz
- Upload date:
- Size: 5.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b061b0ea40e880558dfe35a0040010c021007e1779517b25c8d47ba145c028c3 |
| MD5 | 9058f7d375e4c18278c3733e8dd10100 |
| BLAKE2b-256 | 6d761226e1ddc11ad492a191664a4926c607bcbf1e5b352134ca6f83c4af8205 |
File details
Details for the file rusenttokenize-0.0.5-py3-none-any.whl.
File metadata
- Download URL: rusenttokenize-0.0.5-py3-none-any.whl
- Upload date:
- Size: 10.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcd604d6bc26334d46f87be1b0cd68022650c0a5dc613a39acf9d9da074d9f6b |
| MD5 | 0af470fc385d8a444f3dcae5dfb01561 |
| BLAKE2b-256 | 254ca2f00be5def774a3df2e5387145f1cb54e324607ec4a7e23f573645946e7 |