cutters
A rule-based sentence segmentation library.
Python bindings for the cutters library written in Rust.
🚧 This library is experimental. 🚧
Features
- Full UTF-8 support.
- Robust parsing.
- Language-specific rules (each language defined by its own PEG).
- Fast and memory efficient parsing via the pest library.
- Sentences can contain quotes which can contain subsentences.
Supported languages
- Croatian (standard)
- English (standard)
There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode. It is intended for benchmarking.
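To illustrate why a terminal-only baseline is useful as a benchmark reference but not as a real segmenter, here is a minimal pure-Python sketch of that idea. It is not the library's implementation: the `TERMINALS` set is a small hand-picked assumption rather than the full Unicode list of sentence terminals, and `baseline_split` is a hypothetical helper name.

```python
import re

# Assumption: a tiny hand-picked subset of sentence-terminal characters,
# not the complete Unicode set.
TERMINALS = ".!?…。"

_CLASS = re.escape(TERMINALS)
# One chunk = a run of non-terminal characters followed by any terminals.
_CHUNK = re.compile(rf"[^{_CLASS}]+[{_CLASS}]*")


def baseline_split(text):
    """Split after every run of terminal characters, with no other rules."""
    chunks = (m.group().strip() for m in _CHUNK.finditer(text))
    return [c for c in chunks if c]


if __name__ == "__main__":
    print(baseline_split("To je prof.dr.sc. Ivan Horvat."))
```

Because the sketch knows nothing about abbreviations, it breaks "prof.dr.sc." into separate pieces, which is exactly the kind of error language-specific rules are meant to avoid.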
Example
After installing the cutters package with pip, usage is simple (note that the language is specified with an ISO 639-1 two-letter language code).
import cutters

text = """
Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način."
"""

# Segment the text using the Croatian rules ("hr" is the ISO 639-1 code).
sentences = cutters.cut(text, "hr")
print(sentences)
This results in the following output (note that the str struct fields are &str).
[Sentence {
    str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
    quotes: [],
}, Sentence {
    str: "St. Louis 9LX je događaj u svijetu šaha.",
    quotes: [],
}, Sentence {
    str: "To je prof.dr.sc. Ivan Horvat.",
    quotes: [],
}, Sentence {
    str: "Volim rock, punk, funk, pop itd.",
    quotes: [],
}, Sentence {
    str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
    quotes: [
        Quote {
            str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
            sentences: [
                "Sve sretne obitelji nalik su jedna na drugu.",
                "Svaka nesretna obitelj nesretna je na svoj način.",
            ],
        },
    ],
}]
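The nesting in this output (sentences hold quotes, quotes hold subsentences) can be modelled with plain Python dataclasses. The sketch below is only an illustration of that shape: the `Sentence` and `Quote` classes here are hypothetical stand-ins, the `text` attribute is an assumed name for the printed `str` field, and the objects actually returned by `cutters.cut` may expose their data differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Quote:
    text: str  # stands in for the printed `str` field
    sentences: List[str] = field(default_factory=list)


@dataclass
class Sentence:
    text: str  # stands in for the printed `str` field
    quotes: List[Quote] = field(default_factory=list)


def quoted_subsentences(sentences):
    """Collect every subsentence found inside any quote, in document order."""
    return [sub for s in sentences for q in s.quotes for sub in q.sentences]


# Hand-built data mirroring the last two sentences of the output above.
example = [
    Sentence("Volim rock, punk, funk, pop itd."),
    Sentence(
        'Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. '
        'Svaka nesretna obitelj nesretna je na svoj način."',
        quotes=[
            Quote(
                "Sve sretne obitelji nalik su jedna na drugu. "
                "Svaka nesretna obitelj nesretna je na svoj način.",
                sentences=[
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            )
        ],
    ),
]

if __name__ == "__main__":
    for sub in quoted_subsentences(example):
        print(sub)
```

Keeping quotes as a nested list rather than flattening them into the top-level result preserves the outer sentence boundary while still making the inner sentences available.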
Download files
Source Distribution
cutters-0.1.3.tar.gz (8.3 kB)
Built Distributions
Hashes for cutters-0.1.3-pp37-pypy37_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 94bdda60e4a464ab3e92062d8a7cf96ed03d69c04c1c95bff2e49ff25c1b09c3
MD5 | 1b1a6c2550bcf591ccf6ee83e278854d
BLAKE2b-256 | 279bd5c3799a69a5876479bd324f44f59e33352dbb4c0299096c0564aeb8abb7
Hashes for cutters-0.1.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 1a2297053fd0225929e04c239fed1e6a4870d168489a5a4db1c5f2b0904e0d7a
MD5 | 4eea4d58a770a7c558a9837e85513a93
BLAKE2b-256 | 374f21ed25dfc2632280eab5df5e29e600382494338df1c628da069b6dcb36f2
Hashes for cutters-0.1.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 7dbc213aa9279665af264edd4edc1065930d8f903a180c96277a5eafb1a0974d
MD5 | 7165cac85e611cf340cad5554a546d3e
BLAKE2b-256 | 2d5e9bc8f2ae6e098e37f3663596ddc555371b0b34d3fcc8e3abf9141f5a4ea8
Hashes for cutters-0.1.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | e0e38e369668f7da54ebdc3b00158214389a5b62cc3cd46f1837fd7edacf05e0
MD5 | c0761449886eabc078358a84ba42bfe2
BLAKE2b-256 | cab53fbaa4c0e973e1e535517f48030a7889b44106a75e526c1eebaeeff889b9
Hashes for cutters-0.1.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 4e50e1bae3fe17d3628d93caddee58a1c8b29fae5b9fda46e8441bcffb880553
MD5 | 08b0738a139eb29edd7dd2d9fa9d9773
BLAKE2b-256 | 808e5fbf587e704ac598726b07ad9d3a0b44475f085e42eb3268c7536730aced