cutters
A rule-based sentence segmentation library.
Python bindings for the cutters library written in Rust.
🚧 This library is experimental. 🚧
Features
- Full UTF-8 support.
- Robust parsing.
- Language-specific rules (each defined by its own PEG).
- Fast and memory-efficient parsing via the pest library.
- Sentences can contain quotes which can contain subsentences.
Supported languages
- Croatian (standard)
- English (standard)
There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode. It is intended for benchmarking.
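To make the idea of a terminal-only baseline concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper name `naive_split` is hypothetical, and it assumes only `.`, `!` and `?` count as terminals, whereas the Baseline described above uses the broader set of sentence terminals.

```python
import re

# Assumption: only ".", "!" and "?" are treated as sentence terminals here.
# A piece is either text ending in a run of terminals, or trailing text
# that has no terminal at all.
_PIECE = re.compile(r"[^.!?]*[.!?]+|[^.!?]+$")


def naive_split(text):
    """Split after every run of terminals, with no abbreviation rules."""
    pieces = (match.group().strip() for match in _PIECE.finditer(text))
    return [piece for piece in pieces if piece]


# The abbreviation "St." is cut off as its own "sentence".
print(naive_split("St. Louis 9LX je događaj u svijetu šaha. Volim rock itd."))
```

A splitter like this breaks on abbreviations, ordinals and titles, which is the kind of case the language-specific rules are meant to handle.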
Example
After installing the cutters package with pip, usage is simple (note that the language is specified with an ISO 639-1 two-letter language code).
import cutters
text = """
Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način."
"""
sentences = cutters.cut(text, "hr")
print(sentences)
This results in the following output (note that the str struct fields are &str).
[Sentence {
    str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
    quotes: [],
}, Sentence {
    str: "St. Louis 9LX je događaj u svijetu šaha.",
    quotes: [],
}, Sentence {
    str: "To je prof.dr.sc. Ivan Horvat.",
    quotes: [],
}, Sentence {
    str: "Volim rock, punk, funk, pop itd.",
    quotes: [],
}, Sentence {
    str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
    quotes: [
        Quote {
            str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
            sentences: [
                "Sve sretne obitelji nalik su jedna na drugu.",
                "Svaka nesretna obitelj nesretna je na svoj način.",
            ],
        },
    ],
}]
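The nested shape of this output (sentences that hold quotes, which in turn hold subsentences) can be walked with a small helper. The sketch below uses hypothetical stand-in dataclasses that only mirror the printed structure; they are not the types exported by the cutters package, and the `str` field is renamed to `text` here to avoid shadowing Python's built-in `str`.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


# Hypothetical stand-ins mirroring the printed output; not the library's types.
@dataclass
class Quote:
    text: str
    sentences: List[str] = field(default_factory=list)


@dataclass
class Sentence:
    text: str
    quotes: List[Quote] = field(default_factory=list)


def flatten(sentences: List[Sentence]) -> Iterator[str]:
    """Yield each top-level sentence, then the subsentences of its quotes."""
    for sentence in sentences:
        yield sentence.text
        for quote in sentence.quotes:
            yield from quote.sentences


example = [
    Sentence("Volim rock, punk, funk, pop itd."),
    Sentence(
        'Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. '
        'Svaka nesretna obitelj nesretna je na svoj način."',
        quotes=[
            Quote(
                text="Sve sretne obitelji nalik su jedna na drugu. "
                "Svaka nesretna obitelj nesretna je na svoj način.",
                sentences=[
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            )
        ],
    ),
]

for line in flatten(example):
    print(line)
```

Keeping the quote's subsentences separate from the enclosing sentence lets a caller choose between treating a quoted passage as one unit or as several.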