A rule-based sentence segmentation library.
Project description
cutters
Python bindings for the cutters library written in Rust.
🚧 This library is experimental. 🚧
Features
- Full UTF-8 support.
- Robust parsing.
- Language-specific rules (each language is defined by its own PEG).
- Fast and memory efficient parsing via the pest library.
- Sentences can contain quotes which can contain subsentences.
Supported languages
- Croatian (standard)
- English (standard)
There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode. It is intended for benchmarking.
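To make the contrast with the rule-based languages concrete, here is a minimal pure-Python sketch of baseline-style splitting. It is not the library's implementation: the function name `baseline_split` and the small hand-picked terminal set are assumptions made for illustration only.

```python
import re
from typing import List

# Assumption: a tiny hand-picked terminal set (., !, ?, …), used here only
# for illustration; it is not the set the library itself uses.
_CHUNK = re.compile(r"[^.!?…]+[.!?…]*")


def baseline_split(text: str) -> List[str]:
    """Cut the text after every run of terminal characters (hypothetical helper)."""
    chunks = (match.group().strip() for match in _CHUNK.finditer(text))
    return [chunk for chunk in chunks if chunk]


# The abbreviation "St." is cut off as its own "sentence", which is the kind
# of mistake language-specific rules are meant to avoid.
print(baseline_split("St. Louis je grad. Volim rock itd."))
```

A splitter like this knows nothing about abbreviations, ordinals, or quotes, so it is only useful as a point of comparison.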
Example
After installing the cutters package with pip, usage is simple (note that the language is specified with an ISO 639-1 two-letter language code).
import cutters
text = """
Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način."
"""
sentences = cutters.cut(text, "hr")
print(sentences)
This results in the following output (note that the str struct fields are &str).
[Sentence {
    str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
    quotes: [],
}, Sentence {
    str: "St. Louis 9LX je događaj u svijetu šaha.",
    quotes: [],
}, Sentence {
    str: "To je prof.dr.sc. Ivan Horvat.",
    quotes: [],
}, Sentence {
    str: "Volim rock, punk, funk, pop itd.",
    quotes: [],
}, Sentence {
    str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
    quotes: [
        Quote {
            str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
            sentences: [
                "Sve sretne obitelji nalik su jedna na drugu.",
                "Svaka nesretna obitelj nesretna je na svoj način.",
            ],
        },
    ],
}]
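The nesting shown above (sentences that hold quotes, which in turn hold subsentences) can be walked with a short recursive-style loop. The sketch below uses stand-in dataclasses, since the source does not say how the Python bindings expose these fields; the class definitions, the `text` attribute (standing in for the `str` field in the output), and the `flatten` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Quote:
    # Hypothetical stand-in for the Quote shown in the output.
    text: str  # corresponds to the `str` field above
    sentences: List[str] = field(default_factory=list)


@dataclass
class Sentence:
    # Hypothetical stand-in for the Sentence shown in the output.
    text: str  # corresponds to the `str` field above
    quotes: List[Quote] = field(default_factory=list)


def flatten(sentences: List[Sentence]) -> Iterator[str]:
    """Yield each top-level sentence, then the subsentences of its quotes."""
    for sentence in sentences:
        yield sentence.text
        for quote in sentence.quotes:
            yield from quote.sentences


example = [
    Sentence("Volim rock, punk, funk, pop itd."),
    Sentence(
        'Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. '
        'Svaka nesretna obitelj nesretna je na svoj način."',
        [
            Quote(
                "Sve sretne obitelji nalik su jedna na drugu. "
                "Svaka nesretna obitelj nesretna je na svoj način.",
                [
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            )
        ],
    ),
]

for line in flatten(example):
    print(line)
```

Keeping the quoted subsentences separate from the enclosing sentence lets a caller choose between the outer sentence as a whole and the finer-grained pieces inside the quote.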
Download files
Source Distribution
cutters-0.1.4.tar.gz (6.1 kB)
Built Distributions
- cutters-0.1.4-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- cutters-0.1.4-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- cutters-0.1.4-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- cutters-0.1.4-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- cutters-0.1.4-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- cutters-0.1.4-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- cutters-0.1.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- cutters-0.1.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- cutters-0.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- cutters-0.1.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- cutters-0.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- cutters-0.1.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- cutters-0.1.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- cutters-0.1.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- cutters-0.1.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- cutters-0.1.4-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- cutters-0.1.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- cutters-0.1.4-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl