Unicode Segment
Segment text with Unicode TR29-compliant segmenters.
Segmenters are available for grapheme clusters (GraphemeSegmenter), words (WordSegmenter), and sentences (SentenceSegmenter).
Segmenters pass all tests from the Unicode Character Database (UCD) test data files as of Unicode 17.0.0, but do not currently support locale-specific tailoring.
Usage
from unicode_segment import SentenceSegmenter

segmenter = SentenceSegmenter()

text = (
    "“How do you know I’m mad?”, said Alice. “You must be,” said the Cat, "
    "“or you wouldn’t have come here.”"
)
assert list(segmenter.segment(text)) == [
    (0, "“How do you know I’m mad?”, said Alice. "),
    (40, "“You must be,” said the Cat, “or you wouldn’t have come here.”"),
]

text = (
    "此次学术会议汇聚了来自世界各地的茂盛植物被精心布置在各个工位之间无缝穿梭,相互配合,确保珍稀物种的"
    "保存和景观似乎都闪烁着倒影。潮湿的泥土和新鲜雨水的抚慰,提醒我们注意自然循环中固有的平和节奏。随着大"
    "幕拉开,剧院的灯光迎接着夜晚的到来,郊区生活的平静和安宁。"
)
assert list(segmenter.segment(text)) == [
    (
        0,
        "此次学术会议汇聚了来自世界各地的茂盛植物被精心布置在各个工位之间无缝穿梭,相互配合,确保珍稀"
        "物种的保存和景观似乎都闪烁着倒影。",
    ),
    (
        63,
        "潮湿的泥土和新鲜雨水的抚慰,提醒我们注意自然循环中固有的平和节奏。",
    ),
    (
        96,
        "随着大幕拉开,剧院的灯光迎接着夜晚的到来,郊区生活的平静和安宁。",
    ),
]

text = (
    "ሻጮች በሚያቀርቡት ድምፅ ሆኖ አገልግሏል፣ ይህም በረዶው ከቀለጠ በኋላ ጎብኚዎች ምቹ ጎጆ ከውጭው ዓለም "
    "ሙሉ በሙሉ በወቅቱ አስማት ያበራ ይመስላል፣ በተጋገሩ ደስታዎች ላይ ተዘርግቷል ፣ የጨዋታ ድንኳኖች ተሞልተው "
    "እያንዳንዱ ክፍል የክህሎት እና ተግሣጽ፣ በስብሰባው ላይ አፅንዖት ይሰጣል፣ ያለፈውን፣ ሸራዎችን እያስተካከሉ "
    "እና ቅዝቃዜን ሰጥቷል። የተወሳሰቡ ቅርጻ ቅርጾች እና ከአዲስ የተጠበሰ ዳቦ መዓዛ ተሰብሳቢዎቹ እንዲዝናኑ "
    "ይጋብዛሉ። አየሩ በሳር፣ በሁሉም ዝርዝር ውስጥ ፣በፈጠራ እና በእውቀት መሬቱን ይንከባከባሉ, ዲዛይኑ "
    "በታማኝነት መፈጸሙን አረጋግጠዋል."
)
assert list(segmenter.segment(text)) == [
    (
        0,
        "ሻጮች በሚያቀርቡት ድምፅ ሆኖ አገልግሏል፣ ይህም በረዶው ከቀለጠ በኋላ ጎብኚዎች ምቹ ጎጆ ከውጭው "
        "ዓለም ሙሉ በሙሉ በወቅቱ አስማት ያበራ ይመስላል፣ በተጋገሩ ደስታዎች ላይ ተዘርግቷል ፣ የጨዋታ "
        "ድንኳኖች ተሞልተው እያንዳንዱ ክፍል የክህሎት እና ተግሣጽ፣ በስብሰባው ላይ አፅንዖት ይሰጣል፣ "
        "ያለፈውን፣ ሸራዎችን እያስተካከሉ እና ቅዝቃዜን ሰጥቷል። ",
    ),
    (
        219,
        "የተወሳሰቡ ቅርጻ ቅርጾች እና ከአዲስ የተጠበሰ ዳቦ መዓዛ ተሰብሳቢዎቹ እንዲዝናኑ ይጋብዛሉ። ",
    ),
    (
        278,
        "አየሩ በሳር፣ በሁሉም ዝርዝር ውስጥ ፣በፈጠራ እና በእውቀት መሬቱን ይንከባከባሉ, ዲዛይኑ በታማኝነት መፈጸሙን "
        "አረጋግጠዋል.",
    ),
]
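
The examples above show `segment()` yielding `(offset, text)` pairs, where each offset appears to be a character index into the original string. Assuming that reading is right, the sketch below turns such pairs into `(start, end)` spans and checks that they tile the input with no gaps or overlaps. The helper names `to_spans` and `covers_text` are hypothetical and not part of the package, and the sample pairs are written by hand in the same shape rather than produced by the library.

```python
from typing import Iterable, Iterator, Tuple

Segment = Tuple[int, str]


def to_spans(segments: Iterable[Segment]) -> Iterator[Tuple[int, int]]:
    """Convert (offset, text) pairs into half-open (start, end) spans."""
    for offset, piece in segments:
        yield offset, offset + len(piece)


def covers_text(text: str, segments: Iterable[Segment]) -> bool:
    """Return True if the segments tile `text` exactly, in order."""
    position = 0
    for offset, piece in segments:
        # Each segment must start where the previous one ended and
        # must match the corresponding slice of the original text.
        if offset != position or text[offset:offset + len(piece)] != piece:
            return False
        position = offset + len(piece)
    return position == len(text)


# Hand-written sample in the (offset, text) shape shown in the usage example.
sample_text = "Hello there. How are you?"
sample_segments = [(0, "Hello there. "), (13, "How are you?")]

spans = list(to_spans(sample_segments))
is_complete = covers_text(sample_text, sample_segments)
```

With a real segmenter, `sample_segments` would be replaced by `list(segmenter.segment(sample_text))`, provided the offsets are character indices as assumed here.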
Download files
File details
Details for the file unicode_segment-0.4.3.tar.gz.
File metadata
- Download URL: unicode_segment-0.4.3.tar.gz
- Upload date:
- Size: 20.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9d808060eb5df76249995ae66f95f3b9c1c09f7700bb4b15e54307f981822e74 |
| MD5 | eddf47627fc8a25488d9c47955ceb1f9 |
| BLAKE2b-256 | 482f8b6ba20c7665a7e768d450a53fd59abea6183a0778bc66aa0e0899e3bc0b |
File details
Details for the file unicode_segment-0.4.3-py3-none-any.whl.
File metadata
- Download URL: unicode_segment-0.4.3-py3-none-any.whl
- Upload date:
- Size: 15.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba168c8e16f058cf4b8760f350379e8d4bb2024ec2108451c1532761d34b7266 |
| MD5 | f4098365e1da97de7099927fa9519bba |
| BLAKE2b-256 | 11528f0277b2bd3b15ccca006a34555c9c437aa99ca2cdb964078f20d540d0de |
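
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; the helper names `sha256_of` and `matches_published` are hypothetical, and the path passed in is whatever location the file was saved to.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published("unicode_segment-0.4.3.tar.gz", "9d808060eb5df76249995ae66f95f3b9c1c09f7700bb4b15e54307f981822e74")` would return `True` only if the local file's digest matches the SHA256 value listed for the source distribution.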