Unicode Segment

Segment text with segmenters that comply with Unicode TR29 (UAX #29, Unicode Text Segmentation).

Segmenters are available for grapheme clusters (GraphemeSegmenter), words (WordSegmenter), and sentences (SentenceSegmenter).

The segmenters pass all tests in the Unicode Character Database (UCD) test data files for Unicode 17.0.0. Locale-specific tailoring is not currently supported.
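For context, the UCD break test files (GraphemeBreakTest.txt, WordBreakTest.txt, SentenceBreakTest.txt) list each case as hexadecimal code points separated by ÷ (break) and × (no break), followed by a # comment. The sketch below parses one such line into the input text and its expected segments. `parse_break_test_line` is a hypothetical helper written for illustration. It is not part of unicode_segment and is not necessarily how the package runs its own tests, and the sample line is made up in the same format rather than copied from a UCD file.

```python
def parse_break_test_line(line: str) -> tuple[str, list[str]]:
    """Parse one UCD break-test line into (text, expected_segments).

    Hypothetical helper for illustration; not part of unicode_segment.
    """
    # Drop the trailing "# ..." comment and surrounding whitespace.
    data = line.split("#", 1)[0].strip()
    segments: list[str] = []
    current = ""
    for token in data.split():
        if token == "÷":  # boundary: close the segment built so far
            if current:
                segments.append(current)
                current = ""
        elif token == "×":  # no boundary: keep extending the segment
            continue
        else:  # a hexadecimal code point such as 0061
            current += chr(int(token, 16))
    if current:  # tolerate a line that lacks the final ÷
        segments.append(current)
    return "".join(segments), segments


# Illustrative line: "a" + COMBINING DIAERESIS stay together, then "b".
text, expected = parse_break_test_line("÷ 0061 × 0308 ÷ 0062 ÷\t# illustrative line")
assert text == "a\u0308b"
assert expected == ["a\u0308", "b"]
```

The parsed pair can then be compared with the substrings a segmenter produces for the same text.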

Usage

from unicode_segment import SentenceSegmenter

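# segment() yields (offset, substring) pairs, where offset is the index
# in the input string at which the substring starts.
segmenter = SentenceSegmenter()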

text = (
    "“How do you know I’m mad?”, said Alice. “You must be,” said the Cat, "
    "“or you wouldn’t have come here.”"
)
assert list(segmenter.segment(text)) == [
    (0, "“How do you know I’m mad?”, said Alice. "),
    (40, "“You must be,” said the Cat, “or you wouldn’t have come here.”"),
]

text = (
    "此次学术会议汇聚了来自世界各地的茂盛植物被精心布置在各个工位之间无缝穿梭,相互配合,确保珍稀物种的"
    "保存和景观似乎都闪烁着倒影。潮湿的泥土和新鲜雨水的抚慰,提醒我们注意自然循环中固有的平和节奏。随着大"
    "幕拉开,剧院的灯光迎接着夜晚的到来,郊区生活的平静和安宁。"
)
assert list(segmenter.segment(text)) == [
    (
        0,
        "此次学术会议汇聚了来自世界各地的茂盛植物被精心布置在各个工位之间无缝穿梭,相互配合,确保珍稀"
        "物种的保存和景观似乎都闪烁着倒影。",
    ),
    (
        63,
        "潮湿的泥土和新鲜雨水的抚慰,提醒我们注意自然循环中固有的平和节奏。",
    ),
    (
        96,
        "随着大幕拉开,剧院的灯光迎接着夜晚的到来,郊区生活的平静和安宁。",
    ),
]

text = (
    "ሻጮች በሚያቀርቡት ድምፅ ሆኖ አገልግሏል፣ ይህም በረዶው ከቀለጠ በኋላ ጎብኚዎች ምቹ ጎጆ ከውጭው ዓለም "
    "ሙሉ በሙሉ በወቅቱ አስማት ያበራ ይመስላል፣ በተጋገሩ ደስታዎች ላይ ተዘርግቷል ፣ የጨዋታ ድንኳኖች ተሞልተው "
    "እያንዳንዱ ክፍል የክህሎት እና ተግሣጽ፣ በስብሰባው ላይ አፅንዖት ይሰጣል፣ ያለፈውን፣ ሸራዎችን እያስተካከሉ "
    "እና ቅዝቃዜን ሰጥቷል። የተወሳሰቡ ቅርጻ ቅርጾች እና ከአዲስ የተጠበሰ ዳቦ መዓዛ ተሰብሳቢዎቹ እንዲዝናኑ "
    "ይጋብዛሉ። አየሩ በሳር፣ በሁሉም ዝርዝር ውስጥ ፣በፈጠራ እና በእውቀት መሬቱን ይንከባከባሉ, ዲዛይኑ "
    "በታማኝነት መፈጸሙን አረጋግጠዋል."
)
assert list(segmenter.segment(text)) == [
    (
        0,
        "ሻጮች በሚያቀርቡት ድምፅ ሆኖ አገልግሏል፣ ይህም በረዶው ከቀለጠ በኋላ ጎብኚዎች ምቹ ጎጆ ከውጭው "
        "ዓለም ሙሉ በሙሉ በወቅቱ አስማት ያበራ ይመስላል፣ በተጋገሩ ደስታዎች ላይ ተዘርግቷል ፣ የጨዋታ "
        "ድንኳኖች ተሞልተው እያንዳንዱ ክፍል የክህሎት እና ተግሣጽ፣ በስብሰባው ላይ አፅንዖት ይሰጣል፣ "
        "ያለፈውን፣ ሸራዎችን እያስተካከሉ እና ቅዝቃዜን ሰጥቷል። ",
    ),
    (
        219,
        "የተወሳሰቡ ቅርጻ ቅርጾች እና ከአዲስ የተጠበሰ ዳቦ መዓዛ ተሰብሳቢዎቹ እንዲዝናኑ ይጋብዛሉ። ",
    ),
    (
        278,
        "አየሩ በሳር፣ በሁሉም ዝርዝር ውስጥ ፣በፈጠራ እና በእውቀት መሬቱን ይንከባከባሉ, ዲዛይኑ በታማኝነት መፈጸሙን "
        "አረጋግጠዋል.",
    ),
]
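As the assertions above show, `segment()` yields `(offset, substring)` pairs in which each offset is the index in `text` where the substring starts. A minimal sketch of a consistency check built on that observation follows. `check_segments` is a hypothetical helper, not part of the package, and the pairs are copied from the first example rather than produced by calling the library.

```python
def check_segments(text: str, segments: list[tuple[int, str]]) -> bool:
    """Return True if the (offset, substring) pairs tile `text` exactly.

    Hypothetical helper for illustration; not part of unicode_segment.
    """
    position = 0
    for offset, substring in segments:
        # Each segment must start where the previous one ended...
        if offset != position:
            return False
        # ...and must match the slice of the text at that offset.
        if text[offset:offset + len(substring)] != substring:
            return False
        position += len(substring)
    # No text may be left over after the last segment.
    return position == len(text)


text = (
    "“How do you know I’m mad?”, said Alice. “You must be,” said the Cat, "
    "“or you wouldn’t have come here.”"
)
segments = [
    (0, "“How do you know I’m mad?”, said Alice. "),
    (40, "“You must be,” said the Cat, “or you wouldn’t have come here.”"),
]
assert check_segments(text, segments)
```

A check like this can be useful in tests, because TR29 boundaries partition the text without dropping or duplicating characters.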
