Unicode Segment

Segment text with Unicode TR29-compliant segmenters.

Segmenters are currently available for grapheme clusters (GraphemeSegmenter), words (WordSegmenter), and sentences (SentenceSegmenter).

Segmenters pass all tests from the Unicode Character Database (UCD) test data files as of Unicode 17.0.0, but do not currently support locale-specific tailoring.
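To see why grapheme clusters need a dedicated segmenter, the following sketch uses only the Python standard library (not unicode_segment itself) to show that counting code points with len() does not match what a reader perceives as one character. The sample strings are illustrative choices, not taken from the package's tests.

```python
import unicodedata

# "é" written as a base letter plus a combining acute accent:
# two code points, but one grapheme cluster under TR29.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 2
print(len(composed))    # 1

# Normalization cannot merge every cluster. A flag is a pair of regional
# indicator symbols, and a family emoji is joined by zero-width joiners.
flag = "\U0001F1EF\U0001F1F5"
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(len(flag))    # 2
print(len(family))  # 5
```

Each of these strings is a single grapheme cluster under the TR29 rules, which is the unit a grapheme segmenter is meant to return.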

Usage

from unicode_segment import SentenceSegmenter

# segment() yields (start_offset, segment) pairs; start_offset indexes into the input string.
segmenter = SentenceSegmenter()

text = (
    "“How do you know I’m mad?”, said Alice. “You must be,” said the Cat, "
    "“or you wouldn’t have come here.”"
)
assert list(segmenter.segment(text)) == [
    (0, "“How do you know I’m mad?”, said Alice. "),
    (40, "“You must be,” said the Cat, “or you wouldn’t have come here.”"),
]

text = (
    "此次学术会议汇聚了来自世界各地的茂盛植物被精心布置在各个工位之间无缝穿梭,相互配合,确保珍稀物种的"
    "保存和景观似乎都闪烁着倒影。潮湿的泥土和新鲜雨水的抚慰,提醒我们注意自然循环中固有的平和节奏。随着大"
    "幕拉开,剧院的灯光迎接着夜晚的到来,郊区生活的平静和安宁。"
)
assert list(segmenter.segment(text)) == [
    (
        0,
        "此次学术会议汇聚了来自世界各地的茂盛植物被精心布置在各个工位之间无缝穿梭,相互配合,确保珍稀"
        "物种的保存和景观似乎都闪烁着倒影。",
    ),
    (
        63,
        "潮湿的泥土和新鲜雨水的抚慰,提醒我们注意自然循环中固有的平和节奏。",
    ),
    (
        96,
        "随着大幕拉开,剧院的灯光迎接着夜晚的到来,郊区生活的平静和安宁。",
    ),
]

text = (
    "ሻጮች በሚያቀርቡት ድምፅ ሆኖ አገልግሏል፣ ይህም በረዶው ከቀለጠ በኋላ ጎብኚዎች ምቹ ጎጆ ከውጭው ዓለም "
    "ሙሉ በሙሉ በወቅቱ አስማት ያበራ ይመስላል፣ በተጋገሩ ደስታዎች ላይ ተዘርግቷል ፣ የጨዋታ ድንኳኖች ተሞልተው "
    "እያንዳንዱ ክፍል የክህሎት እና ተግሣጽ፣ በስብሰባው ላይ አፅንዖት ይሰጣል፣ ያለፈውን፣ ሸራዎችን እያስተካከሉ "
    "እና ቅዝቃዜን ሰጥቷል። የተወሳሰቡ ቅርጻ ቅርጾች እና ከአዲስ የተጠበሰ ዳቦ መዓዛ ተሰብሳቢዎቹ እንዲዝናኑ "
    "ይጋብዛሉ። አየሩ በሳር፣ በሁሉም ዝርዝር ውስጥ ፣በፈጠራ እና በእውቀት መሬቱን ይንከባከባሉ, ዲዛይኑ "
    "በታማኝነት መፈጸሙን አረጋግጠዋል."
)
assert list(segmenter.segment(text)) == [
    (
        0,
        "ሻጮች በሚያቀርቡት ድምፅ ሆኖ አገልግሏል፣ ይህም በረዶው ከቀለጠ በኋላ ጎብኚዎች ምቹ ጎጆ ከውጭው "
        "ዓለም ሙሉ በሙሉ በወቅቱ አስማት ያበራ ይመስላል፣ በተጋገሩ ደስታዎች ላይ ተዘርግቷል ፣ የጨዋታ "
        "ድንኳኖች ተሞልተው እያንዳንዱ ክፍል የክህሎት እና ተግሣጽ፣ በስብሰባው ላይ አፅንዖት ይሰጣል፣ "
        "ያለፈውን፣ ሸራዎችን እያስተካከሉ እና ቅዝቃዜን ሰጥቷል። ",
    ),
    (
        219,
        "የተወሳሰቡ ቅርጻ ቅርጾች እና ከአዲስ የተጠበሰ ዳቦ መዓዛ ተሰብሳቢዎቹ እንዲዝናኑ ይጋብዛሉ። ",
    ),
    (
        278,
        "አየሩ በሳር፣ በሁሉም ዝርዝር ውስጥ ፣በፈጠራ እና በእውቀት መሬቱን ይንከባከባሉ, ዲዛይኑ በታማኝነት መፈጸሙን "
        "አረጋግጠዋል.",
    ),
]
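The assertions above suggest that each offset is an index into the input string and that consecutive segments cover the text without gaps or overlap. A minimal sketch of how that property could be checked follows; check_tiling is a hypothetical helper written for this illustration, not part of unicode_segment, and it is run here against the expected output of the first example rather than a live segmenter.

```python
def check_tiling(text, segments):
    """Return True if the (offset, segment) pairs cover `text` contiguously."""
    position = 0
    for offset, segment in segments:
        # Each segment must start where the previous one ended
        # and must match the text at that offset.
        if offset != position:
            return False
        if text[offset:offset + len(segment)] != segment:
            return False
        position = offset + len(segment)
    return position == len(text)


text = (
    "“How do you know I’m mad?”, said Alice. “You must be,” said the Cat, "
    "“or you wouldn’t have come here.”"
)
expected = [
    (0, "“How do you know I’m mad?”, said Alice. "),
    (40, "“You must be,” said the Cat, “or you wouldn’t have come here.”"),
]

assert check_tiling(text, expected)
# Joining the segments in order reproduces the original text.
assert "".join(segment for _, segment in expected) == text
```

If the property holds for a given segmenter, the same helper should accept list(segmenter.segment(text)) in place of the hard-coded list.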
