Unicode Segment

Segment text with Unicode TR29-compliant segmenters.

Segmenters are available for grapheme clusters (GraphemeSegmenter), words (WordSegmenter), and sentences (SentenceSegmenter).

Segmenters pass all tests from the Unicode Character Database (UCD) test data files as of Unicode 17.0.0, but do not currently support locale-specific tailoring.
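To illustrate why grapheme-cluster segmentation is useful at all, the sketch below uses only the Python standard library (it does not call unicode_segment) to show that a single user-perceived character can span several code points. The sample strings are illustrative choices, not taken from this project.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one character
# but is stored as two code points.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# MAN + ZERO WIDTH JOINER + WOMAN + ZERO WIDTH JOINER + GIRL: one emoji
# on screen, five code points in the string.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# Two regional indicator symbols (D, E) that together form one flag.
flag = "\U0001F1E9\U0001F1EA"

# len() counts code points, not grapheme clusters.
code_point_counts = {
    "decomposed": len(decomposed),
    "composed": len(composed),
    "family": len(family),
    "flag": len(flag),
}
print(code_point_counts)
```

Because len() and plain indexing work on code points, slicing such strings by index can split a visible character in two. A grapheme segmenter is meant to avoid that by reporting boundaries between whole clusters.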

Usage

from unicode_segment import SentenceSegmenter

segmenter = SentenceSegmenter()

text = (
    "“How do you know I’m mad?”, said Alice. “You must be,” said the Cat, "
    "“or you wouldn’t have come here.”"
)
assert list(segmenter.segment(text)) == [
    (0, "“How do you know I’m mad?”, said Alice. "),
    (40, "“You must be,” said the Cat, “or you wouldn’t have come here.”"),
]

text = (
    "此次学术会议汇聚了来自世界各地的茂盛植物被精心布置在各个工位之间无缝穿梭,相互配合,确保珍稀物种的"
    "保存和景观似乎都闪烁着倒影。潮湿的泥土和新鲜雨水的抚慰,提醒我们注意自然循环中固有的平和节奏。随着大"
    "幕拉开,剧院的灯光迎接着夜晚的到来,郊区生活的平静和安宁。"
)
assert list(segmenter.segment(text)) == [
    (
        0,
        "此次学术会议汇聚了来自世界各地的茂盛植物被精心布置在各个工位之间无缝穿梭,相互配合,确保珍稀"
        "物种的保存和景观似乎都闪烁着倒影。",
    ),
    (
        63,
        "潮湿的泥土和新鲜雨水的抚慰,提醒我们注意自然循环中固有的平和节奏。",
    ),
    (
        96,
        "随着大幕拉开,剧院的灯光迎接着夜晚的到来,郊区生活的平静和安宁。",
    ),
]

text = (
    "ሻጮች በሚያቀርቡት ድምፅ ሆኖ አገልግሏል፣ ይህም በረዶው ከቀለጠ በኋላ ጎብኚዎች ምቹ ጎጆ ከውጭው ዓለም "
    "ሙሉ በሙሉ በወቅቱ አስማት ያበራ ይመስላል፣ በተጋገሩ ደስታዎች ላይ ተዘርግቷል ፣ የጨዋታ ድንኳኖች ተሞልተው "
    "እያንዳንዱ ክፍል የክህሎት እና ተግሣጽ፣ በስብሰባው ላይ አፅንዖት ይሰጣል፣ ያለፈውን፣ ሸራዎችን እያስተካከሉ "
    "እና ቅዝቃዜን ሰጥቷል። የተወሳሰቡ ቅርጻ ቅርጾች እና ከአዲስ የተጠበሰ ዳቦ መዓዛ ተሰብሳቢዎቹ እንዲዝናኑ "
    "ይጋብዛሉ። አየሩ በሳር፣ በሁሉም ዝርዝር ውስጥ ፣በፈጠራ እና በእውቀት መሬቱን ይንከባከባሉ, ዲዛይኑ "
    "በታማኝነት መፈጸሙን አረጋግጠዋል."
)
assert list(segmenter.segment(text)) == [
    (
        0,
        "ሻጮች በሚያቀርቡት ድምፅ ሆኖ አገልግሏል፣ ይህም በረዶው ከቀለጠ በኋላ ጎብኚዎች ምቹ ጎጆ ከውጭው "
        "ዓለም ሙሉ በሙሉ በወቅቱ አስማት ያበራ ይመስላል፣ በተጋገሩ ደስታዎች ላይ ተዘርግቷል ፣ የጨዋታ "
        "ድንኳኖች ተሞልተው እያንዳንዱ ክፍል የክህሎት እና ተግሣጽ፣ በስብሰባው ላይ አፅንዖት ይሰጣል፣ "
        "ያለፈውን፣ ሸራዎችን እያስተካከሉ እና ቅዝቃዜን ሰጥቷል። ",
    ),
    (
        219,
        "የተወሳሰቡ ቅርጻ ቅርጾች እና ከአዲስ የተጠበሰ ዳቦ መዓዛ ተሰብሳቢዎቹ እንዲዝናኑ ይጋብዛሉ። ",
    ),
    (
        278,
        "አየሩ በሳር፣ በሁሉም ዝርዝር ውስጥ ፣በፈጠራ እና በእውቀት መሬቱን ይንከባከባሉ, ዲዛይኑ በታማኝነት መፈጸሙን "
        "አረጋግጠዋል.",
    ),
]
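In the examples above, each offset equals the sum of the lengths of the preceding segments, which suggests that offsets are indices into the Python str (code points) rather than byte offsets. The sketch below checks this on the first example using plain tuples copied from it, without importing the library. The helper name to_spans is hypothetical and not part of unicode_segment.

```python
from typing import Iterable, Iterator, Tuple


def to_spans(segments: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, int]]:
    """Convert (start, segment_text) pairs into (start, end) index pairs."""
    for start, chunk in segments:
        yield start, start + len(chunk)


text = (
    "“How do you know I’m mad?”, said Alice. “You must be,” said the Cat, "
    "“or you wouldn’t have come here.”"
)

# Output of the first usage example, copied verbatim.
segments = [
    (0, "“How do you know I’m mad?”, said Alice. "),
    (40, "“You must be,” said the Cat, “or you wouldn’t have come here.”"),
]

spans = list(to_spans(segments))

# Each span slices the original text back to its segment.
for (start, end), (_, chunk) in zip(spans, segments):
    assert text[start:end] == chunk

# The segments cover the text with no gaps or overlaps.
assert "".join(chunk for _, chunk in segments) == text
```

If byte offsets are needed instead (for example, for a UTF-8 buffer), they can be derived by encoding text[:start] and taking its length.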

Download files

Source Distribution

unicode_segment-0.4.4.tar.gz (20.9 kB)

Uploaded Source

Built Distribution

unicode_segment-0.4.4-py3-none-any.whl (15.5 kB)

Uploaded Python 3

File details

Details for the file unicode_segment-0.4.4.tar.gz.

File metadata

  • Download URL: unicode_segment-0.4.4.tar.gz
  • Upload date:
  • Size: 20.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for unicode_segment-0.4.4.tar.gz
Algorithm    Hash digest
SHA256       48539fd4c82fd127e8e205e45fcdd98d4af79f78bc61e1006fb2871e09c02756
MD5          7d426255d4b7bfcb7300158f5ed7c6f0
BLAKE2b-256  64e475ff3bccbd6b14623a213ec3afa89a4a6436db004fb426d5df84288fd5db
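
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only hashlib from the standard library. The local file path is an assumption about where the archive was saved, and the helper names are illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digest copied from the hash listing for unicode_segment-0.4.4.tar.gz.
EXPECTED_SHA256 = "48539fd4c82fd127e8e205e45fcdd98d4af79f78bc61e1006fb2871e09c02756"


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hexadecimal SHA256 digest of data."""
    return hashlib.sha256(data).hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the digest of the file at path with the expected digest."""
    return sha256_hex(path.read_bytes()) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the archive sits in the current working directory.
    archive = Path("unicode_segment-0.4.4.tar.gz")
    if archive.exists():
        print("OK" if matches_expected(archive) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

pip can perform the same comparison at install time when a requirements file pins the digest and the install is run with --require-hashes.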

File details

Details for the file unicode_segment-0.4.4-py3-none-any.whl.

File hashes

Hashes for unicode_segment-0.4.4-py3-none-any.whl
Algorithm    Hash digest
SHA256       460d5b0cf97633b526a248c894016dc6eedad34a99772f181176e47c3a1f2d8a
MD5          1c4b4d1d2533e6a646786f5d84493364
BLAKE2b-256  612243531776d585da52ffd34b461c5d127dcfdd7fead21f634abdb44460d03a
