
An implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29.

pyuegc

A pure-Python implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into extended grapheme clusters (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation.” This package conforms to version 16.0 of the Unicode standard, released in September 2024, and has been rigorously tested against the official Unicode test file to ensure accuracy.

Installation and updates

To install the package, run:

pip install pyuegc

To upgrade to the latest version, run:

pip install pyuegc --upgrade

Changelog

Check out the latest updates and changes here.

Unicode character database (UCD) version

To retrieve the version of the Unicode character database in use:

>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'16.0.0'
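
When `EGC` is combined with the standard library's `unicodedata` module (for normalization, say), it can help to know whether both rely on the same Unicode release. The following is a minimal sketch under that assumption; `same_unicode_release` is a hypothetical helper, not part of pyuegc, and the import is guarded so the snippet also runs where the package is not installed.

```python
import unicodedata


def same_unicode_release(version_a, version_b):
    """Return True when two 'major.minor.update' strings share major and minor.

    Hypothetical helper for illustration; not part of pyuegc.
    """
    return version_a.split(".")[:2] == version_b.split(".")[:2]


try:
    from pyuegc import UCD_VERSION
except ImportError:
    UCD_VERSION = None  # pyuegc is not installed in this environment

# Version of the Unicode database bundled with the running interpreter
stdlib_version = unicodedata.unidata_version

if UCD_VERSION is not None and not same_unicode_release(UCD_VERSION, stdlib_version):
    print(f"pyuegc uses UCD {UCD_VERSION}, unicodedata uses {stdlib_version}")
```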

Example usage

from pyuegc import EGC

def _output(unistr, egc):
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""

unistr = "Python"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Python
# Length of string: 6
# EGC: ['P', 'y', 't', 'h', 'o', 'n']
# Length of EGC: 6

unistr = "e\u0301le\u0300ve"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: élève
# Length of string: 7
# EGC: ['é', 'l', 'è', 'v', 'e']
# Length of EGC: 5

unistr = "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒
# Length of string: 20
# EGC: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# Length of EGC: 5

unistr = "기운찰만하다"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: 기운찰만하다
# Length of string: 15
# EGC: ['기', '운', '찰', '만', '하', '다']
# Length of EGC: 6

unistr = "পৌষসংক্রান্তির"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: পৌষসংক্রান্তির
# Length of string: 14
# EGC: ['পৌ', 'ষ', 'সং', 'ক্রা', 'ন্তি', 'র']
# Length of EGC: 6
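
The "Length of string" values above are code point counts, so they depend on how the text is encoded. The reported length of 15 for the Hangul example is consistent with a string stored as decomposed jamo (NFD); the same word in precomposed form (NFC) has 6 code points. A small sketch using only the standard library's `unicodedata` module shows the difference:

```python
import unicodedata

word = "기운찰만하다"

# Normalize explicitly so the result does not depend on how the literal is stored
nfc = unicodedata.normalize("NFC", word)  # one code point per syllable
nfd = unicodedata.normalize("NFD", word)  # leading, vowel, and trailing jamo

print(len(nfc))  # 6
print(len(nfd))  # 15

# The first syllable decomposes into two conjoining jamo
print([f"U+{ord(ch):04X}" for ch in nfd[:2]])  # ['U+1100', 'U+1175']
```

In both forms the number of extended grapheme clusters is the same, which is why the cluster count is the better measure of user-perceived length.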

Reversing a string code point by code point can attach combining diacritics to the wrong base characters, whereas reversing its extended grapheme clusters preserves the visual appearance of each character regardless of the Unicode normalization form:

unistr = "ai\u0302ne\u0301e"  # aînée

print(f"# Reversed string: {''.join(reversed(unistr))!r}")
# Reversed string: 'éen̂ia'

print(f"# EGC processed and reversed: {''.join(reversed(EGC(unistr)))!r}")
# EGC processed and reversed: 'eénîa'
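
The point about normalization forms can be checked with the standard library alone. The sketch below uses a hypothetical helper, `naive_clusters`, as a simplified stand-in for `EGC`: it only attaches characters with a nonzero canonical combining class to the preceding character and does not implement the full UAX #29 rules, but that is enough for this example.

```python
import unicodedata


def naive_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical, simplified stand-in for pyuegc.EGC: it looks only at
    canonical combining classes, not at the full UAX #29 rule set.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def reverse_clusters(text):
    """Reverse a string cluster by cluster instead of code point by code point."""
    return "".join(reversed(naive_clusters(text)))


nfd = "ai\u0302ne\u0301e"                # decomposed: 7 code points
nfc = unicodedata.normalize("NFC", nfd)  # precomposed: 5 code points

print(len(nfd), len(nfc))  # 7 5

# Both reversals denote the same text once brought to a common form
print(unicodedata.normalize("NFC", reverse_clusters(nfd)) == reverse_clusters(nfc))  # True
```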

Related resources

This implementation is based on the following resources:

Licenses

The code is licensed under the MIT license.

Usage of Unicode data files is governed by the UNICODE TERMS OF USE. Further specifications of rights and restrictions pertaining to the use of the Unicode data files and software can be found in the Unicode Data Files and Software License, a copy of which is included as UNICODE-LICENSE.
