pyuegc

A pure-Python implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into extended grapheme clusters (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation.” This package conforms to version 16.0 of the Unicode standard, released in September 2024, and is tested against the official Unicode test file for grapheme cluster boundaries (GraphemeBreakTest.txt).

Installation and updates

To install the package, run:

pip install pyuegc

To upgrade to the latest version, run:

pip install pyuegc --upgrade
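
To confirm which release is present after installing or upgrading, the standard library's importlib.metadata can be queried. This is only a minimal sketch: the helper name installed_version is made up for illustration and is not part of pyuegc.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Hypothetical helper in use; prints the version string or a fallback message.
found = installed_version("pyuegc")
print(f"pyuegc: {found or 'not installed'}")
```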

Changelog

See the project's changelog for the latest updates and changes.

Unicode character database (UCD) version

To retrieve the version of the Unicode character database in use:

>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'16.0.0'
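
The interpreter's own unicodedata module may ship a different UCD version than this package, so character properties looked up there can disagree with the segmentation data. The sketch below compares the two versions numerically. To stay self-contained it hard-codes the value shown above instead of importing UCD_VERSION, and the names version_tuple and PACKAGE_UCD_VERSION are assumptions introduced for illustration.

```python
import unicodedata


def version_tuple(version: str) -> tuple:
    """Convert a dotted version string such as '16.0.0' to (16, 0, 0)."""
    return tuple(int(part) for part in version.split("."))


# Value shown above; in real code use: from pyuegc import UCD_VERSION
PACKAGE_UCD_VERSION = "16.0.0"

interpreter = version_tuple(unicodedata.unidata_version)
package = version_tuple(PACKAGE_UCD_VERSION)

# Compare as integer tuples: plain string comparison would rank "9.0.0" above "16.0.0".
if interpreter < package:
    print("unicodedata is based on an older UCD than the package data")
elif interpreter > package:
    print("unicodedata is based on a newer UCD than the package data")
else:
    print("unicodedata and the package data use the same UCD version")
```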

Example usage

from pyuegc import EGC

def _output(unistr, egc):
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""

unistr = "Python"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Python
# Length of string: 6
# EGC: ['P', 'y', 't', 'h', 'o', 'n']
# Length of EGC: 6

unistr = "e\u0301le\u0300ve"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: élève
# Length of string: 7
# EGC: ['é', 'l', 'è', 'v', 'e']
# Length of EGC: 5

unistr = "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒
# Length of string: 20
# EGC: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# Length of EGC: 5

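import unicodedata

# Decompose the syllables into conjoining jamo (NFD): 15 code points in total
unistr = unicodedata.normalize("NFD", "기운찰만하다")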
egc = EGC(unistr)
print(_output(unistr, egc))
# String: 기운찰만하다
# Length of string: 15
# EGC: ['기', '운', '찰', '만', '하', '다']
# Length of EGC: 6

unistr = "পৌষসংক্রান্তির"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: পৌষসংক্রান্তির
# Length of string: 14
# EGC: ['পৌ', 'ষ', 'সং', 'ক্রা', 'ন্তি', 'র']
# Length of EGC: 6
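
Since the outputs above show EGC results as lists of strings, a common follow-up is to shorten text by user-perceived characters instead of code points. The sketch below assumes the result can be sliced like a list; truncate_clusters is a hypothetical helper, not part of pyuegc, and the cluster list is copied from the "élève" output above so that the block runs without the package.

```python
def truncate_clusters(clusters, limit, suffix="…"):
    """Keep at most `limit` clusters; append `suffix` only if something was cut."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    if len(clusters) <= limit:
        return "".join(clusters)
    return "".join(clusters[:limit]) + suffix


# Clusters as listed in the output above for "e\u0301le\u0300ve";
# in real code this would be EGC("e\u0301le\u0300ve").
clusters = ["e\u0301", "l", "e\u0300", "v", "e"]

# Cutting by clusters keeps each accent with its base letter.
print(truncate_clusters(clusters, 3))

# Cutting the raw string by code points yields only "él" for the same limit.
print("e\u0301le\u0300ve"[:3])
```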

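Reversing a string code point by code point can detach combining marks from their base letters and attach them to the wrong ones. Reversing the list of extended grapheme clusters instead preserves the visual appearance of each character, regardless of the Unicode normalization form: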

unistr = "ai\u0302ne\u0301e"  # aînée

print(f"# Reversed string: {''.join(reversed(unistr))!r}")
# Reversed string: 'éen̂ia'

print(f"# EGC processed and reversed: {''.join(reversed(EGC(unistr)))!r}")
# EGC processed and reversed: 'eénîa'
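
The effect of the normalization form can be illustrated with the standard library alone. The sketch below is a deliberately simplified stand-in for EGC: naive_clusters (a name invented here) only attaches combining marks to the preceding code point, which covers this example but is not the UAX #29 algorithm and would mishandle Hangul jamo, emoji sequences, and Indic conjuncts.

```python
import unicodedata


def naive_clusters(text):
    """Group each combining mark with the code point before it (simplified)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # nonzero combining class: attach to previous cluster
        else:
            clusters.append(ch)
    return clusters


def reverse_clusters(text):
    """Reverse a string cluster by cluster instead of code point by code point."""
    return "".join(reversed(naive_clusters(text)))


word = "ai\u0302ne\u0301e"  # aînée
nfc = unicodedata.normalize("NFC", word)  # precomposed: 5 code points
nfd = unicodedata.normalize("NFD", word)  # decomposed: 7 code points

print(len(nfc), len(nfd))
print(len(naive_clusters(nfc)), len(naive_clusters(nfd)))

# Both forms reverse to canonically equivalent strings.
same = unicodedata.normalize("NFC", reverse_clusters(nfd)) == reverse_clusters(nfc)
print(same)
```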

Related resources

This implementation is based on UAX #29, “Unicode Text Segmentation,” and on the data files of the Unicode character database.

Licenses

The code is licensed under the MIT license.

Usage of Unicode data files is governed by the UNICODE TERMS OF USE. Further specifications of rights and restrictions pertaining to the use of the Unicode data files and software can be found in the Unicode Data Files and Software License, a copy of which is included as UNICODE-LICENSE.
