An implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29.

pyuegc

A pure-Python implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into extended grapheme clusters (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation.” This package conforms to version 16.0 of the Unicode standard, released in September 2024, and is tested against the official Unicode grapheme break test file.

Installation and updates

To install the package, run:

pip install pyuegc

To upgrade to the latest version, run:

pip install pyuegc --upgrade

Changelog

Check out the latest updates and changes here.

Unicode character database (UCD) version

To retrieve the version of the Unicode character database in use:

>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'16.0.0'

Example usage

from pyuegc import EGC

def _output(unistr, egc):
    """Format a string and its extended grapheme clusters for display."""
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""

unistr = "Python"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Python
# Length of string: 6
# EGC: ['P', 'y', 't', 'h', 'o', 'n']
# Length of EGC: 6

unistr = "e\u0301le\u0300ve"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: élève
# Length of string: 7
# EGC: ['é', 'l', 'è', 'v', 'e']
# Length of EGC: 5

unistr = "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒
# Length of string: 20
# EGC: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# Length of EGC: 5

unistr = "기운찰만하다"
# The syllables are written with conjoining jamo (decomposed form), hence 15 code points
egc = EGC(unistr)
print(_output(unistr, egc))
# String: 기운찰만하다
# Length of string: 15
# EGC: ['기', '운', '찰', '만', '하', '다']
# Length of EGC: 6

unistr = "পৌষসংক্রান্তির"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: পৌষসংক্রান্তির
# Length of string: 14
# EGC: ['পৌ', 'ষ', 'সং', 'ক্রা', 'ন্তি', 'র']
# Length of EGC: 6

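Reversing a string code point by code point can detach combining diacritics from their base letters and attach them to the wrong ones. Reversing the list of extended grapheme clusters instead keeps each user-perceived character intact, regardless of the Unicode normalization form: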

unistr = "ai\u0302ne\u0301e"  # aînée

print(f"# Reversed string: {''.join(reversed(unistr))!r}")
# Reversed string: 'éen̂ia'

print(f"# EGC processed and reversed: {''.join(reversed(EGC(unistr)))!r}")
# EGC processed and reversed: 'eénîa'
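
To see why reversing cluster by cluster works, here is a deliberately simplified sketch that uses only the standard library. The helper names `simple_clusters` and `reverse_by_cluster` are hypothetical and are not part of pyuegc. The stand-in segmenter merely attaches combining marks to the preceding code point; it is not UAX #29 and does not reflect how pyuegc is implemented, so it will mishandle Hangul jamo, Indic conjuncts, emoji sequences, and similar cases.

```python
import unicodedata


def simple_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified stand-in for illustration only; NOT the UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        # A nonzero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def reverse_by_cluster(text, segment=simple_clusters):
    """Reverse text one cluster at a time using the given segmenter."""
    return "".join(reversed(segment(text)))


word = "ai\u0302ne\u0301e"  # aînée, with combining accents

naive = "".join(reversed(word))       # accents move to the wrong letters
clustered = reverse_by_cluster(word)  # accents stay with their base letters

print(ascii(naive))      # 'e\u0301en\u0302ia'
print(ascii(clustered))  # 'ee\u0301ni\u0302a'
```

For real text, the `segment` parameter is where a full segmenter belongs: assuming `EGC` returns a list of clusters, as the output above suggests, `reverse_by_cluster(word, segment=EGC)` should give the same result for this word while also handling the cases the stand-in ignores.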

Related resources

This implementation is based on the following resources:

Licenses

The code is licensed under the MIT license.

Usage of Unicode data files is governed by the UNICODE TERMS OF USE. Further specifications of rights and restrictions pertaining to the use of the Unicode data files and software can be found in the Unicode Data Files and Software License, a copy of which is included as UNICODE-LICENSE.
