An implementation of the Unicode algorithm for breaking code point sequences into extended grapheme clusters as specified in UAX #29

pyuegc

pyuegc implements the Unicode algorithm for breaking strings of text (i.e., code point sequences) into extended grapheme clusters (“user-perceived characters”), as specified in UAX #29, “Unicode Text Segmentation”. This package supports version 15.1 of the Unicode Standard (released in September 2023). It has been thoroughly tested against the Unicode grapheme break test file (GraphemeBreakTest.txt).

Installation

The easiest way to install the package is with pip:

pip install pyuegc

UCD version

To get the version of the Unicode Character Database (UCD) that the package currently uses:

>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'15.1.0'
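
Python's standard unicodedata module ships its own copy of the Unicode Character Database, and its version may differ from the one used by pyuegc. The sketch below compares the two. It is only an illustration: version_tuple is a hypothetical helper (not part of pyuegc), and the ImportError fallback exists solely so that the snippet still runs where pyuegc is not installed.

```python
import unicodedata

try:
    from pyuegc import UCD_VERSION
except ImportError:
    # Assumption for illustration only: without pyuegc, reuse the stdlib value
    # so the comparison below still executes.
    UCD_VERSION = unicodedata.unidata_version


def version_tuple(version):
    """Turn a dotted version string such as '15.1.0' into (15, 1, 0)."""
    return tuple(int(part) for part in version.split("."))


# Compare major.minor of the data used by pyuegc and by the interpreter.
same = version_tuple(UCD_VERSION)[:2] == version_tuple(unicodedata.unidata_version)[:2]
print(f"pyuegc: {UCD_VERSION}, unicodedata: {unicodedata.unidata_version}, same: {same}")
```

If the two versions differ, character properties looked up through unicodedata may not match the data that drove the segmentation.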

Example usage

from pyuegc import EGC


def _output(unistr, egc):
    # len(unistr) counts code points; len(egc) counts extended grapheme clusters.
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""

unistr = "Python"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Python
# Length of string: 6
# EGC: ['P', 'y', 't', 'h', 'o', 'n']
# Length of EGC: 6

unistr = "e\u0301le\u0300ve"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: élève
# Length of string: 7
# EGC: ['é', 'l', 'è', 'v', 'e']
# Length of EGC: 5

unistr = "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒
# Length of string: 20
# EGC: ['Z̷̳̎', 'a̸̛ͅ', 'l̷̻̇', 'g̵͉̉', 'o̸̰͒']
# Length of EGC: 5

unistr = "기운찰만하다"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: 기운찰만하다
# Length of string: 15 -> code points (the count assumes the string is stored as decomposed conjoining jamo)
# EGC: ['기', '운', '찰', '만', '하', '다']
# Length of EGC: 6

unistr = "পৌষসংক্রান্তির"
egc = EGC(unistr)
print(_output(unistr, egc))
# String: পৌষসংক্রান্তির
# Length of string: 14
# EGC: ['পৌ', 'ষ', 'সং', 'ক্রা', 'ন্তি', 'র']
# Length of EGC: 6


unistr = "ai\u0302ne\u0301e"  # aînée
print(f"# Reversed string:\n#   {''.join(reversed(unistr))}")
print(f"# Reversed EGC:   \n#   {''.join(reversed(EGC(unistr)))}")
# Reversed string:
#   éen̂ia -> wrong (diacritics are messed up)
# Reversed EGC:
#   eénîa -> right (regardless of the Unicode normalization form)
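
The same idea applies to counting and truncating text. The sketch below assumes, as in the examples above, that EGC returns a list of clusters. truncate is a hypothetical helper, not part of pyuegc, and the fallback EGC defined under ImportError is a rough stand-in that only handles combining marks (it is not UAX #29) so that the snippet runs without the package.

```python
import unicodedata

try:
    from pyuegc import EGC
except ImportError:
    # Stand-in used only when pyuegc is missing. It is NOT UAX #29: it merely
    # glues combining marks onto the preceding character.
    def EGC(text):
        clusters = []
        for char in text:
            if clusters and unicodedata.combining(char):
                clusters[-1] += char
            else:
                clusters.append(char)
        return clusters


def truncate(text, limit):
    """Keep at most `limit` clusters of `text` (hypothetical helper)."""
    return "".join(EGC(text)[:limit])


nfd = "e\u0301le\u0300ve"                  # decomposed form: 7 code points
nfc = unicodedata.normalize("NFC", nfd)    # precomposed form: 5 code points

print(len(nfd), len(nfc))                  # code point counts differ
print(len(EGC(nfd)), len(EGC(nfc)))        # cluster counts agree
print(truncate(nfd, 3) == nfd[:5])         # three clusters span five code points
```

Slicing the list of clusters rather than the string keeps each base letter together with its diacritics, whichever normalization form the input uses.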

Related resources

This implementation is based on the following resources:

Licenses

The code is available under the MIT license.

Usage of Unicode data files is governed by the UNICODE TERMS OF USE. Further specifications of rights and restrictions pertaining to the use of the Unicode data files and software can be found in the Unicode Data Files and Software License, a copy of which is included as UNICODE-LICENSE.

