Unicode 16.0 Extended grapheme clusters in nanoseconds
Overview
Use `ugrapheme` to make your Python and Cython code see strings as a sequence of grapheme characters, so that the length of 👩🏽🔬🏴Hi is 4 instead of 13.
Trivial operations like reversing a string or getting the first and last character become easy not just for Latin text and emojis, but also for Devanagari, Hangul, Tamil, Bengali, Arabic, etc. Centering and justifying emojis and non-Latin text in terminal output becomes easy again, as `ugrapheme` uses `uwcwidth` under the hood.
`ugrapheme` exposes an interface that's almost identical to Python's native strings and maintains a similar performance envelope, processing strings at hundreds of megabytes or even gigabytes per second:
| `graphemes` | `graphemes` result | `str` | `str` result |
|---|---|---|---|
| `g = graphemes('👩🏽🔬🏴Hi')` | | `s = '👩🏽🔬🏴Hi'` | |
| `len(g)` | 4 | `len(s)` | 13 |
| `print(g[0])` | 👩🏽🔬 | `print(s[0])` | 👩 |
| `print(g[2])` | H | `print(s[2])` | 🔬 |
| `print(g[2:])` | Hi | `print(s[2:])` | 🔬🏴Hi |
| `print(g[::-1])` | iH🏴👩🏽🔬 | `print(s[::-1])` | iH🏴🔬🏽👩 |
| `g.find('🔬')` | -1 | `s.find('🔬')` | 3 |
| `print(','.join(g))` | 👩🏽🔬,🏴,H,i | `print(','.join(s))` | 👩,🏽,,🔬,🏴,,,,,,,H,i |
| `print(g.center(10, '-'))` | --👩🏽🔬🏴Hi-- | `print(s.center(10, '-'))` | 👩🏽🔬🏴Hi |
| `print(max(g))` | 👩🏽🔬 | `print(max(s))` | (unprintable) |
| `print(','.join(set(g)))` | i,🏴,👩🏽🔬,H | `print(','.join(set(s)))` | ,H,,🏴,,,,i,,,🏽,👩,🔬 |
Just like native Python strings, `graphemes` are hashable, iterable and pickleable.
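For instance, a `graphemes` object can serve as a dictionary key and survive a pickle round trip. A minimal sketch, assuming equality and hashing mirror `str` semantics:

```python
import pickle
from ugrapheme import graphemes

g = graphemes('👩🏽🔬🏴Hi')
labels = {g: 'scientist'}                 # usable as a dict key, like str
restored = pickle.loads(pickle.dumps(g))  # pickle round trip
assert restored == g and labels[restored] == 'scientist'
```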
Aside from passing the Unicode 16.0 UAX #29 extended grapheme cluster break tests, `ugrapheme` correctly parses many difficult cases that break other libraries, in Python and in other languages. As of this writing (October 2024), `ugrapheme` is among the fastest and probably among the most correct implementations across all programming languages and operating systems.
Installation
pip install ugrapheme
Basic usage
In [1]: from ugrapheme import graphemes
In [2]: g = graphemes("👩🏽🔬🏴Hi")
In [3]: print(g[0])
👩🏽🔬
In [4]: print(g[-1])
i
In [5]: len(g)
Out[5]: 4
In [6]: print(g.center(10) + '\n0123456789')
👩🏽🔬🏴Hi
0123456789
In [7]: print(g * 5)
👩🏽🔬🏴Hi👩🏽🔬🏴Hi👩🏽🔬🏴Hi👩🏽🔬🏴Hi👩🏽🔬🏴Hi
In [8]: print(g.join(["Ho", "Hey"]))
Ho👩🏽🔬🏴HiHey
In [9]: print(g.replace('🏴','<scotland>'))
👩🏽🔬<scotland>Hi
In [10]: namaste = graphemes('नमस्ते')
In [11]: list(namaste)
Out[11]: ['न', 'म', 'स्ते']
In [12]: print('>> ' + g[::-1] + namaste + ' <<')
>> iH🏴👩🏽🔬नमस्ते <<
Documentation
Aside from this file, all public methods have detailed docstrings with examples, which should hopefully show up in IPython, VS Code, Jupyter Notebook or whatever else you happen to be using.
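The plain `help()` builtin prints them as well, for example:

```python
from ugrapheme import graphemes

help(graphemes)         # class-level overview
help(graphemes.gslice)  # per-method docstring, examples included
```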
Performance: pyuegc 24x slower, uniseg 45x slower, ...
The popular Python grapheme splitting libraries are dramatically slower. Some could not even return correct results despite spending orders of magnitude more CPU on the same task.
I gave these libraries the benefit of the doubt by employing them on simple tasks such as returning the list of graphemes. The `graphemes` object takes even less time to build and uses less memory than the Python list of strings these libraries expect you to work with, but let's try to compare apples to apples here.
pyuegc: 24x slower
In [1]: from pyuegc import EGC
In [2]: from ugrapheme import grapheme_split
In [3]: print(','.join(EGC("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद")))
H,e,l,l,o, ,👩🏽🔬,!, ,👩🏼❤️💋👨🏾, ,अ,नु,च्छे,द
In [4]: print(','.join(grapheme_split("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद")))
H,e,l,l,o, ,👩🏽🔬,!, ,👩🏼❤️💋👨🏾, ,अ,नु,च्छे,द
In [5]: %%timeit
...: EGC("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद")
8.19 μs ± 77.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [6]: %%timeit
...: grapheme_split("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद")
337 ns ± 3.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
uniseg: 45x slower, incorrect
In [1]: from uniseg.graphemecluster import grapheme_clusters
In [2]: from ugrapheme import grapheme_split
In [3]: print(','.join(grapheme_clusters("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद"))) # Wrong
H,e,l,l,o, ,👩🏽🔬,!, ,👩🏼❤️💋👨🏾, ,अ,नु,च्,छे,द
In [4]: print(','.join(grapheme_split("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद"))) # Correct
H,e,l,l,o, ,👩🏽🔬,!, ,👩🏼❤️💋👨🏾, ,अ,नु,च्छे,द
In [5]: %%timeit
...: list(grapheme_clusters("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद"))
14.6 μs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [6]: %%timeit
...: grapheme_split("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद")
340 ns ± 5.31 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
grapheme: 52x slower, incorrect
In [1]: from grapheme import graphemes
In [2]: from ugrapheme import grapheme_split
In [3]: print(','.join(graphemes("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद"))) # Wrong
H,e,l,l,o, ,👩🏽🔬,!, ,👩🏼❤️💋👨🏾, ,अ,नु,च्,छे,द
In [4]: print(','.join(grapheme_split("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद"))) # Correct
H,e,l,l,o, ,👩🏽🔬,!, ,👩🏼❤️💋👨🏾, ,अ,नु,च्छे,द
In [5]: %%timeit
...: list(graphemes("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद"))
17.4 μs ± 26.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [6]: %%timeit
...: grapheme_split("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद")
332 ns ± 0.79 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
pyicu: 5x slower, incorrect
In [1]: import icu
...: def iterate_breaks(text, break_iterator):
...: break_iterator.setText(text)
...: lastpos = 0
...: while True:
...: next_boundary = break_iterator.nextBoundary()
...: if next_boundary == -1: return
...: yield text[lastpos:next_boundary]
...: lastpos = next_boundary
...: bi = icu.BreakIterator.createCharacterInstance(icu.Locale.getRoot())
In [2]: from ugrapheme import grapheme_split
In [3]: print(','.join(iterate_breaks("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद", bi))) # Wrong
H,e,l,l,o, ,👩🏽🔬! 👩,🏼,,❤️💋👨🏾 अनुच्छे,द,,,,
In [4]: print(','.join(grapheme_split("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद"))) # Correct
H,e,l,l,o, ,👩🏽🔬,!, ,👩🏼❤️💋👨🏾, ,अ,नु,च्छे,द
In [5]: %%timeit
...: list(iterate_breaks("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद", bi))
1.6 μs ± 9.62 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [6]: %%timeit
...: grapheme_split("Hello 👩🏽🔬! 👩🏼❤️💋👨🏾 अनुच्छेद")
335 ns ± 1.48 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Gotchas and performance tips
Standalone functions for highest performance
The `graphemes` type is optimized for minimal CPU overhead overall, taking nanoseconds to instantiate and around 4 extra bytes per string character. However, if you want the absolute maximum performance and only need specific grapheme information, try the `grapheme_` family of standalone functions, as these do not allocate memory or preprocess the input string in any way:
In [1]: from ugrapheme import (grapheme_len, grapheme_split,
...: grapheme_iter, grapheme_at, grapheme_slice)
In [2]: grapheme_len("👩🏽🔬🏴Hi")
Out[2]: 4
In [3]: grapheme_split('नमस्ते')
Out[3]: ['न', 'म', 'स्ते']
In [4]: grapheme_slice('👩🏽🔬🏴Hi', 2, 4)
Out[4]: 'Hi'
In [5]: grapheme_at('नमस्ते', 2)
Out[5]: 'स्ते'
In [6]: for gr in grapheme_iter('👩🏽🔬🏴Hi'):
...: print(gr)
...:
👩🏽🔬
🏴
H
i
Just like the `graphemes` methods, the individual functions can be `cimport`ed into Cython for even less overhead.
Concatenating
The fastest way to concatenate many `graphemes` and strings into another `graphemes` object is to join them using `graphemes('').join`, for example:
g2 = graphemes('').join(['>> ', g, ' -- ', namaste, ' <<'])
If you are just joining everything into a string, use the string `.join` method; it will work fine and be faster. Converting a `graphemes` object into a string is instantaneous, as `graphemes` works with native Python strings internally:
s2 = ''.join(['>> ', g, ' -- ', namaste, ' <<'])
Slices are strings, not graphemes
When you take a slice of a `graphemes` object, you get back a Python string. If you want to keep working with `graphemes`, use the `gslice` method:
In [2]: g = graphemes("👩🏽🔬🏴Hi")
In [3]: print(g[:2])
👩🏽🔬🏴
In [4]: len(g[:2]) # Returns 11, because g[:2] is a string
Out[4]: 11
In [5]: print(g.gslice(end=2))
👩🏽🔬🏴
In [6]: len(g.gslice(end=2))
Out[6]: 2
Using `gslice` constructs another `graphemes` object, which takes additional CPU and memory. Not returning `graphemes` from slicing saves a bunch of nanoseconds (percentage-wise) on very small strings, but unfortunately introduces this quirk.
Cython
If you are using `graphemes` in a Cython project, you can further improve performance dramatically by doing a `cimport` of the provided `.pxd` files. There are fully typed versions of operations such as accessing individual grapheme characters, slicing, find, replace, append, etc.
Performance explained
What's hard about this?
Not only are individual graphemes formed by rules that take dozens of pages to describe, there are also tables that need to be consulted. Implemented naively, that means computing a hash for each codepoint and doing random access across hundreds of kilobytes or even megabytes of RAM.
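As an illustration of that naive route (hypothetical code, not `ugrapheme`'s):

```python
# One hash computation plus one random memory access per codepoint, over a
# table with tens of thousands of entries.
GRAPHEME_BREAK_PROPERTY = {0x000D: 'CR', 0x000A: 'LF', 0x200D: 'ZWJ'}  # ...and many more

def naive_break_properties(text):
    return [GRAPHEME_BREAK_PROPERTY.get(ord(ch), 'Other') for ch in text]
```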
To make things even more complex, Python internally represents strings in different data formats depending on whether they contain ASCII, Latin-1, 2-byte or 4-byte Unicode characters, and on whether they are "compact" or not. `ugrapheme` understands the different formats and has separate low-level implementations for the different combinations of underlying formats.
Custom sparse bitmaps instead of tables, tries or maps
Using custom sparse bitmap data structures, `ugrapheme` stores properties for every possible Unicode codepoint in less than 11 KB of data, comfortably fitting into the L1 cache of most CPUs produced in the last 20 years. A property lookup costs 2 loads, a few shifts and maybe a compare or two. Furthermore, similar characters occur in similar places, so for most text only a few small contiguous portions of the tables are actually read.
Simpler LL(1) parser / DFA
Instead of implementing the state machine suggested by UAX #29, `ugrapheme` implements the rules as an LL(1) grammar parsed by a DFA with a very small number of states. An attempt is made to use the minimum number of codepoint property lookups to decide on a state transition, and an unsigned comparison serves as an early Bloom-filter-style check on properties. Eventually, the CPU's branch predictor gets trained on the text you are processing and skips the character classes your text does not contain. Easier strings, such as those containing mostly ASCII or Latin-1, take less than a nanosecond per character on a fast 2024 laptop, reaching around 1.5 GB/second.
CPython optimizations
The underlying Python string representation is kept rather than copied or transformed, reducing memory pressure. Concatenations, joins, replications and substring replaces are done speculatively, assuming easy cases and reverting to extra memory allocation or recalculation of the underlying grapheme boundaries as rarely as possible; the sketch below illustrates the idea.
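A self-contained sketch of that speculative strategy, with hypothetical names and a toy safety test (not `ugrapheme`'s internals):

```python
import string

SEAM_SAFE = set(string.ascii_letters + ' .!?')  # toy: can't combine across the seam

def concat_boundaries(a, a_bounds, b, b_bounds):
    """a_bounds/b_bounds hold grapheme start offsets; returns offsets for a + b."""
    if a and b and a[-1] in SEAM_SAFE and b[0] in SEAM_SAFE:
        return a_bounds + [off + len(a) for off in b_bounds]  # fast path: splice
    return rescan_boundaries(a + b)                           # rare path: recompute

def rescan_boundaries(s):
    # Toy fallback treating every codepoint as its own grapheme; a real
    # fallback would re-run the UAX #29 rules around the seam.
    return list(range(len(s)))
```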
Mirroring the Reykjavik Need for Speed sprint from two decades ago, the efforts of Andrew Dalke and the late Fredrik Lundh have been replicated in spirit, so that we can match or sometimes even beat the native Python string library. Generally, there are hand-coded solutions to subcases of instantiation, replication, concatenation, search and replace wherever they give a significant performance advantage.
Cython optimizations
In performance-critical loops, we sometimes do away with Python/Cython object types and instead deal directly with `PyObject *`, to avoid Cython generating unnecessary reference count increments and decrements. Decrements in particular involve checks and threaten to jump into a maze of deallocation code, confusing both the compiler and the CPU.
Correctness
Like many other libraries, `ugrapheme` passes the unicode.org UAX #29 suggested GraphemeBreakTest.txt.
Separately, there's a brute-force exhaustive test over all possible Unicode codepoints for `ugrapheme`'s custom sparse bitmap data structures.
Separate tests cover many cases where concatenating or replacing a portion of a grapheme changes the underlying grapheme boundaries.
Here are some examples of corner cases:
from ugrapheme import graphemes
len(graphemes('hi') + chr(13))
# outputs 3
len(graphemes('hi') + chr(13) + chr(10))
# also outputs 3, because chr(13) + chr(10) is a single grapheme!
len(graphemes('hi') + chr(10) + 'there')
# outputs 8
len((graphemes('hi') + chr(10) + 'there').replace('i', 'i' + chr(13)))
# also outputs 8, because chr(13) + chr(10) is a single grapheme!
g = graphemes('Hi👍')
len(g)
# outputs 3
g += '🏾' # Adding a Fitzpatrick skin type modifier...
len(g) # ..does not change the grapheme length
# outputs 3
g
# outputs graphemes('Hi👍🏾')
Additionally, there are explicit tests for complicated graphemes known to have caused issues in other libraries, such as Devanagari conjuncts.
For slicing, joining and replacing substrings, extra unit tests make sure we always create the correct underlying Python string representation (ASCII, Latin-1, 2-byte or 4-byte).
Limitations
A `graphemes()` string is limited to 4,294,967,294 (2³² − 2) Unicode codepoints. The standalone functions `grapheme_split`, `grapheme_len`, `grapheme_at`, etc. are not affected by this limit and work on strings of any size.
Giving credit
The whole library is licensed under the most permissive license I could find, so there are absolutely no legal requirements for giving credit.
However, outside of legal requirements, I promise that those who misrepresent this work as theirs will be dealt with in a professional demoscene way.
License for Unicode Data Files (not packaged by default)
The test files for `ugrapheme`, available under `ugrapheme_ucd/data`, are Copyright © 1991-2024 Unicode, Inc. and provided under the UNICODE LICENSE V3. See `ugrapheme_ucd/data/LICENSE`.
The `ugrapheme_ucd/data` directory is not shipped with the `ugrapheme` library build; it is only included in the testing component.
License
© 2024 !ZAJC!/GDS
Licensed under the BSD Zero Clause License. See LICENSE in the project root, or the SPDX 0BSD page for full license information.
The SPDX license identifier for this project is `0BSD`.