A library for segmenting grapheme clusters.
Grapheme Cluster Break (Python)
A high-performance Python library for segmenting Unicode strings into grapheme clusters (user-perceived characters) according to UAX #29: Unicode Text Segmentation.
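To illustrate why grapheme clusters matter, the sketch below uses only the standard library (not this package) to show that Python's len() counts code points rather than user-perceived characters. The sample strings are illustrative choices, not taken from the library's test suite.

```python
import unicodedata

# "é" written as a base letter plus COMBINING ACUTE ACCENT (U+0301).
decomposed = "e\u0301"
# Family emoji: four emoji joined by ZERO WIDTH JOINER (U+200D).
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
# Flag: a pair of regional indicator symbols (C + N).
flag = "\U0001F1E8\U0001F1F3"

# len() counts code points, so each single perceived character is over-counted.
print(len(decomposed))  # 2
print(len(family))      # 7
print(len(flag))        # 2

# NFC normalization composes "e" + accent into one code point,
# but no precomposed form exists for the emoji sequence or the flag.
print(len(unicodedata.normalize("NFC", decomposed)))  # 1
print(len(unicodedata.normalize("NFC", family)))      # 7
print(len(unicodedata.normalize("NFC", flag)))        # 2
```

Normalization alone therefore does not give a reliable character count, which is the gap that grapheme cluster segmentation fills.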
Installation
pip install grapheme-cluster-break
Usage
from grapheme_cluster_break import segment_grapheme_clusters
# Basic usage
clusters = segment_grapheme_clusters("Hello")
print(clusters) # ['H', 'e', 'l', 'l', 'o']
# Emoji ZWJ sequences
clusters = segment_grapheme_clusters("👨\u200d👩\u200d👧\u200d👦") # four emoji joined by ZWJ (U+200D)
print(clusters) # ['👨‍👩‍👧‍👦'] (a single cluster)
# Combining characters
clusters = segment_grapheme_clusters("e\u0301") # e + combining acute accent (U+0301)
print(clusters) # ['é']
# Regional indicators (flags)
clusters = segment_grapheme_clusters("🇨🇳🇺🇸")
print(clusters) # ['🇨🇳', '🇺🇸']
# Indic conjuncts
clusters = segment_grapheme_clusters("क्ष") # Devanagari ksha
print(clusters) # ['क्ष']
# CJK characters
clusters = segment_grapheme_clusters("你好世界")
print(clusters) # ['你', '好', '世', '界']
# Hangul
clusters = segment_grapheme_clusters("한글")
print(clusters) # ['한', '글']
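A common reason to segment text is to count or truncate by user-perceived characters. The helpers below are a minimal sketch, not part of this library: grapheme_length and truncate_graphemes are hypothetical names, and the segmenter is passed in as a parameter so the code does not assume anything about the package beyond the call shown above.

```python
from typing import Callable, List

# A segmenter is any callable that splits a string into grapheme clusters.
Segmenter = Callable[[str], List[str]]


def grapheme_length(text: str, segment: Segmenter) -> int:
    """Count user-perceived characters rather than code points."""
    return len(segment(text))


def truncate_graphemes(
    text: str, limit: int, segment: Segmenter, ellipsis: str = "..."
) -> str:
    """Keep at most `limit` clusters, never cutting a cluster in half."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    clusters = segment(text)
    if len(clusters) <= limit:
        return text
    return "".join(clusters[:limit]) + ellipsis


# With the library installed (not executed here):
#   from grapheme_cluster_break import segment_grapheme_clusters
#   truncate_graphemes("你好世界", 2, segment_grapheme_clusters)
```

Slicing the cluster list instead of the string keeps emoji sequences, flags, and combining marks intact at the cut point.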
API Reference
segment_grapheme_clusters(s, extended=True)
Segments a Unicode string (a Python str) into grapheme clusters.
Parameters:
- s (str) - The input string to segment.
- extended (bool, optional) - If True (default), uses extended grapheme cluster rules. If False, uses legacy rules.
Returns:
- list[str] - A list of strings, each representing one grapheme cluster.
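To see what the extended flag changes for a given input, both modes can be run side by side. In UAX #29, extended clusters additionally keep spacing marks and prepend characters attached to their base, so the two results may differ for some scripts. The helper below is a sketch: compare_modes is a hypothetical name, the segmenter is passed in as a callable with the signature documented above, and the round-trip check assumes that segmentation only splits the input without altering it.

```python
from typing import Callable, List, Tuple


def compare_modes(
    text: str, segment: Callable[..., List[str]]
) -> Tuple[List[str], List[str]]:
    """Segment `text` with extended and legacy rules and return both results."""
    extended = segment(text, extended=True)
    legacy = segment(text, extended=False)
    # Assumption: segmentation only splits the input, so joining restores it.
    for clusters in (extended, legacy):
        if "".join(clusters) != text:
            raise ValueError("clusters do not reassemble into the input")
    return extended, legacy


# With the library installed (not executed here):
#   from grapheme_cluster_break import segment_grapheme_clusters
#   extended, legacy = compare_modes("कि", segment_grapheme_clusters)
```

Comparing the two lists for representative strings is a quick way to decide which mode suits an application.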
Building from Source
# Install build dependencies
pip install scikit-build-core pybind11
# Build and install
pip install .
# Run tests
pip install pytest
pytest python/tests/
License
MIT License
Download files
Source Distribution
File details
Details for the file grapheme_cluster_break-1.1.1.tar.gz.
File metadata
- Download URL: grapheme_cluster_break-1.1.1.tar.gz
- Upload date:
- Size: 46.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 196d9b18b37aac8bb5eba4aaf9dd350532204c141506794779568c8a5d24ce7a |
| MD5 | 77c20eee1c3ea997c535c7edb6c63d36 |
| BLAKE2b-256 | e34758a0f3d793816b4ee392a9a5e3d70ae8bbaa9589837b5fdfe563cbd370c7 |
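The digests above can be checked against a downloaded archive with the standard library. This is a minimal sketch: file_digest and matches are hypothetical helper names, and "blake2b-256" is a label chosen here for BLAKE2b with a 32-byte digest.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Return the hex digest of a file, reading it in chunks."""
    if algorithm == "blake2b-256":
        # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path: Path, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, matches(Path("grapheme_cluster_break-1.1.1.tar.gz"), "196d9b18b37aac8bb5eba4aaf9dd350532204c141506794779568c8a5d24ce7a") should return True for an intact copy of the source distribution.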