A library for segmenting grapheme clusters.

Grapheme Cluster Break (Python)

Unicode 17.0.0

A high-performance Python library for segmenting Unicode strings into grapheme clusters (user-perceived characters) according to UAX #29: Unicode Text Segmentation.

Installation

pip install grapheme-cluster-break

Usage

from grapheme_cluster_break import segment_grapheme_clusters

# Basic usage
clusters = segment_grapheme_clusters("Hello")
print(clusters)  # ['H', 'e', 'l', 'l', 'o']

# Emoji ZWJ sequences
clusters = segment_grapheme_clusters("👨‍👩‍👧‍👦")
print(clusters)  # ['👨‍👩‍👧‍👦']

# Combining characters
clusters = segment_grapheme_clusters("e\u0301")  # e + combining acute accent (U+0301)
print(clusters)  # ['é']

# Regional indicators (flags)
clusters = segment_grapheme_clusters("🇨🇳🇺🇸")
print(clusters)  # ['🇨🇳', '🇺🇸']

# Indic conjuncts
clusters = segment_grapheme_clusters("क्ष")  # Devanagari ksha
print(clusters)  # ['क्ष']

# CJK characters
clusters = segment_grapheme_clusters("你好世界")
print(clusters)  # ['你', '好', '世', '界']

# Hangul
clusters = segment_grapheme_clusters("한글")
print(clusters)  # ['한', '글']
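The examples above return one list item per user-perceived character, which is not what `len()` on a Python `str` counts. As a rough illustration using only the standard library (it does not call this package, and the names `samples`, `code_point_names`, and `code_point_counts` are made up for this sketch), the snippet below shows how many code points sit behind three of the strings used above:

```python
import unicodedata

# Strings written with escapes so the underlying code points are unambiguous.
samples = {
    "combining accent": "e\u0301",
    "ZWJ family emoji": "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
    "two flags": "\U0001F1E8\U0001F1F3\U0001F1FA\U0001F1F8",
}


def code_point_names(text):
    """Return the Unicode name of every code point in `text`."""
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text]


# len() counts code points, not grapheme clusters.
code_point_counts = {label: len(text) for label, text in samples.items()}

for label, text in samples.items():
    print(label, code_point_counts[label], code_point_names(text))
```

The counts are 2, 7, and 4 code points, whereas the usage examples above show these strings segmenting into 1, 1, and 2 grapheme clusters.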

API Reference

segment_grapheme_clusters(s, extended=True)

Segments a string into grapheme clusters.

Parameters:

  • s (str) - The input string to segment.
  • extended (bool, optional) - If True (default), uses extended grapheme cluster rules. If False, uses legacy rules.

Returns:

  • list[str] - A list of strings, each representing one grapheme cluster.
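To see what the `extended` flag changes for a given input, one option is a small helper that runs both modes side by side. This is only a sketch: `compare_modes` and `count_clusters` are hypothetical helpers, not part of the library, and the segmenter is passed in as an argument (assumed to have the `segment(s, extended=True)` signature documented above) so the snippet does not depend on the package being installed.

```python
from typing import Callable, Dict, List


def compare_modes(text: str, segment: Callable[..., List[str]]) -> Dict[str, List[str]]:
    """Segment `text` with extended=True and extended=False and return both results."""
    return {
        "extended": segment(text, extended=True),
        "legacy": segment(text, extended=False),
    }


def count_clusters(text: str, segment: Callable[..., List[str]]) -> int:
    """Number of grapheme clusters in `text`, using the segmenter's default mode."""
    return len(segment(text))
```

With the package installed, `compare_modes("\u0ba8\u0bbf", segment_grapheme_clusters)` would be one way to try it. UAX #29 treats Tamil NA followed by VOWEL SIGN I (a spacing mark) as a single extended cluster but two legacy clusters, so the two lists should differ if the library follows the standard in both modes.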

Building from Source

# Install build dependencies
pip install scikit-build-core pybind11

# Build and install
pip install .

# Run tests
pip install pytest
pytest python/tests/
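After a source build, a quick check that the module can be found on the import path may be useful before running the full test suite. The helper below is a hypothetical sketch using only `importlib` from the standard library; `is_importable` is not something the project provides.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module named `module_name` can be located."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "grapheme_cluster_break"
    print(f"{name} importable: {is_importable(name)}")
```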

License

MIT License

Source Distribution

grapheme_cluster_break-1.1.1.tar.gz (46.2 kB view details)

Uploaded Source

File details

Details for the file grapheme_cluster_break-1.1.1.tar.gz.

File metadata

  • Download URL: grapheme_cluster_break-1.1.1.tar.gz
  • Upload date:
  • Size: 46.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.8.20

File hashes

Hashes for grapheme_cluster_break-1.1.1.tar.gz

Algorithm    Hash digest
SHA256       196d9b18b37aac8bb5eba4aaf9dd350532204c141506794779568c8a5d24ce7a
MD5          77c20eee1c3ea997c535c7edb6c63d36
BLAKE2b-256  e34758a0f3d793816b4ee392a9a5e3d70ae8bbaa9589837b5fdfe563cbd370c7
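A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch with `hashlib`; the function names are illustrative, and the file path in the usage note assumes the archive sits in the current directory.

```python
import hashlib

# SHA256 digest listed above for grapheme_cluster_break-1.1.1.tar.gz.
EXPECTED_SHA256 = "196d9b18b37aac8bb5eba4aaf9dd350532204c141506794779568c8a5d24ce7a"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals `expected`."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("grapheme_cluster_break-1.1.1.tar.gz")` should return True for an unmodified copy of the source distribution.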
