
Determine Unicode text segmentations



A Python package to determine Unicode text segmentations.

You can see the full documentation including the package reference on


This package provides:

  • Functions to get Unicode Character Database (UCD) properties related to text segmentation.
  • Functions to determine segmentation boundaries in Unicode strings.
  • Classes that help implement Unicode-aware text wrapping in both console (monospace) and graphical (monospace / proportional) font environments.
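
As a rough illustration of why console wrapping has to be Unicode-aware, the sketch below estimates how many monospace columns a string occupies. It uses only the standard library `unicodedata` module and is not this package's implementation. The helper name `display_width` is hypothetical, and the rule it applies (East Asian Wide and Fullwidth characters take two columns, combining marks take none) is a simplifying assumption. Python 3 is assumed.

```python
import unicodedata


def display_width(text):
    """Estimate how many monospace columns `text` occupies (simplified)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the previous character.
            continue
        # Assumption: East Asian Wide (W) and Fullwidth (F) characters
        # take two columns, everything else takes one.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


print(display_width("abc"))
print(display_width("\u65e5\u672c\u8a9e"))  # three CJK ideographs
```

A wrapper that counts code points instead of columns would overflow the terminal width on the second string, which is the kind of problem the wrapping classes are meant to address.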

Supported segmentations are:

code point
A code point is “any value in the Unicode codespace.” It is the basic unit for processing Unicode strings.
grapheme cluster
A grapheme cluster approximately represents a “user-perceived character.” It may be made up of one or more Unicode code points, e.g. “G” + acute accent is a single user-perceived character.
word break
Word boundaries are a familiar segmentation in many common text operations, e.g. as the unit for text highlighting or cursor jumping. Note that in some languages words cannot be determined from spaces and punctuation alone: languages such as Thai or Japanese require dictionaries to find appropriate word boundaries. The package provides only a simple word breaking implementation that is based on scripts and does not use any dictionaries, but it also provides ways to customize the default behavior.
sentence break
Sentence breaks are also common in text processing, but they are more contextual and less formal. The sentence breaking implementation in the package follows the rules specified in the Unicode Standard Annex (UAX), so it is simple and formal too. It should still be useful in some cases.
line break
Implementing the line breaking algorithm is one of the key features of this package. Line breaking is important for general text presentation in both CLI and GUI applications.
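
The difference between code points and grapheme clusters can be seen with the standard library alone. The sketch below is not this package's implementation and does not follow the full UAX #29 rules. It assumes, as a simplification, that a cluster is a base character followed by any combining marks (characters with a nonzero canonical combining class). The helper name `naive_clusters` is hypothetical, and Python 3 is assumed.

```python
import unicodedata


def naive_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification: only the canonical combining class is checked. The real
    UAX #29 rules (Hangul syllables, emoji sequences, CR LF, ...) are ignored.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # A combining mark extends the previous cluster.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters


sample = "G\u0301o"  # "G" + COMBINING ACUTE ACCENT, then "o"
print(len(sample))             # number of code points
print(naive_clusters(sample))  # approximate user-perceived characters
```

Here the string holds three code points but only two user-perceived characters, which is why operations such as cursor movement or truncation should work on grapheme clusters rather than on code points.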


  • Python 2.7 / 3.4 / 3.5 / 3.6


Source / binary distributions (PyPI)
All sources and build tools etc. (Bitbucket)


Just type:

% pip install uniseg

or download the archive and:

% python setup.py install


0.7.1 (2015-05-02)
  • CHANGE: wrap.Wrapper.wrap() now returns the number of lines.
  • Separate LICENSE from README.txt for packaging-related reasons in some environments.
0.7.0 (2015-02-27)
  • CHANGE: Stopped gathering all submodules’ members into the top-level uniseg module.
  • CHANGE: Reformed the uniseg.wrap module and sample scripts.
  • Maintained the uniseg.wrap module; the sample scripts work again.
0.6.4 (2015-02-10)
  • Add the uniseg-dbpath console command, which just prints the path of ucd.sqlite3.
  • Include sample scripts under the package’s subdirectory.
0.6.3 (2015-01-25)
  • Python 3.4
  • Support modern setuptools, pip and wheel.
0.6.2 (2013-06-09)
  • Python 3.3
0.6.1 (2013-06-08)
  • Unicode 6.2.0


UAX #14: Unicode Line Breaking Algorithm (6.2.0)
UAX #29: Unicode Text Segmentation (6.2.0)

