uniseg

Determine Unicode text segmentations


A Python package to determine Unicode text segmentations.

News

We released version 0.9.0 in November 2024, and it is the first release ever to pass all the Unicode breaking tests (congrats!). I am now going to bump the release number to 1.0 soon, with some breaking changes to the APIs. Thank you.

Features

This package provides:

Functions to get Unicode Character Database (UCD) properties related to text segmentation.
Functions to determine segmentation boundaries of Unicode strings.
Classes that help implement Unicode-aware text wrapping on both console (monospace) and graphical (monospace / proportional) font environments.

The supported segmentations are:

code point

A code point is “any value in the Unicode codespace.” It is the basic unit for processing Unicode strings.

Historically, the unit of a Unicode string object in older versions of Python was build-dependent. Some builds used UTF-16 internally and treated each code point greater than U+FFFF as a “surrogate pair”, which is a pair of two special code points. The uniseg package used to provide utility functions for handling Unicode strings by proper code points on every platform.

Since Python 3.3, Unicode strings are implemented with the “flexible string representation”, which gives access to full code points while remaining space-efficient [PEP 393]. You no longer need to worry about a code point being split into multiple units. If you want to process a Unicode string per code point, just iterate over it: for c in s:. The uniseg.codepoint module has therefore been deprecated and removed.
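As a small illustration of the point above, the following sketch uses only the standard library. It shows that a string containing a code point above U+FFFF is iterated one full code point at a time, and, for comparison, how the same code point looks as a UTF-16 surrogate pair. The sample string is chosen here for illustration only.

```python
# Each element of a Python 3.3+ str is one full code point,
# including code points above U+FFFF.
s = "a\U00010000b"  # the middle character is U+10000

code_points = [f"U+{ord(c):04X}" for c in s]
print(len(s))       # 3
print(code_points)  # ['U+0061', 'U+10000', 'U+0062']

# For comparison, UTF-16 stores U+10000 as a surrogate pair,
# i.e. two 16-bit units (0xD800 followed by 0xDC00).
utf16_units = s[1].encode("utf-16-be")
print(utf16_units.hex())  # d800dc00
```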

grapheme cluster

A grapheme cluster approximately represents a “user-perceived character.” It may be made up of one or more Unicode code points. For example, “g̈” (“g” + combining diaeresis) is a single user-perceived character, yet it is represented by two code points: U+0067 LATIN SMALL LETTER G and U+0308 COMBINING DIAERESIS.
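The “g̈” example can be checked with a short sketch. The first half uses only the standard library to list the two code points. The second half assumes that uniseg is installed and that grapheme_clusters() can be imported from uniseg.graphemecluster; if the import fails, the sketch falls back to treating the whole string as one cluster so that it still runs.

```python
import unicodedata

text = "g\u0308"  # "g" + combining diaeresis: one user-perceived character

# At the str level there are two code points.
names = [unicodedata.name(c) for c in text]
print(len(text))  # 2
print(names)      # ['LATIN SMALL LETTER G', 'COMBINING DIAERESIS']

# Assumption: uniseg is installed and provides
# uniseg.graphemecluster.grapheme_clusters(). The fallback below is only
# there so the sketch runs without the package.
try:
    from uniseg.graphemecluster import grapheme_clusters
    clusters = list(grapheme_clusters(text))
except ImportError:
    clusters = [text]

print(len(clusters))  # one grapheme cluster
```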

word break

Word boundaries are a familiar segmentation in many common text operations, e.g. as the unit for text highlighting or cursor jumping. Note that in some languages words cannot be determined from spaces or punctuation alone. Languages such as Thai or Japanese require dictionaries to determine appropriate word boundaries. The package provides only a simple word breaking implementation that is based on scripts and does not use any dictionaries, but it also provides ways to customize its default behavior.
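The sketch below illustrates why spaces alone are not enough, and how word segmentation might be called. It assumes that uniseg is installed and that words() can be imported from uniseg.wordbreak; when the import fails it falls back to a single piece. The sample strings are made up for this example, and the customization hooks mentioned above are not shown.

```python
# Whitespace splitting finds no boundaries in text written without spaces.
japanese = "これは日本語の文です"
print(japanese.split())  # a single item: the whole string

english = "Hello, world!"

# Assumption: uniseg is installed and provides uniseg.wordbreak.words().
# The fallback keeps the sketch runnable without the package.
try:
    from uniseg.wordbreak import words
    pieces = list(words(english))
except ImportError:
    pieces = [english]

# Segmentation only inserts boundaries; joining the pieces restores the text.
print(pieces)
print("".join(pieces) == english)  # True
```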

sentence break

Sentence breaks are also common in text processing, but they are more contextual and less formal. The sentence breaking implementation in the package follows the specification in the UAX (Unicode Standard Annex), so it is simple and formal too. It should still be useful in some cases.
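A minimal sketch of sentence segmentation follows. It assumes that uniseg is installed and that sentences() can be imported from uniseg.sentencebreak; without the package it falls back to one piece. The sample text is illustrative only.

```python
text = "Hello. How are you? Fine, thanks."

# Assumption: uniseg is installed and provides
# uniseg.sentencebreak.sentences(). The fallback keeps the sketch runnable.
try:
    from uniseg.sentencebreak import sentences
    pieces = list(sentences(text))
except ImportError:
    pieces = [text]

for piece in pieces:
    print(repr(piece))

# The pieces partition the input, so nothing is lost or added.
print("".join(pieces) == text)  # True
```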

line break

Implementing the line breaking algorithm is one of the key features of this package. It is important for general text presentation in both CLI and GUI applications.
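To show how line break opportunities feed into text presentation, here is a rough sketch of greedy wrapping. It assumes that uniseg is installed and that line_break_units() can be imported from uniseg.linebreak; without the package the whole text is treated as one unbreakable unit. The greedy_wrap() helper is hypothetical and written for this example; it is not part of uniseg, and it naively counts one column per code point, which only holds for simple monospace text.

```python
text = "The quick brown fox jumps over the lazy dog."

# Assumption: uniseg is installed and provides
# uniseg.linebreak.line_break_units(). The fallback keeps the sketch runnable.
try:
    from uniseg.linebreak import line_break_units
    units = list(line_break_units(text))
except ImportError:
    units = [text]


def greedy_wrap(units, width):
    """Hypothetical helper: pack breakable units into lines of at most
    `width` columns, counting one column per code point."""
    lines, current = [], ""
    for unit in units:
        # Trailing spaces do not count against the width.
        if current and len((current + unit).rstrip()) > width:
            lines.append(current.rstrip())
            current = unit
        else:
            current += unit
    if current:
        lines.append(current.rstrip())
    return lines


for line in greedy_wrap(units, 16):
    print(line)
```

For real applications the package's own wrapping classes are the better choice, since they also account for character widths.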

Requirements

Python 3.9 or later.

Install

$ pip install uniseg

Changes

0.10.1 (2025-05-11)

Fix: line_break('\U00010000') returned the wrong property value.

0.10.0 (2025-02-23)

Add tailor argument for tt_wrap.

0.9.1 (2025-01-16)

Fix: the ambiguous_as_wide option did not work in uniseg.wrap.

0.9.0 (2024-11-07)

Unicode 16.0.0.
Rule-based grapheme cluster segmentation is back.
And this is the first release ever that passes all of the Unicode breaking tests!

0.8.1 (2024-08-13)

Fix: sentence_break('/') raised an exception. (Thanks to Nathaniel Mills)

0.8.0 (2024-02-08)

Unicode 15.0.0.
Regex-based grapheme cluster segmentation.
Drop support for Python versions < 3.8.

0.7.2 (2022-09-20)

Improve performance of Unicode lookups. PR by Max Bachmann.

0.7.1 (2015-05-02)

CHANGE: wrap.Wrapper.wrap(): returns the count of lines now.
Separate LICENSE from README.txt for packaging-related reasons in some environments.

0.7.0 (2015-02-27)

CHANGE: Stopped gathering all submodules' members in the top-level uniseg module.
CHANGE: Reform uniseg.wrap module and sample scripts.
Maintained uniseg.wrap module, and sample scripts work again.

0.6.4 (2015-02-10)

Add uniseg-dbpath console command, which just prints the path of ucd.sqlite3.
Include sample scripts under the package’s subdirectory.

0.6.3 (2015-01-25)

Python 3.4
Support modern setuptools, pip and wheel.

0.6.2 (2013-06-09)

Python 3.3

0.6.1 (2013-06-08)

Unicode 6.2.0

References


Release history

0.10.1 (Jan 9, 2026) (this version)
0.10.0 (Jan 23, 2025)
0.9.1 (Jan 15, 2025)
0.9.0 (Nov 7, 2024)
0.8.1 (Aug 12, 2024)
0.8.0 (Feb 10, 2024)
0.7.2 (Sep 26, 2022)
0.7.1.post2 (Mar 25, 2021)
0.7.1.post1 (Mar 25, 2021)
0.7.1 (May 6, 2015)
0.7.0 (Feb 27, 2015)
0.6.4 (Feb 10, 2015)
0.6.3 (Jan 24, 2015)
0.6.2 (Jun 9, 2013)
0.6.1 (Jun 8, 2013)
0.6.0 (Jun 8, 2013)

Download files


Source Distribution

uniseg-0.10.1.tar.gz (8.2 MB view details)

Uploaded Jan 9, 2026 Source

Built Distribution


uniseg-0.10.1-py3-none-any.whl (8.2 MB view details)

Uploaded Jan 9, 2026 Python 3

File details

Details for the file uniseg-0.10.1.tar.gz.

File metadata

Download URL: uniseg-0.10.1.tar.gz
Upload date: Jan 9, 2026
Size: 8.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for uniseg-0.10.1.tar.gz
Algorithm	Hash digest
SHA256	`5224267916fc01132cb2f7bf60882402dde72ab52c6977184827e4d9d6676de6`
MD5	`68ac6e69e6e235a4ef7cf1a7818b7bc2`
BLAKE2b-256	`fc5efc5b1c370dc523a580cb214ee7753c8631a2d4be8fc51283785415b62d92`


File details

Details for the file uniseg-0.10.1-py3-none-any.whl.

File metadata

Download URL: uniseg-0.10.1-py3-none-any.whl
Upload date: Jan 9, 2026
Size: 8.2 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for uniseg-0.10.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`74fe3f7b4ee46bbd703a70ea1e5cd28e04f63eee5af9a10388d93316c554e358`
MD5	`afe911d72ff71cc1163518a1fdfef876`
BLAKE2b-256	`806c9eb6d93ad8b9ac96cba13e6d2f419c316d8fdb91013e12dd478a99501699`

