
Determine Unicode text segmentations

Introduction

A Python package to determine Unicode text segmentations.

The full documentation, including the package reference, is available at https://uniseg-py.readthedocs.io.

Note (2022-09-26): This version (0.7.2) will be the last one that claims to support old Python interpreters. In practice, 2.x interpreters are no longer tested, and versions older than 3.8 are no longer considered target platforms. Compatibility code for 2.x will be removed in future releases; it remains only for historical reasons and my laziness.

Features

This package provides:

  • Functions to get Unicode Character Database (UCD) properties related to text segmentation.

  • Functions to determine segmentation boundaries of Unicode strings.

  • Classes that help implement Unicode-aware text wrapping in both console (monospace) and graphical (monospace / proportional) font environments.

Supported segmentations are:

code point

A code point is “any value in the Unicode codespace.” It is the basic unit for processing Unicode strings.

grapheme cluster

A grapheme cluster approximately represents a “user-perceived character.” It may be made up of one or more Unicode code points; e.g. “G” + acute accent is a single user-perceived character.

word break

Word boundaries are a familiar segmentation in many common text operations, e.g. as the unit for text highlighting or cursor jumping. Note that in some languages words cannot be determined by spaces or punctuation alone; languages such as Thai or Japanese require dictionaries to determine appropriate word boundaries. The package provides only a simple word breaking implementation that is based on scripts and does not use any dictionaries, but it also provides ways to customize the default behavior.

sentence break

Sentence breaks are also common in text processing, but they are more contextual and less formal. The sentence breaking implementation in the package follows the rules specified in UAX #29 (UAX: Unicode Standard Annex), so it is simple and formal too, but it should still be useful in some cases.

line break

Implementing the line breaking algorithm is one of the key features of this package. Line breaking matters for general text presentation in both CLI and GUI applications.
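
The difference between code points, grapheme clusters and words described above can be illustrated with the standard library alone. The following is a minimal sketch, not the package's implementation: naive_clusters is a hypothetical helper that only attaches combining marks to the preceding character, which is a small subset of the UAX #29 rules.

```python
import unicodedata

# "G" followed by U+0301 COMBINING ACUTE ACCENT, then "o":
# three code points, but two user-perceived characters.
text = "G\u0301o"

# Iterating over a Python 3 str yields one item per code point.
code_points = list(text)


def naive_clusters(s):
    """Group each combining mark with the character before it.

    Hypothetical helper for illustration only; real grapheme
    cluster rules (UAX #29) cover many more cases.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(len(code_points))           # number of code points
print(len(naive_clusters(text)))  # number of naive clusters

# Whitespace splitting cannot find word boundaries in Japanese text:
# the whole sentence comes back as a single item.
japanese = "これは日本語です"
print(japanese.split())
```

The last line shows why splitting on spaces is not enough for word breaking in scripts that do not separate words with spaces.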

Requirements

  • Python 2.7 / 3.4 / 3.5 / 3.6

Download

Source / binary distributions (PyPI)

https://pypi.python.org/pypi/uniseg

All sources and build tools etc. (Bitbucket)

https://bitbucket.org/emptypage/uniseg-py

Install

Just type:

% pip install uniseg

or download the archive and:

% python setup.py install
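
After installing, a quick check along the following lines may help confirm that the package works. This is a sketch under assumptions: the submodule and function names (uniseg.graphemecluster.grapheme_clusters, uniseg.wordbreak.words, uniseg.sentencebreak.sentences, uniseg.linebreak.line_break_units) should be verified against the package reference for the installed version, and segment is a hypothetical helper introduced here.

```python
import importlib.util


def segment(text):
    """Return the segmentations of text as lists of substrings.

    Hypothetical helper; returns None when uniseg is not installed.
    The imported names are assumed to match the installed version.
    """
    if importlib.util.find_spec("uniseg") is None:
        return None
    from uniseg.graphemecluster import grapheme_clusters
    from uniseg.wordbreak import words
    from uniseg.sentencebreak import sentences
    from uniseg.linebreak import line_break_units

    return {
        "grapheme clusters": list(grapheme_clusters(text)),
        "words": list(words(text)),
        "sentences": list(sentences(text)),
        "line break units": list(line_break_units(text)),
    }


sample = "Hello, world. Good-bye."
result = segment(sample)
if result is None:
    print("uniseg is not installed; run `pip install uniseg` first")
else:
    for name, units in result.items():
        print(name, units)
```

Each segmentation is expected to split the string without dropping characters, so joining the units of any one list should give back the original text.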

Changes

0.7.2 (2022-09-20)
0.7.1 (2015-05-02)
  • CHANGE: wrap.Wrapper.wrap(): returns the count of lines now.

  • Separate LICENSE from README.txt for packaging-related reasons in some environments.

0.7.0 (2015-02-27)
  • CHANGE: Stopped gathering all submodules’ members into the top-level uniseg module.

  • CHANGE: Reformed the uniseg.wrap module and sample scripts.

  • Maintained the uniseg.wrap module; the sample scripts work again.

0.6.4 (2015-02-10)
  • Add uniseg-dbpath console command, which just prints the path of ucd.sqlite3.

  • Include sample scripts under the package’s subdirectory.

0.6.3 (2015-01-25)
  • Python 3.4

  • Support modern setuptools, pip and wheel.

0.6.2 (2013-06-09)
  • Python 3.3

0.6.1 (2013-06-08)
  • Unicode 6.2.0

References

UAX #14: Unicode Line Breaking Algorithm (6.2.0)

https://www.unicode.org/reports/tr14/tr14-30.html

UAX #29: Unicode Text Segmentation (6.2.0)

https://www.unicode.org/reports/tr29/tr29-21.html
