Unicode tools.
Project description
Tools for showing information about Unicode characters (read as UTF-8) and performing normalization.
Copyright © 2018, Luís Gomes <luismsgomes@gmail.com>.
The following command line tools are provided:
- ucinfo
writes to stdout the name of each Unicode character read from stdin
- ucenum
enumerates on stdout all Unicode characters of a chosen category
- ucnorm
applies a standard Unicode normalization (NFC, NFKC, NFD or NFKD)
ucinfo
The ucinfo tool reads UTF-8 text from stdin and writes information about each character to stdout, one character per line. The output has five tab-separated columns:
- the character itself, if printable, or an escaped representation of it
- the decimal codepoint of the character
- the number of bytes that the character occupies
- the Unicode category of the character
- the Unicode name of the character
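The five columns can be approximated with Python's standard library. The following is a minimal sketch, not the tool's actual source: the helper name `describe`, the use of `unicode_escape` for non-printable characters, the UTF-8 byte count and the `<unnamed>` fallback are all assumptions made for illustration.

```python
import unicodedata


def describe(ch):
    """Build one tab-separated line for a single character (hypothetical helper)."""
    # Assumption: non-printable characters are shown with Python's escape syntax.
    shown = ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
    return "\t".join([
        shown,
        str(ord(ch)),                       # decimal codepoint
        str(len(ch.encode("utf-8"))),       # size in bytes, assuming UTF-8
        unicodedata.category(ch),           # two-letter category abbreviation
        unicodedata.name(ch, "<unnamed>"),  # control characters have no name
    ])


for ch in "a\u00e9\n":
    print(describe(ch))
```

For the letter `a` this sketch yields the columns `a`, `97`, `1`, `Ll` and `LATIN SMALL LETTER A`.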
ucenum
The ucenum tool takes a category abbreviation as argument and outputs a list of all characters belonging to that category. The categories are:
- L
Letter
- Lu
Letter, Uppercase
- Ll
Letter, Lowercase
- Lt
Letter, Titlecase
- Lm
Letter, Modifier
- Lo
Letter, Other
- M
Mark
- Mn
Mark, Nonspacing
- Mc
Mark, Spacing Combining
- Me
Mark, Enclosing
- N
Number
- Nd
Number, Decimal Digit
- Nl
Number, Letter
- No
Number, Other
- P
Punctuation
- Pc
Punctuation, Connector
- Pd
Punctuation, Dash
- Ps
Punctuation, Open
- Pe
Punctuation, Close
- Pi
Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
- Pf
Punctuation, Final quote (may behave like Ps or Pe depending on usage)
- Po
Punctuation, Other
- S
Symbol
- Sm
Symbol, Math
- Sc
Symbol, Currency
- Sk
Symbol, Modifier
- So
Symbol, Other
- Z
Separator
- Zs
Separator, Space
- Zl
Separator, Line
- Zp
Separator, Paragraph
- C
Other
- Cc
Other, Control
- Cf
Other, Format
- Cs
Other, Surrogate
- Co
Other, Private Use
- Cn
Other, Not Assigned
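Enumerating a category amounts to scanning the whole codepoint range and filtering by category. A minimal sketch follows, assuming that a one-letter abbreviation such as `L` matches every subcategory that starts with it; the helper name `chars_in_category` is hypothetical and the real tool may work differently.

```python
import sys
import unicodedata


def chars_in_category(abbrev):
    """Yield every character whose category starts with abbrev (hypothetical helper)."""
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        # Assumption: "L" selects Lu, Ll, Lt, Lm and Lo by prefix matching.
        if unicodedata.category(ch).startswith(abbrev):
            yield ch


# Zl is a convenient test case because it holds a single character, U+2028.
for ch in chars_in_category("Zl"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The results depend on the Unicode database bundled with the Python interpreter, so larger categories can differ in size between Python versions.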
ucnorm
This program reads UTF-8 text from stdin and writes it to stdout after applying the specified normalization algorithm.
The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in more than one way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
Even if two Unicode strings look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.
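The comparison problem can be reproduced directly with `unicodedata.normalize` from Python's standard library. This short sketch uses the C WITH CEDILLA example from the text; the variable names are chosen for illustration only.

```python
import unicodedata

precomposed = "\u00C7"      # LATIN CAPITAL LETTER C WITH CEDILLA
combining = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# Both render the same, yet the codepoint sequences differ.
print(precomposed == combining)  # False

# After normalizing both to the same form they compare equal.
same_form = unicodedata.normalize("NFC", combining)
print(same_form == precomposed)  # True
```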
For each character, there are two normal forms:
Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.
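The two canonical forms can be compared on U+212B (ANGSTROM SIGN), a character chosen here for illustration because its NFC result differs from the input. The helper `codepoints` is hypothetical and only formats the output.

```python
import unicodedata


def codepoints(text):
    """Format a string as a list of U+XXXX codepoints (hypothetical helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


angstrom = "\u212B"  # ANGSTROM SIGN

# NFD decomposes fully: capital A followed by COMBINING RING ABOVE.
print(codepoints(unicodedata.normalize("NFD", angstrom)))  # U+0041 U+030A

# NFC decomposes, then recomposes to LATIN CAPITAL LETTER A WITH RING ABOVE.
print(codepoints(unicodedata.normalize("NFC", angstrom)))  # U+00C5
```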
In addition to these two forms, there are two additional normal forms based on compatibility equivalence:
Normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents.
Normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.
Compatibility decomposition ensures that equivalent characters will compare equal (i.e. have the same codepoints). In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).
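The effect of the compatibility forms can be seen on the ROMAN NUMERAL ONE example. The sketch below also adds U+FB01 (LATIN SMALL LIGATURE FI) and a mixed string as extra illustrations that are not taken from the text above.

```python
import unicodedata

roman_one = "\u2160"  # ROMAN NUMERAL ONE

# Canonical normalization leaves compatibility characters untouched.
print(unicodedata.normalize("NFC", roman_one) == roman_one)  # True

# Compatibility normalization replaces them with their equivalents.
print(unicodedata.normalize("NFKD", roman_one) == "I")  # True
print(unicodedata.normalize("NFKD", "\uFB01") == "fi")  # True

# NFKD leaves the result decomposed, NFKC recomposes it afterwards.
mixed = "\u00C7\u2160"
nfkd = unicodedata.normalize("NFKD", mixed)
nfkc = unicodedata.normalize("NFKC", mixed)
print(len(nfkd), len(nfkc))  # 3 2
```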
This program uses the normalization algorithms implemented in Python’s standard library. See: https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize
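A filter of this kind can be sketched in a few lines around `unicodedata.normalize`. This is an assumption about the general shape of such a program, not the actual ucnorm source; the function name `normalize_stream` is hypothetical, and the sketch normalizes line by line.

```python
import io
import unicodedata


def normalize_stream(form, infile, outfile):
    """Copy infile to outfile, normalizing each line (hypothetical helper)."""
    for line in infile:
        outfile.write(unicodedata.normalize(form, line))


# Demonstration with in-memory streams; passing sys.stdin and sys.stdout
# instead would turn this into a command line filter.
source = io.StringIO("C\u0327a\n\u2160\n")
result = io.StringIO()
normalize_stream("NFKC", source, result)
print(ascii(result.getvalue()))
```

Valid values for `form` are the four names accepted by `unicodedata.normalize`: `NFC`, `NFKC`, `NFD` and `NFKD`.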
Download files
Source Distribution
File details
Details for the file uctools-1.3.0.tar.gz.
File metadata
- Download URL: uctools-1.3.0.tar.gz
- Upload date:
- Size: 4.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.8.0 tqdm/4.45.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6bc0d947215af420534c0f62dcc0272f0b6066ed5545eec88e9a3c2313f6b826
MD5 | b2c12b534fc0234bdf5dfc50950955be
BLAKE2b-256 | 04cb70ed842d9a43460eedaa11f7503b4ab6537b43b63f0d854d59d8e150fac1