Unicode category database
Project description
unicategories
Unicode category database, generated and cached on setup.
This module exposes a category dictionary containing RangeGroup
instances,
containing all unicode category character ranges detected on your system.
Example
from unicategories import categories
upperchars = categories['Lu'].characters() # iterator
print('Unicode uppercase caracters are "%s"' % ''.join(upperchars))
# Unicode uppercase caracters are "ABCDEF..."
RangeGroup
Immutable iterable (based on tuple, with some useful methods) of (start, end)
tuples being, like python's range
, open at the end.
This method have been chosen for memory efficiency, storing individually all characters on memory would take a lot of memory.
RangeGroup class provides the following methods:
range_group.characters()
type: () -> typing.Iterator[str]
Get iterator with all characters on this range group.
:returns: iterator of characters (str of size 1)
range_group.codes()
type: () -> typing.Iterator[int]
Get iterator for all unicode code points contained in this range group.
:returns: iterator of character indexes (int)
range_group.has(character)
type: (typing.Union[str, int]) -> bool
Get if character (or character code point) is part of this range group.
:param character: character or unicode code point to look for
:returns: True if character is contained by any range, False otherwise
Unicode categories
Taken from wikipedia.
Value | Category Major, minor | Basic type | Character assigned | Fixed | Remarks |
---|---|---|---|---|---|
Lu | Letter, uppercase | Graphic | Character | ||
Ll | Letter, lowercase | Graphic | Character | ||
Lt | Letter, titlecase | Graphic | Character | Ligatures containing uppercase followed by lowercase letters (e.g., Dž , Lj , Nj , and Dz ) |
|
Lm | Letter, modifier | Graphic | Character | ||
Lo | Letter, other | Graphic | Character | ||
Mn | Mark, nonspacing | Graphic | Character | ||
Mc | Mark, spacing combining | Graphic | Character | ||
Me | Mark, enclosing | Graphic | Character | ||
Nd | Number, decimal digit | Graphic | Character | All these, and only these, have Numeric Type = De | |
Nl | Number, letter | Graphic | Character | Numerals composed of letters or letterlike symbols (e.g., Roman numerals ) | |
No | Number, other | Graphic | Character | E.g., vulgar fractions , superscript and subscript digits | |
Pc | Punctuation, connector | Graphic | Character | Includes "_" underscore | |
Pd | Punctuation, dash | Graphic | Character | Includes several hyphen characters | |
Ps | Punctuation, open | Graphic | Character | Opening bracket characters | |
Pe | Punctuation, close | Graphic | Character | Closing bracket characters | |
Pi | Punctuation, initial quote | Graphic | Character | Opening quotation mark . Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage | |
Pf | Punctuation, final quote | Graphic | Character | Closing quotation mark. May behave like Ps or Pe depending on usage | |
Po | Punctuation, other | Graphic | Character | ||
Sm | Symbol, math | Graphic | Character | ||
Sc | Symbol, currency | Graphic | Character | ||
Sk | Symbol, modifier | Graphic | Character | ||
So | Symbol, other | Graphic | Character | ||
Zs | Separator, space | Graphic | Character | Includes the space, but not TAB , CR , or LF , which are Cc | |
Zl | Separator, line | Format | Character | Only U+2028 LINE SEPARATOR (LSEP) | |
Zp | Separator, paragraph | Format | Character | Only U+2029 PARAGRAPH SEPARATOR (PSEP) | |
Cc | Other, control | Control | Character | Fixed 65 | No name , <control> |
Cf | Other, format | Format | Character | Includes the soft hyphen , control characters to support bi-directional text , and language tag characters | |
Cs | Other, surrogate | Surrogate | Not (but abstract) | Fixed 2,048 | No name , <surrogate> |
Co | Other, private use | Private-use | Not (but abstract) | Fixed 137,468 total: 6,400 in BMP , 131,068 in Planes 15–16 | No name , <private-use> |
Cn | Other, not assigned | Noncharacter | Not | Fixed 66 | No name , <noncharacter> |
Cn | Other, not assigned | Reserved | Not | Not fixed | No name , <reserved> |
In addition to that, unicategories provide general categories L
, M
, N
, P
, S
, Z
and C
.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.