unicategories
=============
Unicode category database, generated on setup.

This module exposes a `categories` dictionary whose values are `RangeGroup`
objects: immutable iterables (based on tuple, with some useful methods) of
(start, end) tuples that are, like Python's `range`, open at the end.

This representation was chosen for memory efficiency: storing every character
individually would take a lot of memory.

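
The half-open range representation can be illustrated with the standard
library alone. The following is a minimal sketch, not the package's actual
generator: it assumes Python 3 and the interpreter's bundled `unicodedata`
tables, and `category_ranges` is a hypothetical helper name introduced only
for this illustration.

.. code:: python

    import sys
    import unicodedata


    def category_ranges(category):
        """Collapse a category's code points into half-open (start, end) tuples.

        Hypothetical helper for illustration; not part of unicategories.
        """
        ranges = []
        start = None
        for code in range(sys.maxunicode + 1):
            if unicodedata.category(chr(code)) == category:
                if start is None:
                    start = code  # a new run of matching code points begins
            elif start is not None:
                ranges.append((start, code))  # end is exclusive, like range()
                start = None
        if start is not None:
            ranges.append((start, sys.maxunicode + 1))
        return tuple(ranges)


    lu_ranges = category_ranges('Lu')
    lu_count = sum(end - start for start, end in lu_ranges)
    print(len(lu_ranges), 'ranges cover', lu_count, 'uppercase code points')
    print(lu_ranges[0])  # (65, 91), i.e. 'A' through 'Z'

Storing one tuple per run instead of one entry per character is what keeps
the table small, since the ASCII capitals alone collapse 26 entries into one.
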
The `RangeGroup` class provides the following methods:

* **RangeGroup.characters(self)**::

    Get an iterator over all characters in this range group.

    :yields: characters (str of length 1)
    :ytype: str

* **RangeGroup.codes(self)**::

    Get an iterator over all Unicode code points contained in this range group.

    :yields: code points (int)
    :ytype: int

* **RangeGroup.has(self, character)**::

    Check whether a character (or its code point) is contained in any range
    of this range group.

    :param character: character or Unicode code point to look for
    :type character: str or int
    :returns: True if character is contained in any range, False otherwise
    :rtype: bool

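
To make the semantics of these three methods concrete, here is a small
stand-in class. It is a sketch under assumptions: `SketchRangeGroup` is a
hypothetical name, and the real `RangeGroup` may be implemented differently
(for example with a faster lookup in `has`).

.. code:: python

    class SketchRangeGroup(tuple):
        """Hypothetical stand-in for RangeGroup: a tuple of (start, end) pairs."""

        def codes(self):
            # Yield every code point; each range is open at the end.
            for start, end in self:
                yield from range(start, end)

        def characters(self):
            # Map each code point to its one-character string.
            return map(chr, self.codes())

        def has(self, character):
            # Accept either a one-character str or an int code point.
            code = ord(character) if isinstance(character, str) else character
            return any(start <= code < end for start, end in self)


    group = SketchRangeGroup([(0x41, 0x5B), (0xC0, 0xD7)])
    print(''.join(group.characters()))
    print(group.has('Z'), group.has(0x5B))  # True False

The second check shows the open end in practice: 0x5B is the `end` of the
first range, so it is not contained.
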
Example
-------
.. code:: python

    from unicategories import categories

    upperchars = categories['Lu'].characters()  # iterator
    print('Unicode uppercase characters are "%s"' % ''.join(upperchars))
    # Unicode uppercase characters are "ABCDEF..."

Unicode categories
------------------
Taken from `Wikipedia <https://en.wikipedia.org/wiki/Template:General_Category_(Unicode)>`_.

+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Value  | Category (major, minor)    | Basic type     | Character assigned     | Fixed                                                       | Remarks                                                                                                                   |
+========+============================+================+========================+=============================================================+===========================================================================================================================+
| Lu     | Letter, uppercase          | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Ll     | Letter, lowercase          | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Lt     | Letter, titlecase          | Graphic        | Character              |                                                             | Ligatures containing uppercase followed by lowercase letters (e.g., Dž, Lj, Nj, and Dz)                                   |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Lm     | Letter, modifier           | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Lo     | Letter, other              | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Mn     | Mark, nonspacing           | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Mc     | Mark, spacing combining    | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Me     | Mark, enclosing            | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Nd     | Number, decimal digit      | Graphic        | Character              |                                                             | All these, and only these, have Numeric Type = De                                                                         |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Nl     | Number, letter             | Graphic        | Character              |                                                             | Numerals composed of letters or letterlike symbols (e.g., Roman numerals)                                                 |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| No     | Number, other              | Graphic        | Character              |                                                             | E.g., vulgar fractions, superscript and subscript digits                                                                  |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pc     | Punctuation, connector     | Graphic        | Character              |                                                             | Includes "_" underscore                                                                                                   |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pd     | Punctuation, dash          | Graphic        | Character              |                                                             | Includes several hyphen characters                                                                                        |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Ps     | Punctuation, open          | Graphic        | Character              |                                                             | Opening bracket characters                                                                                                |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pe     | Punctuation, close         | Graphic        | Character              |                                                             | Closing bracket characters                                                                                                |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pi     | Punctuation, initial quote | Graphic        | Character              |                                                             | Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage  |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pf     | Punctuation, final quote   | Graphic        | Character              |                                                             | Closing quotation mark. May behave like Ps or Pe depending on usage                                                       |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Po     | Punctuation, other         | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Sm     | Symbol, math               | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Sc     | Symbol, currency           | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Sk     | Symbol, modifier           | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| So     | Symbol, other              | Graphic        | Character              |                                                             |                                                                                                                           |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Zs     | Separator, space           | Graphic        | Character              |                                                             | Includes the space, but not TAB, CR, or LF, which are Cc                                                                  |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Zl     | Separator, line            | Format         | Character              |                                                             | Only U+2028 LINE SEPARATOR (LSEP)                                                                                         |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Zp     | Separator, paragraph       | Format         | Character              |                                                             | Only U+2029 PARAGRAPH SEPARATOR (PSEP)                                                                                    |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cc     | Other, control             | Control        | Character              | Fixed 65                                                    | No name, <control>                                                                                                        |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cf     | Other, format              | Format         | Character              |                                                             | Includes the soft hyphen, control characters to support bi-directional text, and language tag characters                  |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cs     | Other, surrogate           | Surrogate      | Not (but abstract)     | Fixed 2,048                                                 | No name, <surrogate>                                                                                                      |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Co     | Other, private use         | Private-use    | Not (but abstract)     | Fixed 137,468 total: 6,400 in BMP, 131,068 in Planes 15–16  | No name, <private-use>                                                                                                    |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cn     | Other, not assigned        | Noncharacter   | Not                    | Fixed 66                                                    | No name, <noncharacter>                                                                                                   |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cn     | Other, not assigned        | Reserved       | Not                    | Not fixed                                                   | No name, <reserved>                                                                                                       |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+