unicategories
=============

Unicode category database, generated on setup.

This module exposes a ``categories`` dictionary whose values are
``RangeGroup`` instances. A ``RangeGroup`` is an immutable iterable (based on
tuple, with some useful methods) of (start, end) tuples which are, like
Python's ``range``, open at the end.

This representation was chosen for memory efficiency: storing every character
individually would take a lot of memory.
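
To make the memory argument concrete, here is a minimal sketch of how half-open ranges for one category could be derived from the standard library's ``unicodedata`` module. The helper name ``category_ranges`` is hypothetical, and this is not necessarily how the package builds its database on setup.

.. code:: python

    import sys
    import unicodedata


    def category_ranges(category):
        """Collect half-open (start, end) code point ranges for one category."""
        ranges = []
        start = None
        for code in range(sys.maxunicode + 1):
            if unicodedata.category(chr(code)) == category:
                if start is None:
                    start = code  # a new run of matching code points begins
            elif start is not None:
                ranges.append((start, code))  # end is exclusive, like range()
                start = None
        if start is not None:
            # the last run reaches the end of the code space
            ranges.append((start, sys.maxunicode + 1))
        return tuple(ranges)


    uppercase = category_ranges('Lu')
    total = sum(end - start for start, end in uppercase)
    print(uppercase[0])  # (65, 91), i.e. 'A' through 'Z'
    print(len(uppercase), 'ranges cover', total, 'characters')

A run of consecutive characters, such as ``A`` to ``Z``, costs a single pair of integers however long the run is.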

The ``RangeGroup`` class provides the following methods:

* **RangeGroup.characters(self)**::

    Get iterator with all characters on this range group.

    :yields: iterator of characters (str of size 1)
    :ytype: str

* **RangeGroup.codes(self)**::

    Get iterator for all unicode code points contained in this range group.

    :yields: iterator of character index (int)
    :ytype: int

* **RangeGroup.has(self, character)**::

    Get if character (or character code point) is contained by any range on
    this range group.

    :param character: character or unicode code point to look for
    :type character: str or int
    :returns: True if character is contained by any range, False otherwise
    :rtype: bool
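
The behaviour described by these three methods can be sketched with a small stand-in class. ``SketchRangeGroup`` is a hypothetical name used only for illustration; it follows the descriptions above and is not the package's actual implementation.

.. code:: python

    class SketchRangeGroup(tuple):
        """Hypothetical stand-in: a tuple of half-open (start, end) pairs."""

        def codes(self):
            # Expand every range lazily into its code points.
            for start, end in self:
                yield from range(start, end)

        def characters(self):
            # Map each code point to its one-character string.
            return map(chr, self.codes())

        def has(self, character):
            # Accept either a one-character str or an int code point.
            code = ord(character) if isinstance(character, str) else character
            return any(start <= code < end for start, end in self)


    group = SketchRangeGroup([(0x41, 0x5B), (0xC0, 0xD7)])
    print(''.join(group.characters())[:26])  # ABCDEFGHIJKLMNOPQRSTUVWXYZ
    print(group.has('Z'), group.has(0x5B))   # True False

Because the end of each range is exclusive, ``0x5B`` (the character after ``Z``) is not a member of the first range.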

Example
-------

.. code:: python

    from unicategories import categories

    upperchars = categories['Lu'].characters()  # iterator
    print('Unicode uppercase characters are "%s"' % ''.join(upperchars))
    # Unicode uppercase characters are "ABCDEF..."

Unicode categories
------------------

Taken from `Wikipedia <https://en.wikipedia.org/wiki/Template:General_Category_(Unicode)>`_.

+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Value | Category (major, minor) | Basic type | Character assigned | Fixed | Remarks |
+========+============================+================+========================+=============================================================+===========================================================================================================================+
| Lu | Letter, uppercase | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Ll | Letter, lowercase | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Lt | Letter, titlecase | Graphic | Character | | Ligatures containing uppercase followed by lowercase letters (e.g., Dž, Lj, Nj, and Dz) |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Lm | Letter, modifier | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Lo | Letter, other | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Mn | Mark, nonspacing | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Mc | Mark, spacing combining | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Me | Mark, enclosing | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Nd | Number, decimal digit | Graphic | Character | | All these, and only these, have Numeric Type = De |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Nl | Number, letter | Graphic | Character | | Numerals composed of letters or letterlike symbols (e.g., Roman numerals) |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| No | Number, other | Graphic | Character | | E.g., vulgar fractions, superscript and subscript digits |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pc | Punctuation, connector | Graphic | Character | | Includes "_" underscore |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pd | Punctuation, dash | Graphic | Character | | Includes several hyphen characters |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Ps | Punctuation, open | Graphic | Character | | Opening bracket characters |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pe | Punctuation, close | Graphic | Character | | Closing bracket characters |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pi | Punctuation, initial quote | Graphic | Character | | Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Pf | Punctuation, final quote | Graphic | Character | | Closing quotation mark. May behave like Ps or Pe depending on usage |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Po | Punctuation, other | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Sm | Symbol, math | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Sc | Symbol, currency | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Sk | Symbol, modifier | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| So | Symbol, other | Graphic | Character | | |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Zs | Separator, space | Graphic | Character | | Includes the space, but not TAB, CR, or LF, which are Cc |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Zl | Separator, line | Format | Character | | Only U+2028 LINE SEPARATOR (LSEP) |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Zp | Separator, paragraph | Format | Character | | Only U+2029 PARAGRAPH SEPARATOR (PSEP) |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cc | Other, control | Control | Character | Fixed 65 | No name, <control> |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cf | Other, format | Format | Character | | Includes the soft hyphen, control characters to support bi-directional text, and language tag characters |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cs | Other, surrogate | Surrogate | Not (but abstract) | Fixed 2,048 | No name, <surrogate> |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Co | Other, private use | Private-use | Not (but abstract) | Fixed 137,468 total: 6,400 in BMP, 131,068 in Planes 15–16 | No name, <private-use> |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cn | Other, not assigned | Noncharacter | Not | Fixed 66 | No name, <noncharacter> |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Cn | Other, not assigned | Reserved | Not | Not fixed | No name, <reserved> |
+--------+----------------------------+----------------+------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+
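
To check which of the values above applies to a given character, the standard library's ``unicodedata.category`` can be used. The sample characters below are picked only for illustration, and the snippet does not depend on this package.

.. code:: python

    import unicodedata

    # A few characters mentioned in the remarks column of the table.
    samples = ['A', 'a', '\u01c5', '5', '_', ' ', '\t', '\u2028', '\u00ad']

    # Map each sample to its two-letter general category value.
    found = {char: unicodedata.category(char) for char in samples}

    for char, value in found.items():
        print('U+%04X %s' % (ord(char), value))

The two-letter values returned this way are the same keys used to index the category dictionary, as with ``'Lu'`` in the example above.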
