Unicode blocks data utility module

🧱 Unicode_Blocks 🧱

unicode_blocks is a simple utility module for working with Unicode block data. Unicode blocks are contiguous ranges of code points defined by the Unicode Standard, used to group characters with generally similar purposes or origins.

Usage

Install this package from PyPI:

pip install unicode-blocks-py

The module interface is heavily inspired by Java's Character.UnicodeBlock class and the Rust unicode_blocks module.

>>> import unicode_blocks
>>> unicode_major_version = int(unicode_blocks.__version__.split(".")[0])

# To get the Unicode block of a character, pass a string of length 1,
# UTF-8 encoded bytes, or a positive integer representing a Unicode code point.
# The following are equivalent: each refers to the character 'a'.
>>> block = unicode_blocks.of('a')
>>> block2 = unicode_blocks.of(b'\x61')
>>> block3 = unicode_blocks.of(97)
>>> assert block == block2 == block3

# To get a Unicode block by name, pass the block name.
# Case, whitespace, dashes, underscores and the prefix "is" are ignored for comparison. See UAX44-LM3.
# Block name aliases from PropertyValueAliases are also accepted here.
>>> ascii_block = unicode_blocks.for_name("BASIC_LATIN")
>>> ascii_block2 = unicode_blocks.for_name("basiclatin")
>>> ascii_block3 = unicode_blocks.for_name("isBasicLatin")
>>> from unicode_blocks import BASIC_LATIN
>>> assert ascii_block == ascii_block2 == ascii_block3 == BASIC_LATIN
>>> if unicode_major_version >= 6:
...     ascii_block4 = unicode_blocks.for_name("ASCII")
...     assert ascii_block4 == BASIC_LATIN

# Code points outside every defined block, such as the unassigned one below, receive the
# NO_BLOCK object as per rule D10b in Section 3.4, *Characters and Encoding*, of the Unicode Standard.
>>> assert unicode_blocks.of(0xEDCBA) == unicode_blocks.NO_BLOCK

# Iterate over all Unicode blocks defined in this version.
# NO_BLOCK is not included in the list of all blocks.
>>> for block in unicode_blocks.all():
...     print(block) # doctest: +ELLIPSIS
UnicodeBlock(...)

# Pythonic helpers: blocks can be compared, with earlier blocks ordering before later blocks.
# This is useful for sorting a list of UnicodeBlocks.
>>> latin1_block = unicode_blocks.for_name("Latin-1 Supplement")
>>> assert ascii_block < latin1_block

# Get the total number of code points in a block's range. This does not indicate how many of them are assigned.
>>> assert len(ascii_block) == 128

# Additional helpers: check for assigned characters in the block
# Data is loaded from UCD and may change between Unicode versions
>>> assert len(ascii_block.assigned_ranges) == 128
>>> assert 'B' in ascii_block.assigned_ranges

# Example where defined Unicode block range is not fully utilised
>>> bopo_block = unicode_blocks.of('ㄅ')
>>> assert len(bopo_block) == 48
>>> bopo_assigned_count = 41 if unicode_major_version < 10 else 42 if unicode_major_version == 10 else 43
>>> assert len(bopo_block.assigned_ranges) == bopo_assigned_count  # first 5 code points should be unassigned, at least in <=17.0
>>> assert len(bopo_block) != len(bopo_block.assigned_ranges)

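As a rough illustration of how a code point can be mapped to its block, here is a minimal standard-library sketch using a sorted range table and binary search. It is not this package's implementation: the names `BLOCKS`, `to_code_point` and `block_name_of` are hypothetical, the table is a three-entry excerpt, and the string "No_Block" only stands in for the NO_BLOCK object.

```python
from bisect import bisect_right

# Hand-picked excerpt of (start, end, name) block ranges; the real data has many more.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x3100, 0x312F, "Bopomofo"),
]
STARTS = [start for start, _, _ in BLOCKS]
NO_BLOCK = "No_Block"  # stand-in for the value used when no block matches


def to_code_point(value):
    """Accept a 1-character str, UTF-8 encoded bytes, or an int code point."""
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    if isinstance(value, str):
        if len(value) != 1:
            raise ValueError("expected exactly one character")
        return ord(value)
    return value


def block_name_of(value):
    """Return the name of the block containing the code point, or NO_BLOCK."""
    cp = to_code_point(value)
    i = bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return NO_BLOCK
```

With this excerpt, `block_name_of('a')`, `block_name_of(b'\x61')` and `block_name_of(97)` all resolve to the same entry, while a code point past every range in the table falls through to the stand-in value.
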
All Unicode block objects are available directly in the package namespace, or under the blocks module.

# both are equivalent
>>> from unicode_blocks import BASIC_LATIN
>>> from unicode_blocks.blocks import BASIC_LATIN

Each block also exposes several names:

>>> from unicode_blocks import BASIC_LATIN
>>> assert BASIC_LATIN.name == "Basic Latin"  # Official Unicode name as in Blocks.txt
>>> assert BASIC_LATIN.normalised_name == "BASICLATIN"  # Normalised name under UAX44-LM3
>>> assert BASIC_LATIN.variable_name == "BASIC_LATIN"  # Variable name in `unicode_blocks.blocks`
>>> if unicode_major_version >= 6:
...     assert BASIC_LATIN.aliases == ["ASCII"]  # Official block aliases as in PropertyValueAliases.txt

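The loose matching rule referred to above (UAX44-LM3: ignore case, whitespace, underscores, hyphens and an initial "is") can be sketched with the standard library alone. The helper name `loose_key` is hypothetical and this is only an approximation of the rule, not the package's own code. Upper-casing is chosen here simply to match the `normalised_name` value shown above.

```python
import re


def loose_key(name: str) -> str:
    """Normalise a block name for loose comparison (sketch of UAX44-LM3)."""
    # Drop whitespace, underscores and hyphens, then fold case.
    key = re.sub(r"[\s_-]+", "", name).upper()
    # Drop an initial "is" prefix (already upper-cased at this point).
    if key.startswith("IS"):
        key = key[2:]
    return key
```

Under this sketch, "BASIC_LATIN", "basiclatin" and "isBasicLatin" all reduce to the same key, which is why they can name the same block.
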
Additional CJK utilities are provided, modelled on the Rust version of the module. Selected samples are shown below.

>>> from unicode_blocks import cjk
>>> assert cjk.is_cjk('中')
>>> assert cjk.is_japanese_kana('あ')
>>> assert cjk.is_korean_hangul('글')
>>> assert cjk.is_cjk_punctuation('。')

>>> from unicode_blocks import blocks
>>> assert cjk.is_ideographic_block(blocks.CJK_UNIFIED_IDEOGRAPHS)
>>> assert cjk.is_cjk_block(blocks.KANGXI_RADICALS)
>>> assert cjk.is_japanese_block(blocks.KATAKANA_PHONETIC_EXTENSIONS)
>>> assert cjk.is_korean_block(blocks.HANGUL_COMPATIBILITY_JAMO)

[!WARNING]
Checking char in unicode_blocks.for_name("is_CJK") is NOT the same as cjk.is_cjk(char)!
unicode_blocks.for_name("is_CJK") resolves the "CJK" block alias, which names only the CJK Unified Ideographs block, while cjk.is_cjk checks (roughly) all Unicode blocks related to CJK, including kana, hangul and punctuation.

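The difference described in the warning can be illustrated with plain code point ranges. This is a standard-library sketch only: the function names are hypothetical, and the broad list below is a small hand-picked excerpt, whereas the real cjk helpers cover more blocks.

```python
# The single block that the "CJK" alias names: CJK Unified Ideographs (U+4E00..U+9FFF).
CJK_UNIFIED_IDEOGRAPHS = range(0x4E00, 0xA000)

# Excerpt of blocks a broad "is this CJK?" check might include (assumption for illustration).
BROAD_CJK_RANGES = [
    CJK_UNIFIED_IDEOGRAPHS,
    range(0x3000, 0x3040),  # CJK Symbols and Punctuation
    range(0x3040, 0x30A0),  # Hiragana
    range(0xAC00, 0xD7B0),  # Hangul Syllables
]


def in_cjk_alias_block(char: str) -> bool:
    """True only for characters inside the CJK Unified Ideographs block."""
    return ord(char) in CJK_UNIFIED_IDEOGRAPHS


def is_broadly_cjk(char: str) -> bool:
    """True for characters inside any of the listed CJK-related blocks."""
    return any(ord(char) in block for block in BROAD_CJK_RANGES)
```

Here '中' passes both checks, while 'あ', '글' and '。' pass only the broad one.
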
To check which version of the Unicode data is in use, read the __version__ variable in the package namespace. (Bug-fix releases use the +1 notation.)

$ python3
>>> import unicode_blocks
>>> unicode_blocks.__version__  # doctest: +SKIP
'17.0.0'

The version follows the Unicode version of the data files, optionally followed by a plus sign and this module's own numbering for bug fixes, i.e. <Unicode major.minor.patch>(+<additional numbering>).

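A version string in this shape can be split into its Unicode part and the optional bug-fix number with a few lines of standard-library code. The helper name `parse_version` is hypothetical, and treating a missing suffix as 0 is an assumption made for this sketch.

```python
def parse_version(version: str):
    """Split '<major.minor.patch>[+<n>]' into a Unicode version tuple and a bug-fix number."""
    unicode_part, _, fix_part = version.partition("+")
    unicode_version = tuple(int(piece) for piece in unicode_part.split("."))
    # Assumption: no "+<n>" suffix means bug-fix number 0.
    bugfix = int(fix_part) if fix_part else 0
    return unicode_version, bugfix
```

For example, `parse_version("17.0.0+1")` separates the Unicode version `(17, 0, 0)` from the bug-fix number `1`, and the tuple form allows comparisons such as `unicode_version >= (6,)`.
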
Update

To update the blocks data from Unicode Character Database, update the project.version key in pyproject.toml to the Unicode version number, and then run python3 build_blocks.py. This will update the src/unicode_blocks/blocks.py file, which is automatically generated from UCD data.

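For orientation, the UCD Blocks.txt file lists one block per line as `START..END; Name`, with `#` starting a comment. The sketch below parses a short sample in that format. It is not the project's build_blocks.py, and the names `SAMPLE`, `LINE` and `parse_blocks` are hypothetical.

```python
import re

# A few lines in the style of the UCD Blocks.txt file.
SAMPLE = """\
# Blocks-style sample
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
3100..312F; Bopomofo
"""

LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+);\s*(.+?)\s*$")


def parse_blocks(text: str):
    """Return a list of (start, end, name) tuples parsed from Blocks.txt-style text."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        match = LINE.match(line)
        if match:
            start, end, name = match.groups()
            blocks.append((int(start, 16), int(end, 16), name))
    return blocks
```

Each tuple gives the inclusive range of a block, so `end - start + 1` is the size of the block's range.
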
Most of these steps should be directly runnable through GitHub Actions.

Contributing

Contributions are welcome! Please follow these steps:

  1. Clone the repository and install it in development mode:
    git clone https://github.com/NightFurySL2001/unicode-blocks.git
    cd unicode-blocks
    pip install -e .
    
  2. Create a new branch for your feature or bug fix.
  3. Work on the feature and run or develop relevant test cases.
  4. Test the changes by running pytest.
  5. Ensure the examples in this README.md still pass by running python -m doctest README.md.
  6. Submit a pull request with a clear description of your changes.

License

This project is licensed under the MIT License.

Acknowledgments
