Unicode blocks data utility module
🧱 Unicode_Blocks 🧱
unicode_blocks is a simple utility module for working with Unicode blocks data. Unicode blocks are continuous ranges of code points defined by the Unicode standard, used to group characters with generally similar purposes or origins.
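Because blocks are contiguous, non-overlapping ranges, finding the block of a code point reduces to a binary search over sorted range starts. The following is a minimal sketch of that idea using only the standard library; the table is a hand-picked excerpt, and `SAMPLE_BLOCKS` and `sample_block_of` are hypothetical names for illustration, not part of this module.

```python
from bisect import bisect_right

# A hand-picked excerpt of block ranges: (first, last, name).
# Real data would cover every block; this table is illustrative only.
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x3100, 0x312F, "Bopomofo"),
]
_STARTS = [first for first, _, _ in SAMPLE_BLOCKS]


def sample_block_of(char: str) -> str:
    """Return the block name for one character, or "No_Block" if the sample table has no match."""
    code_point = ord(char)
    # Find the last range whose start is <= code_point.
    index = bisect_right(_STARTS, code_point) - 1
    if index >= 0:
        first, last, name = SAMPLE_BLOCKS[index]
        if first <= code_point <= last:
            return name
    return "No_Block"


print(sample_block_of("a"))       # Basic Latin
print(sample_block_of("\u00e9"))  # Latin-1 Supplement
print(sample_block_of("\u3105"))  # Bopomofo
```

Since the sample table is incomplete, code points from blocks it omits are reported as "No_Block" here even though the full Unicode data does place them in a block.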
Usage
Install this package from PyPI:
pip install unicode-blocks-py
The module interface is heavily inspired by the Java Character.UnicodeBlock class and the Rust unicode_blocks module.
>>> import unicode_blocks
>>> unicode_major_version = int(unicode_blocks.__version__.split(".")[0])
# To get Unicode block of a character, input a character string of length 1,
# UTF-8 encoded bytes, or a positive integer representing a Unicode code point.
# The following are the same: they decode the character 'a'.
>>> block = unicode_blocks.of('a')
>>> block2 = unicode_blocks.of(b'\x61')
>>> block3 = unicode_blocks.of(97)
>>> assert block == block2 == block3
# To get Unicode block using name, input the block name.
# Case, whitespace, dashes, underscores and the prefix "is" are ignored for comparison. See UAX44-LM3.
# Block name aliases from PropertyValueAliases.txt are also usable here.
>>> ascii_block = unicode_blocks.for_name("BASIC_LATIN")
>>> ascii_block2 = unicode_blocks.for_name("basiclatin")
>>> ascii_block3 = unicode_blocks.for_name("isBasicLatin")
>>> from unicode_blocks import BASIC_LATIN
>>> assert ascii_block == ascii_block2 == ascii_block3 == BASIC_LATIN
>>> if unicode_major_version >= 6:
...     ascii_block4 = unicode_blocks.for_name("ASCII")
...     assert ascii_block4 == BASIC_LATIN
# Code points that do not belong to any block return the NO_BLOCK object, as per
# rule D10b in Section 3.4, *Characters and Encoding*, of the Unicode Standard
>>> assert unicode_blocks.of(0xEDCBA) == unicode_blocks.NO_BLOCK
# Iterate over all Unicode blocks defined in this Unicode version.
# NO_BLOCK is not included in the list of all blocks.
>>> for block in unicode_blocks.all():
...     print(block) # doctest: +ELLIPSIS
UnicodeBlock(...)
# Pythonic helpers: blocks can be compared, with earlier blocks ordering before later blocks,
# which is useful for sorting a list of UnicodeBlocks
>>> latin1_block = unicode_blocks.for_name("Latin-1 Supplement")
>>> assert ascii_block < latin1_block
# Get the total number of code points in a block's range, regardless of how many of them are assigned.
>>> assert len(ascii_block) == 128
# Additional helpers: check for assigned characters in the block
# Data is loaded from UCD and may change between Unicode versions
>>> assert len(ascii_block.assigned_ranges) == 128
>>> assert 'B' in ascii_block.assigned_ranges
# Example where defined Unicode block range is not fully utilised
>>> bopo_block = unicode_blocks.of('ㄅ')
>>> assert len(bopo_block) == 48
>>> bopo_assigned_count = 41 if unicode_major_version < 10 else 42 if unicode_major_version == 10 else 43
>>> assert len(bopo_block.assigned_ranges) == bopo_assigned_count # first 5 code points should be unassigned, at least in <=17.0
>>> assert len(bopo_block) != len(bopo_block.assigned_ranges)
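The gap between a block's defined range and its assigned code points can also be observed with the standard library alone. The sketch below does not use unicode_blocks; it hard-codes the Bopomofo range (U+3100..U+312F) and relies on `unicodedata`, so the assigned count reflects the Unicode version bundled with the running Python interpreter rather than this module's data.

```python
import unicodedata

# Bopomofo block range: U+3100..U+312F.
BOPOMOFO_FIRST, BOPOMOFO_LAST = 0x3100, 0x312F

block_size = BOPOMOFO_LAST - BOPOMOFO_FIRST + 1

# General category "Cn" marks code points that are unassigned
# in the interpreter's Unicode data.
assigned = [
    cp
    for cp in range(BOPOMOFO_FIRST, BOPOMOFO_LAST + 1)
    if unicodedata.category(chr(cp)) != "Cn"
]

print(unicodedata.unidata_version)  # Unicode version bundled with Python
print(block_size)                   # 48
print(len(assigned))                # depends on the bundled Unicode version
print(f"U+{assigned[0]:04X}")       # first assigned code point in the block
```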
The Unicode block objects are available directly in the package namespace or under the blocks module.
# both are equivalent
>>> from unicode_blocks import BASIC_LATIN
>>> from unicode_blocks.blocks import BASIC_LATIN
Various names are also available in the block:
>>> from unicode_blocks import BASIC_LATIN
>>> assert BASIC_LATIN.name == "Basic Latin" # Official Unicode name as in Blocks.txt
>>> assert BASIC_LATIN.normalised_name == "BASICLATIN" # Normalised name under UAX44-LM3
>>> assert BASIC_LATIN.variable_name == "BASIC_LATIN" # Variable name in `unicode_blocks.blocks`
>>> if unicode_major_version >= 6:
...     assert BASIC_LATIN.aliases == ["ASCII"] # Official block aliases as in PropertyValueAliases.txt
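The loose matching described by UAX44-LM3 can be approximated in a few lines. This is a sketch under stated assumptions, not the module's own implementation: `loose_normalise` is a hypothetical helper, and upper-casing is chosen only to mirror the "BASICLATIN" form shown above.

```python
import re


def loose_normalise(name: str) -> str:
    """Approximate UAX44-LM3 loose matching for property value names."""
    # Drop whitespace, hyphens and underscores, then fold case.
    collapsed = re.sub(r"[\s\-_]+", "", name).upper()
    # Ignore an initial "is" prefix, as long as something remains after it.
    if collapsed.startswith("IS") and len(collapsed) > 2:
        collapsed = collapsed[2:]
    return collapsed


print(loose_normalise("BASIC_LATIN"))         # BASICLATIN
print(loose_normalise("isBasicLatin"))        # BASICLATIN
print(loose_normalise("Latin-1 Supplement"))  # LATIN1SUPPLEMENT
```

Two names match when their normalised forms are equal, so the prefix stripping is harmless as long as both sides of the comparison go through the same function.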
Additional CJK utilities are also provided, modelled on the Rust version of the module. Selected samples are shown below.
>>> from unicode_blocks import cjk
>>> assert cjk.is_cjk('中')
>>> assert cjk.is_japanese_kana('あ')
>>> assert cjk.is_korean_hangul('글')
>>> assert cjk.is_cjk_punctuation('。')
>>> from unicode_blocks import blocks
>>> assert cjk.is_ideographic_block(blocks.CJK_UNIFIED_IDEOGRAPHS)
>>> assert cjk.is_cjk_block(blocks.KANGXI_RADICALS)
>>> assert cjk.is_japanese_block(blocks.KATAKANA_PHONETIC_EXTENSIONS)
>>> assert cjk.is_korean_block(blocks.HANGUL_COMPATIBILITY_JAMO)
[!WARNING]
Checking char in unicode_blocks.for_name("is_CJK") is NOT the same as cjk.is_cjk(char)!
unicode_blocks.for_name("is_CJK") refers to the "CJK" block alias for the CJK Unified Ideographs block, while cjk.is_cjk checks through (roughly) all Unicode blocks related to CJK, including kana, hangul and punctuation.
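The difference between the two checks can be illustrated without the module. The sketch below compares membership in a single block against membership in a union of CJK-related blocks; the range list is a small illustrative sample, and `in_single_block` and `in_any_cjk_block` are hypothetical names, not the module's API.

```python
# Illustrative ranges only; the module's own CJK checks cover more blocks.
CJK_UNIFIED_IDEOGRAPHS = range(0x4E00, 0xA000)  # U+4E00..U+9FFF

CJK_RELATED_SAMPLE = [
    range(0x3000, 0x3040),  # CJK Symbols and Punctuation
    range(0x3040, 0x30A0),  # Hiragana
    range(0x30A0, 0x3100),  # Katakana
    CJK_UNIFIED_IDEOGRAPHS,
    range(0xAC00, 0xD7B0),  # Hangul Syllables
]


def in_single_block(char: str) -> bool:
    """True only for characters inside the CJK Unified Ideographs block."""
    return ord(char) in CJK_UNIFIED_IDEOGRAPHS


def in_any_cjk_block(char: str) -> bool:
    """True for characters inside any of the sampled CJK-related blocks."""
    return any(ord(char) in block for block in CJK_RELATED_SAMPLE)


for char in "中あ글。":
    print(char, in_single_block(char), in_any_cjk_block(char))
```

Only 中 passes both checks; the kana, hangul and punctuation samples pass the union check alone.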
To check which version of the Unicode data is in use, read the __version__ variable in the package namespace. (Bug fix releases use the +1 notation.)
$ python3
>>> import unicode_blocks
>>> unicode_blocks.__version__ # doctest: +SKIP
'17.0.0'
The version follows the Unicode version of the data files, optionally followed by a plus sign and additional numbering from this module for bug fixes, i.e. <Unicode major.minor.patch>(+<additional numbering>).
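A version string in this format can be split into its Unicode part and the optional module suffix as sketched below. `parse_version` and `ParsedVersion` are hypothetical helpers written for illustration, and "13.0.0+1" is a made-up example value.

```python
from typing import NamedTuple, Optional


class ParsedVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    module_fix: Optional[str]  # numbering after "+", if present


def parse_version(version: str) -> ParsedVersion:
    """Split '<major>.<minor>.<patch>(+<additional numbering>)' into its parts."""
    unicode_part, _, module_fix = version.partition("+")
    major, minor, patch = (int(piece) for piece in unicode_part.split("."))
    return ParsedVersion(major, minor, patch, module_fix or None)


print(parse_version("17.0.0"))
# ParsedVersion(major=17, minor=0, patch=0, module_fix=None)
print(parse_version("13.0.0+1"))
# ParsedVersion(major=13, minor=0, patch=0, module_fix='1')
```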
Update
To update the blocks data from the Unicode Character Database (UCD), set the project.version key in pyproject.toml to the new Unicode version number, then run python3 build_blocks.py. This regenerates the src/unicode_blocks/blocks.py file, which is generated automatically from UCD data.
Most of these steps should be directly runnable through GitHub Actions.
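For readers unfamiliar with the UCD source data, the sketch below parses a few lines in the `first..last; name` format used by Blocks.txt. How build_blocks.py actually processes the file is not shown in this README, so this is an assumption-laden illustration; `parse_blocks` and the sample text are hypothetical.

```python
import re
from typing import Iterator, Tuple

# A short excerpt in the format used by Blocks.txt; the real file is much longer.
SAMPLE_BLOCKS_TXT = """\
# Blocks-style sample data
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement

3100..312F; Bopomofo
"""

_LINE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+?)\s*$")


def parse_blocks(text: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, name) for each data line, skipping comments and blanks."""
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        match = _LINE.match(line)
        if match:
            first, last, name = match.groups()
            yield int(first, 16), int(last, 16), name


for first, last, name in parse_blocks(SAMPLE_BLOCKS_TXT):
    print(f"{name}: U+{first:04X}..U+{last:04X} ({last - first + 1} code points)")
```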
Contributing
Contributions are welcome! Please follow these steps:
- Clone the repository and install in development mode:
git clone https://github.com/NightFurySL2001/unicode-blocks.git
cd unicode-blocks
pip install -e .
- Create a new branch for your feature or bug fix.
- Work on the feature and run or develop relevant test cases.
- Test the changes by running pytest.
- Ensure this README.md is up to date by running python -m doctest README.md.
- Submit a pull request with a clear description of your changes.
License
This project is licensed under the MIT License.
Acknowledgments
- Unicode Consortium for maintaining the Unicode standard and providing the Unicode Character Database (UCD). Data modifications are made under the Unicode License v3.