Unirange
Unirange is a notation for specifying multiple Unicode codepoints.
A unirange comprises comma-delimited components.
A part is a notation for a single character, like A, U+2600, or 0x7535.
It is matched by the regular expression !?(?:0x|U\+|&#x)([0-9A-F]{1,7});?|(.)
A range is two parts split by .. (two dots) or - (a hyphen).
It is matched by the regular expression (?:PART(?:-|\.\.)PART), where PART stands for the part expression above.
A component comprises either a range or a part.
It is matched by the regular expression (RANGE|PART)
The full unirange notation is matched by the regular expression (?:COMPONENT, ?)*
Exclusion can be applied to any component by prefixing it with a !.
An excluded component is subtracted from the current set of characters (set difference) instead of being added to it.
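As a rough illustration, the part expression quoted above can be tried with Python's standard re module. The names PART_RE and is_part are hypothetical and are not part of the unirange package; this sketch only exercises the regular expression as printed, not the library's own parser.

```python
import re

# The part expression as quoted above (hypothetical name, not unirange API).
PART_RE = re.compile(r"!?(?:0x|U\+|&#x)([0-9A-F]{1,7});?|(.)")


def is_part(text: str) -> bool:
    """Return True if the whole string matches the part expression."""
    return PART_RE.fullmatch(text) is not None


# Single characters and prefixed hexadecimal forms match; a bare "2600" does not.
for candidate in ["A", "U+2600", "0x7535", "&#x2600;", "!U+2605", "2600"]:
    print(f"{candidate!r}: {is_part(candidate)}")
```

For prefixed forms, the first capture group holds the hexadecimal digits; for a literal character, the second group holds the character itself.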
About
Component
A component is either a range or a part. Components define which characters the unirange includes or excludes.
Part
A part is a single character notation.
In a range, there exist two parts, split by .. or -.
In the range U+2600..U+26FF, U+2600 and U+26FF are parts.
Parts can match any of these regular expressions:
- U\+.{1,6}
- &#x.{1,6}
- 0x.{1,6}
- .
An unprefixed part that contains more than one character is invalid.
For example, 2600 is not a valid part, but U+2600 is.
Codepoints can only be written in hexadecimal; there is no way to specify one in another base.
For example, the decimal reference &#1234; is not valid.
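The rules above can be sketched as a small conversion function. This is only an illustration under the stated rules: part_to_codepoint and _PREFIXED are hypothetical names, the check against U+10FFFF is an assumption, and the unirange package may validate parts differently.

```python
import re

# Prefixed hexadecimal forms described above: U+XXXX, 0xXXXX, and &#xXXXX (optional ';').
_PREFIXED = re.compile(r"(?:U\+|0x|&#x)([0-9A-F]{1,6});?")


def part_to_codepoint(part: str) -> int:
    """Convert a single part to its integer codepoint (illustrative helper only)."""
    match = _PREFIXED.fullmatch(part)
    if match:
        codepoint = int(match.group(1), 16)
        if codepoint > 0x10FFFF:  # assumed upper limit of the Unicode character set
            raise ValueError(f"codepoint out of Unicode range: {part!r}")
        return codepoint
    if len(part) == 1:
        # An unprefixed part must be exactly one literal character.
        return ord(part)
    raise ValueError(f"invalid part: {part!r}")


print(hex(part_to_codepoint("U+2600")))
print(hex(part_to_codepoint("A")))
```

With this sketch, both 2600 and &#1234; raise ValueError, matching the two invalid examples in the text.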
Range
A range is two parts separated by .. or -.
Implied infinite expansion
If either part of the range (but not both) is absent, the range uses implied infinite expansion (IIE). With IIE, the missing boundary is implied to be the lower or upper limit of the Unicode character set.
If the first part is absent, the first part becomes U+0000. If the second part is absent, it becomes U+10FFFF. If both parts are absent, the range is invalid.
This means that the range U+2600.. will result in characters from U+2600 to U+10FFFF.
It is semantically equivalent to U+2600..U+10FFFF.
This also applies to the reverse: the range ..U+2600 will result in characters from U+0000 to U+2600.
Likewise, it is equivalent to U+0000..U+2600.
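A minimal sketch of implied infinite expansion, assuming the parts have already been converted to integer codepoints and that an absent part is represented by None. The function name expand_bounds and both constants are hypothetical and are not taken from the unirange package.

```python
from typing import Optional

MIN_CODEPOINT = 0x0000    # lower limit used when the first part is absent
MAX_CODEPOINT = 0x10FFFF  # upper limit used when the second part is absent


def expand_bounds(first: Optional[int], second: Optional[int]) -> range:
    """Return the inclusive codepoint range, filling in an absent bound."""
    if first is None and second is None:
        raise ValueError("a range needs at least one part")
    start = MIN_CODEPOINT if first is None else first
    end = MAX_CODEPOINT if second is None else second
    return range(start, end + 1)  # +1 because the upper bound is included


open_ended = expand_bounds(0x2600, None)  # like U+2600..
from_start = expand_bounds(None, 0x2600)  # like ..U+2600
print(hex(open_ended[0]), hex(open_ended[-1]))
print(hex(from_start[0]), hex(from_start[-1]))
```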
Exclusion
To exclude characters from the resulting set, prefix a component with a !.
This removes them from the set, regardless of what earlier components indicate.
For example, U+2600..U+26FF, U+2704, !U+2605 will include the codepoints from U+2600 to U+2604,
and then from U+2606 to U+26FF, as well as U+2704.
You can exclude ranges as well. Either part of a range may be prefixed with a ! to label the range as an
exclusion. !U+2600..U+267F, U+2600..!U+267F, and !U+2600..!U+267F result in the same set:
no codepoints from U+2600 to U+267F.
The order of your components matters when excluding: exclusions must come after the inclusions they apply to, or else they will be overridden.
A later component that conflicts with an earlier exclusion overrides it. For example,
!U+2600..U+2650, U+2600..U+26FF results in the effective range U+2600..U+26FF.
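The ordering rule can be illustrated with plain Python sets, assuming components are applied strictly left to right as set union (inclusion) and set difference (exclusion). The function apply_components and its tuple format are hypothetical; the library's actual implementation may differ.

```python
def apply_components(components):
    """Apply (is_exclusion, first, last) tuples left to right on a set of codepoints."""
    result = set()
    for is_exclusion, first, last in components:
        codepoints = set(range(first, last + 1))  # bounds are inclusive
        if is_exclusion:
            result -= codepoints  # exclusion: set difference
        else:
            result |= codepoints  # inclusion: set union
    return result


block = (0x2600, 0x26FF)
sub_block = (0x2600, 0x2650)

# U+2600..U+26FF, !U+2600..U+2650 : the exclusion takes effect.
include_then_exclude = apply_components([(False, *block), (True, *sub_block)])
# !U+2600..U+2650, U+2600..U+26FF : the later inclusion overrides the exclusion.
exclude_then_include = apply_components([(True, *sub_block), (False, *block)])

print(len(include_then_exclude))
print(len(exclude_then_include))
```

Under these assumptions, the first call leaves 175 codepoints (U+2651 to U+26FF) and the second leaves all 256.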
Installation
unirange is available on PyPI.
It requires Python 3.11.0 or later.
To install unirange with pip, run:
python -m pip install unirange
"externally-managed-environment"
This error occurs on some Linux distributions such as Fedora 38 and Ubuntu 23.04. It can be solved by either:
- Using a virtual environment (venv)
- Using pipx
Usage
Using unirange is simple.
>>> import unirange
>>> unirange.unirange_to_characters("A..Z")
{'G', 'D', 'I', 'K', 'X', 'J', 'V', 'O', 'H', 'C', 'A', 'B', 'Y', 'F', 'P', 'W', 'L', 'M', 'R', 'S', 'E', 'T', 'Z', 'N', 'U', 'Q'}
>>> unirange.unirange_to_characters("..0")
{'\x19', '0', '\x1c', '#', '\x14', '\x0c', '\x01', '\x0e', '\r', '\t', '+', '.', '%', '\x18', '\x15', '\x12', '\x16', '\x05', '!', '\x1b', '/', '\x17', '\x0b', '&', '\x1d', '\n', '\x1e', '\x10', '"', "'", '\x04', '\x1a', '(', ' ', '\x08', '\x07', '\x03', ')', '\x1f', '\x02', '\x13', '$', '-', '\x11', ',', '\x00', '*', '\x06', '\x0f'}
>>> unirange.unirange_to_characters("U+2600..U+26FF, !U+2610..")
{'☀', '☁', '☂', '☃', '☄', '★', '☆', '☇', '☈', '☉', '☊', '☋', '☌', '☍', '☎', '☏'}
>>> unirange.unirange_to_characters("U+2600....")
unirange.UnirangeError: Invalid unirange notation: U+2600....
>>> unirange.unirange_to_characters("U+2600..U+10000")
{'\ue548', '\uea1f', ...}
It can also be used from the command line:
$ python -m unirange U+2600..U+2610
☀ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐
$ python -m unirange U+2600
☀
$ python -m unirange 'U+2600..,!U+2650..'
☀ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☔ ☕ ☖ ☗ ☘ ☙ ☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧ ☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☰ ☱ ☲ ☳ ☴ ☵ ☶ ☷ ☸ ☹ ☺ ☻ ☼ ☽ ☾ ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇ ♈ ♉ ♊ ♋ ♌ ♍ ♎ ♏
For some uniranges, you may need to wrap the argument in single quotes ('), or else the shell will interpret them oddly:
$ python -m unirange U+2600..,!U+2650..
bash: !U+2650..: event not found
$ python -m unirange 'U+2600..,!U+2650..'  # Works as expected.
Changelog
The changelog is at CHANGELOG.md.
License
unirange is licensed under the MIT license.