Unirange
Unirange is a notation for specifying multiple Unicode codepoints.
A unirange comprises comma-delimited components.
A part is a notation for a single character, like A, U+2600, or 0x7535.
It is matched by the regular expression !?(?:0x|U\+|&#x)([0-9A-F]{1,7});?|(.)
A range is two parts split by .. (two dots) or - (a hyphen).
It is matched by the regular expression (?:PART(?:-|\.\.)PART)
A component comprises either a range or a part.
It is matched by the regular expression (RANGE|PART)
The full unirange notation is matched by the regular expression (?:COMPONENT, ?)*
Exclusion can be applied to any component by prefixing it with ! (an exclamation mark).
Instead of adding the component's characters, this subtracts them (set difference) from the current set of characters.
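As a rough illustration of the grammar above, the sketch below splits a unirange into its comma-delimited components and records which ones are exclusions. The function name split_components is hypothetical and is not part of the unirange package; the sketch also assumes that a literal comma is never used as a single-character part.

```python
def split_components(unirange: str) -> list[tuple[str, bool]]:
    """Split a unirange into (component, is_exclusion) pairs.

    Hypothetical helper for illustration only; not the library's parser.
    """
    pairs = []
    for raw in unirange.split(","):
        component = raw.strip()
        if not component:
            continue  # tolerate a trailing comma
        # A lone "!" is treated as the literal character, not as a prefix.
        is_exclusion = len(component) > 1 and component.startswith("!")
        body = component[1:] if is_exclusion else component
        pairs.append((body, is_exclusion))
    return pairs


print(split_components("U+2600..U+26FF, U+2704, !U+2605"))
# [('U+2600..U+26FF', False), ('U+2704', False), ('U+2605', True)]
```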
About
Component
A component is either a range or a part. Components define which characters are included in or excluded from the unirange.
Part
A part is a single character notation.
In a range, there exist two parts, split by .. or -.
In the range U+2600..U+26FF, U+2600 and U+26FF are parts.
Parts can match any of these regular expressions:
- U\+.{1,6}
- &#x.{1,6}
- 0x.{1,6}
- . (any single character)
If more than one character is in a part, and it is not prefixed, it is invalid.
For example, 2600 is not a valid part, but U+2600 is.
There is no way to specify a codepoint in a base other than hexadecimal.
For example, the decimal reference &#1234; is not valid.
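The rules for parts can be sketched with the part pattern quoted in the introduction. The helper name part_to_codepoint is hypothetical, and the upper-bound check is an added assumption (the pattern alone allows up to seven hex digits); this is not necessarily how the unirange package resolves parts internally.

```python
import re

# The part pattern quoted in the introduction, used verbatim.
PART_PATTERN = re.compile(r"!?(?:0x|U\+|&#x)([0-9A-F]{1,7});?|(.)")

MAX_CODEPOINT = 0x10FFFF  # upper limit of the Unicode codespace


def part_to_codepoint(part: str) -> int:
    """Resolve a single part to its codepoint (illustrative sketch)."""
    match = PART_PATTERN.fullmatch(part)
    if match is None:
        raise ValueError(f"invalid part: {part!r}")
    hex_digits, literal = match.groups()
    if hex_digits is None:
        return ord(literal)  # unprefixed single character
    codepoint = int(hex_digits, 16)  # prefixed hexadecimal form
    if codepoint > MAX_CODEPOINT:
        raise ValueError(f"codepoint out of range: {part!r}")
    return codepoint


for text in ("A", "U+2600", "0x7535", "&#x2600;"):
    print(text, hex(part_to_codepoint(text)))
```

With this sketch, unprefixed multi-character text such as 2600 and decimal references such as &#1234; raise ValueError, matching the rules described above.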
Range
A range is two parts separated by .. or -.
Implied infinite expansion
If either (but not both) part of the range is absent, it is called implied infinite expansion (IIE). With IIE, the absent boundary is implied to be the lower or upper limit of the Unicode character set.
If the first part is absent, the first part becomes U+0000. If the second part is absent, it becomes U+10FFFF. If both parts are absent, the range is invalid.
This means that the range U+2600.. will result in characters from U+2600 to U+10FFFF.
It is semantically equivalent to U+2600..U+10FFFF.
This also applies to the reverse: the range ..U+2600 will result in characters from U+0000 to U+2600.
Likewise, it is equivalent to U+0000..U+2600.
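Implied infinite expansion can be sketched as filling in a default for whichever boundary is missing. The names range_bounds and parse_part are hypothetical, and the splitting logic is a simplification (for instance, it does not handle a literal hyphen used as a part in a hyphen-separated range); it is not the library's own implementation.

```python
import re

MIN_CODEPOINT = 0x0000
MAX_CODEPOINT = 0x10FFFF

# Prefixed hexadecimal forms: U+2600, 0x2600, &#x2600;
_PREFIXED = re.compile(r"(?:0x|U\+|&#x)([0-9A-F]{1,6});?")


def parse_part(text: str, default: int) -> int:
    """Return the codepoint for a part, or the default if it is absent."""
    if text == "":
        return default  # implied infinite expansion
    match = _PREFIXED.fullmatch(text)
    if match:
        return int(match.group(1), 16)
    if len(text) == 1:
        return ord(text)
    raise ValueError(f"invalid part: {text!r}")


def range_bounds(text: str) -> tuple[int, int]:
    """Resolve a range to inclusive (first, last) codepoints."""
    first, sep, second = text.partition("..")
    if not sep:
        first, sep, second = text.partition("-")
    if not sep:
        raise ValueError(f"not a range: {text!r}")
    if first == "" and second == "":
        raise ValueError("both parts of the range are absent")
    return parse_part(first, MIN_CODEPOINT), parse_part(second, MAX_CODEPOINT)


print(range_bounds("U+2600.."))  # (9728, 1114111)
print(range_bounds("..U+2600"))  # (0, 9728)
```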
Exclusion
To exclude a character from being included in a resulting range, prefix a component with a ! (an exclamation mark).
This will prevent it from being included in the range, regardless of what other parts indicate.
For example, U+2600..U+26FF, U+2704, !U+2605 will include the codepoints from U+2600 to U+2604, then from U+2606 to U+26FF, as well as U+2704.
You can exclude ranges as well. Either part of a range may be prefixed with a ! to label the range as an exclusion. !U+2600..U+267F, U+2600..!U+267F, and !U+2600..!U+267F result in the same range: no codepoints from U+2600 to U+267F.
Exclusions must come after the inclusions, or else they will be overridden.
The order of your components matters when excluding. Components after an exclusion that conflict with it will obsolete it, overriding it. For example,
!U+2600..U+2650, U+2600..U+26FF will result in the effective range U+2600..U+26FF.
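The ordering rule follows naturally if components are applied left to right as set operations: inclusions are unions and exclusions are differences. The sketch below works on already-resolved codepoint triples; apply_components is a hypothetical name, and this models the behaviour described here rather than the package's actual code.

```python
def apply_components(components):
    """Apply (is_exclusion, first, last) codepoint triples in order."""
    selected: set[int] = set()
    for is_exclusion, first, last in components:
        span = set(range(first, last + 1))
        if is_exclusion:
            selected -= span  # difference: remove these codepoints
        else:
            selected |= span  # union: add these codepoints
    return selected


# U+2600..U+26FF, U+2704, !U+2605 (exclusion last, so it takes effect)
late_exclusion = apply_components([
    (False, 0x2600, 0x26FF),
    (False, 0x2704, 0x2704),
    (True, 0x2605, 0x2605),
])

# !U+2600..U+2650, U+2600..U+26FF (exclusion first, so it is overridden)
early_exclusion = apply_components([
    (True, 0x2600, 0x2650),
    (False, 0x2600, 0x26FF),
])

print(0x2605 in late_exclusion)  # False
print(early_exclusion == set(range(0x2600, 0x2700)))  # True
```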
Installation
unirange is available on PyPI.
It requires a Python version of at least 3.11.0.
To install unirange with pip, run:
python -m pip install unirange
"externally-managed-environment"
This error occurs on some Linux distributions such as Fedora 38 and Ubuntu 23.04. It can be solved by either:
- Using a virtual environment (venv)
- Using pipx
Usage
Using unirange is simple.
>>> import unirange
>>> unirange.unirange_to_characters("A..Z")
{'G', 'D', 'I', 'K', 'X', 'J', 'V', 'O', 'H', 'C', 'A', 'B', 'Y', 'F', 'P', 'W', 'L', 'M', 'R', 'S', 'E', 'T', 'Z', 'N', 'U', 'Q'}
>>> unirange.unirange_to_characters("..0")
{'\x19', '0', '\x1c', '#', '\x14', '\x0c', '\x01', '\x0e', '\r', '\t', '+', '.', '%', '\x18', '\x15', '\x12', '\x16', '\x05', '!', '\x1b', '/', '\x17', '\x0b', '&', '\x1d', '\n', '\x1e', '\x10', '"', "'", '\x04', '\x1a', '(', ' ', '\x08', '\x07', '\x03', ')', '\x1f', '\x02', '\x13', '$', '-', '\x11', ',', '\x00', '*', '\x06', '\x0f'}
>>> unirange.unirange_to_characters("U+2600..U+26FF, !U+2610..")
{'☀', '☁', '☂', '☃', '☄', '★', '☆', '☇', '☈', '☉', '☊', '☋', '☌', '☍', '☎', '☏'}
>>> unirange.unirange_to_characters("U+2600....")
unirange.UnirangeError: Invalid unirange notation: U+2600....
>>> unirange.unirange_to_characters("U+2600..U+10000")
{'\ue548', '\uea1f', ...}
It can also be used from the command line:
$ python -m unirange U+2600..U+2610
☀ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐
$ python -m unirange U+2600
☀
$ python -m unirange 'U+2600..,!U+2650..'
☀ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☔ ☕ ☖ ☗ ☘ ☙ ☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧ ☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☰ ☱ ☲ ☳ ☴ ☵ ☶ ☷ ☸ ☹ ☺ ☻ ☼ ☽ ☾ ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇ ♈ ♉ ♊ ♋ ♌ ♍ ♎ ♏
For some uniranges, you may need to wrap the argument in single quotes ('), or else the shell will interpret it oddly:
$ python -m unirange U+2600..,!U+2650..
bash: !U+2650..: event not found
$ python -m unirange 'U+2600..,!U+2650..'  # Works as expected.
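If the command line is generated from a script, Python's standard shlex module can add the quoting where it is needed. The helper name unirange_command is hypothetical; the sketch only builds the command string and does not run it.

```python
import shlex


def unirange_command(unirange: str) -> str:
    """Build a POSIX shell command line for the unirange CLI."""
    # shlex.join quotes only the arguments that contain unsafe characters.
    return shlex.join(["python", "-m", "unirange", unirange])


print(unirange_command("U+2600..U+2610"))
# python -m unirange U+2600..U+2610
print(unirange_command("U+2600..,!U+2650.."))
# python -m unirange 'U+2600..,!U+2650..'
```

When the CLI is launched with subprocess.run and a list of arguments, no shell is involved, so no quoting is needed at all.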
Changelog
The changelog is at CHANGELOG.md.
License
unirange is licensed under the MIT license.