Unirange

Unirange is a notation for specifying multiple Unicode codepoints.

A unirange comprises comma-delimited components.

A part is a notation for a single character, like A, U+2600, or 0x7535. It is matched by the regular expression !?(?:0x|U\+|&#x)([0-9A-F]{1,7});?|(.)

A range is two parts split by .. (two dots) or - (a hyphen). It is matched by the regular expression (?:PART(?:-|\.\.)PART)

A component comprises either a range or a part. It is matched by the regular expression (RANGE|PART)

The full unirange notation is matched by the regular expression (?:COMPONENT, ?)*

Exclusion can be applied to any component by prefixing it with a !. Instead of adding its characters, an excluded component removes them from the current set of characters (set difference).
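
As a rough illustration, the expressions above can be composed in Python with the standard re module. This is a sketch, not the library's own code: the names PART, RANGE, COMPONENT, and is_component are assumptions made here, the capture groups are dropped, and the part expression is wrapped in a group so that its alternation stays contained when it is embedded in a range. Ranges with an absent part are not covered by these expressions as written.

```python
import re

# Hypothetical names; the patterns follow the expressions quoted above.
PART = r"(?:!?(?:0x|U\+|&#x)[0-9A-F]{1,7};?|.)"
RANGE = rf"(?:{PART}(?:-|\.\.){PART})"
COMPONENT = rf"(?:{RANGE}|{PART})"


def is_component(text: str) -> bool:
    """Return True if the whole string matches the component pattern."""
    return re.fullmatch(COMPONENT, text) is not None


if __name__ == "__main__":
    for candidate in ["A", "U+2600", "U+2600..U+26FF", "A-Z", "2600"]:
        print(candidate, is_component(candidate))
```

Under these patterns, 2600 is rejected because an unprefixed part may only be a single character.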


📄 About

Component

A component is either a range, or a part. These components define what characters are included or excluded by the unirange.

Part

A part is a single character notation. In a range, there exist two parts, split by .. or -. In the range U+2600..U+26FF, U+2600 and U+26FF are parts.

Parts can match any of these regular expressions:

  • U\+.{1,6}
  • &#x.{1,6}
  • 0x.{1,6}
  • .

If more than one character is in a part, and it is not prefixed, it is invalid. For example, 2600 is not a valid part, but U+2600 is.

There is no way to specify a codepoint in a base other than hexadecimal. For example, the decimal form &#1234 is not valid.
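
The rules above can be sketched as a small function that turns a part into a code point. The function name part_to_codepoint and the PREFIXES tuple are hypothetical and are not taken from the unirange source; the sketch also ignores the ! exclusion prefix.

```python
# Prefixes that mark a hexadecimal code point, per the list above.
PREFIXES = ("U+", "&#x", "0x")


def part_to_codepoint(part: str) -> int:
    """Convert a single part to its integer code point (illustrative only)."""
    for prefix in PREFIXES:
        if part.startswith(prefix):
            # Drop the prefix and an optional trailing ';', then parse as hex.
            digits = part[len(prefix):].removesuffix(";")
            return int(digits, 16)
    if len(part) == 1:
        # An unprefixed part must be exactly one character.
        return ord(part)
    raise ValueError(f"invalid part: {part!r}")


if __name__ == "__main__":
    print(hex(part_to_codepoint("U+2600")))
    print(hex(part_to_codepoint("A")))
```

With this sketch, 2600 and &#1234 both raise ValueError, since neither is prefixed with a hexadecimal marker and neither is a single character.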

Range

A range is two parts separated by .. or -.

Implied infinite expansion

If either (but not both) part of the range is absent, it is called implied infinite expansion (IIE). With IIE, the missing boundary is implied to be the lower or upper limit of the Unicode character set.

If the first part is absent, the first part becomes U+0000. If the second part is absent, it becomes U+10FFFF. If both parts are absent, the range is invalid.

This means that the range U+2600.. will result in characters from U+2600 to U+10FFFF. It is semantically equivalent to U+2600..U+10FFFF.

This also applies to the reverse: the range ..U+2600 will result in characters from U+0000 to U+2600. Likewise, it is equivalent to U+0000..U+2600.
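
A minimal sketch of how the missing boundary could be filled in follows. The name expand_bounds and the two constants are assumptions for illustration; for brevity the sketch handles only the .. separator and U+ prefixed parts, and it is not the library's implementation.

```python
UNICODE_MIN = 0x0000
UNICODE_MAX = 0x10FFFF


def expand_bounds(range_text: str) -> tuple[int, int]:
    """Return (low, high) for a range, applying implied infinite expansion."""
    first, separator, second = range_text.partition("..")
    if not separator:
        raise ValueError(f"not a range: {range_text!r}")
    if not first and not second:
        # Both parts absent: the range is invalid.
        raise ValueError("both parts of the range are absent")
    # An absent part falls back to the matching limit of the Unicode range.
    low = int(first.removeprefix("U+"), 16) if first else UNICODE_MIN
    high = int(second.removeprefix("U+"), 16) if second else UNICODE_MAX
    return low, high


if __name__ == "__main__":
    print(expand_bounds("U+2600.."))
    print(expand_bounds("..U+2600"))
```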

Exclusion

To exclude a character from the resulting set, prefix a component with a !. This removes it from the set, regardless of what earlier components include.

For example, U+2600..U+26FF, U+2704, !U+2605 will include the codepoints from U+2600 up to U+2604, and then from U+2606 to U+26FF, as well as U+2704.

You can exclude ranges as well. Either part of a range may be prefixed with a ! to label the range as an exclusion. !U+2600..U+267F, U+2600..!U+267F, and !U+2600..!U+267F result in the same range: no codepoints from U+2600 to U+267F.

Exclusions must come after the inclusions, or else they will be overridden.

The order of your components matters when excluding. A later component that conflicts with an earlier exclusion overrides it. For example, !U+2600..U+2650,U+2600..U+26FF will result in the effective range U+2600..U+26FF.
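
The ordering rule can be pictured with plain Python sets: components are applied left to right, inclusions as a union and exclusions as a difference. This is a sketch under that assumption; apply_components and its (excluded, low, high) tuples are hypothetical and stand in for already-parsed components.

```python
def apply_components(components: list[tuple[bool, int, int]]) -> set[int]:
    """Apply (excluded, low, high) components in order and return code points."""
    result: set[int] = set()
    for excluded, low, high in components:
        codepoints = set(range(low, high + 1))
        if excluded:
            result -= codepoints  # exclusion: set difference
        else:
            result |= codepoints  # inclusion: set union
    return result


if __name__ == "__main__":
    # U+2600..U+26FF, U+2704, !U+2605
    kept = apply_components([
        (False, 0x2600, 0x26FF),
        (False, 0x2704, 0x2704),
        (True, 0x2605, 0x2605),
    ])
    print(0x2605 in kept, 0x2704 in kept)

    # !U+2600..U+2650,U+2600..U+26FF: the later inclusion overrides the exclusion.
    overridden = apply_components([
        (True, 0x2600, 0x2650),
        (False, 0x2600, 0x26FF),
    ])
    print(len(overridden))
```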


📦 Installation

unirange is available on PyPI. It requires a Python version of at least 3.11.0.

To install unirange with pip, run:

python -m pip install unirange

"externally-managed-environment"

This error occurs on some Linux distributions such as Fedora 38 and Ubuntu 23.04. It can be solved by either:

  1. Using a virtual environment (venv)
  2. Using pipx

🛠 Usage

Using unirange is simple.

>>> import unirange
>>> unirange.unirange_to_characters("A..Z")
{'G', 'D', 'I', 'K', 'X', 'J', 'V', 'O', 'H', 'C', 'A', 'B', 'Y', 'F', 'P', 'W', 'L', 'M', 'R', 'S', 'E', 'T', 'Z', 'N', 'U', 'Q'}

>>> unirange.unirange_to_characters("..0")
{'\x19', '0', '\x1c', '#', '\x14', '\x0c', '\x01', '\x0e', '\r', '\t', '+', '.', '%', '\x18', '\x15', '\x12', '\x16', '\x05', '!', '\x1b', '/', '\x17', '\x0b', '&', '\x1d', '\n', '\x1e', '\x10', '"', "'", '\x04', '\x1a', '(', ' ', '\x08', '\x07', '\x03', ')', '\x1f', '\x02', '\x13', '$', '-', '\x11', ',', '\x00', '*', '\x06', '\x0f'}

>>> unirange.unirange_to_characters("U+2600..U+26FF, !U+2610..")
{'☌', '☁', '☂', '☉', '☍', '☋', '☀', '☄', '☃', '☈', '☆', '☊', '☇', '★', '☏', '☎'}

>>> unirange.unirange_to_characters("U+2600....")
unirange.UnirangeError: Invalid unirange notation: U+2600....

>>> unirange.unirange_to_characters("U+2600..U+10000")
{'\ue548', '\uea1f', ...}

It can also be used from the command line:

$ python -m unirange U+2600..U+2610
☀ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐
$ python -m unirange U+2600
☀
$ python -m unirange 'U+2600..,!U+2650..'
☀ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊ ☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☔ ☕ ☖ ☗ ☘ ☙ ☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧ ☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☰ ☱ ☲ ☳ ☴ ☵ ☶ ☷ ☸ ☹ ☺ ☻ ☼ ☽ ☾ ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇ ♈ ♉ ♊ ♋ ♌ ♍ ♎ ♏

For some uniranges, you may need to wrap the argument in single quotes (') or else the shell will interpret it oddly:

$ python -m unirange U+2600..,!U+2650..
bash: !U+2650..: event not found
$ python -m unirange 'U+2600..,!U+2650..'
# Works as expected.
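
When the command line is built from Python rather than typed by hand, the standard library's shlex.quote can add the single quotes when they are needed. The helper name build_command is hypothetical; only the python -m unirange invocation comes from the examples above.

```python
import shlex


def build_command(unirange_text: str) -> str:
    """Build a shell command string with the unirange safely quoted."""
    return f"python -m unirange {shlex.quote(unirange_text)}"


if __name__ == "__main__":
    # The '!' forces quoting; a unirange without special characters is left as-is.
    print(build_command("U+2600..,!U+2650.."))
    print(build_command("U+2600..U+2610"))
```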

📰 Changelog

The changelog is at CHANGELOG.md.


📜 License

unirange is licensed under the MIT license.

This version

1.0
