Simple and efficient Python data types for URIs and IRIs

Project description

XRI

XRI is a small Python library for efficient and RFC-correct representation of URIs and IRIs. It is currently a work in progress and, as such, is not recommended for production environments.

The generic syntax for URIs is defined in RFC 3986. RFC 3987, the IRI specification, extends this syntax to allow characters outside the ASCII range. The URI and IRI types defined in this library implement those definitions and store their constituent parts as bytes and str values respectively.

Creating a URI or IRI

To get started, pass a string value to the URI or IRI constructor. Both constructors accept either bytes or str values, and will encode or decode them as UTF-8 as required.

>>> from xri import URI
>>> uri = URI("http://alice@example.com/a/b/c?q=x#z")
>>> uri
<URI scheme=b'http' authority=URI.Authority(b'example.com', userinfo=b'alice') \
     path=URI.Path(b'/a/b/c') query=b'q=x' fragment=b'z'>
>>> uri.scheme = "https"
>>> print(uri)
https://alice@example.com/a/b/c?q=x#z

Component parts

Each URI or IRI object is fully mutable, allowing any component part to be read, set, or deleted. The following component parts are available:

URI/IRI object
- .scheme (None or string)
- .authority (None or Authority object)
  - .userinfo (None or string)
  - .host (string)
  - .port (None, string or int)
- .path (Path object - can be used as an iterable of segment strings)
- .query (None or Query object)
- .fragment (None or string)

(The type "string" here refers to bytes or bytearray for URI objects, and str for IRI objects.)
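
As a rough illustration of how the authority sub-parts listed above relate to one another, the sketch below splits an authority string into userinfo, host and port using only the standard library. The function name split_authority is hypothetical and is not part of xri; it works on str for brevity and does no validation beyond checking for an unclosed IP literal.

```python
def split_authority(authority):
    """Split 'userinfo@host:port' into (userinfo, host, port).

    Hypothetical helper for illustration only, not the xri API.
    Missing parts are returned as None; port is left as a string.
    """
    # Everything before the last '@' is userinfo, if an '@' is present.
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None

    if hostport.startswith("["):
        # Bracketed IP literal such as '[::1]'; colons inside the
        # brackets belong to the host, not to the port separator.
        end = hostport.find("]")
        if end == -1:
            raise ValueError("unclosed IP literal in authority")
        host = hostport[:end + 1]
        rest = hostport[end + 1:]
        port = rest[1:] if rest.startswith(":") else None
    else:
        host, colon, port = hostport.partition(":")
        if not colon:
            port = None
    return userinfo, host, port


print(split_authority("alice@example.com"))
print(split_authority("example.com:8080"))
print(split_authority("[::1]:443"))
```

Returning None for an absent userinfo or port, rather than an empty string, mirrors the "None or string" convention used in the list above.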

Percent encoding and decoding

Each of the URI and IRI classes has class methods called pct_encode and pct_decode. These operate slightly differently depending on the class, as each keeps a slightly different set of characters "safe" during encoding.

>>> from xri import URI, IRI
>>> URI.pct_encode("abc/def")
'abc%2Fdef'
>>> URI.pct_encode("abc/def", safe="/")
'abc/def'
>>> URI.pct_encode("20% of $125 is $25")
'20%25%20of%20%24125%20is%20%2425'
>>> URI.pct_encode("20% of £125 is £25")                        # '£' is encoded with UTF-8
'20%25%20of%20%C2%A3125%20is%20%C2%A325'
>>> IRI.pct_encode("20% of £125 is £25")                        # '£' is safe within an IRI
'20%25%20of%20£125%20is%20£25'
>>> URI.pct_decode('20%25%20of%20%C2%A3125%20is%20%C2%A325')    # str in, str out (using UTF-8)
'20% of £125 is £25'
>>> URI.pct_decode(b'20%25%20of%20%C2%A3125%20is%20%C2%A325')   # bytes in, bytes out (no UTF-8)
b'20% of \xc2\xa3125 is \xc2\xa325'

Safe characters (passed in via the safe argument) can only be drawn from the set below, which corresponds to the reserved characters (gen-delims and sub-delims) of RFC 3986. Any other character passed to this argument raises a ValueError.

! # $ & ' ( ) * + , / : ; = ? @ [ ]
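
To make the encoding rules above concrete, here is a minimal pure-Python sketch of URI-style percent encoding with a restricted safe argument. It is not the xri implementation: the name pct_encode_sketch is hypothetical, and the assumption that the RFC 3986 unreserved characters (letters, digits, "-", ".", "_", "~") are always left untouched is taken from the RFC rather than from the library's source.

```python
import string

# RFC 3986 "unreserved" characters: assumed here to never need encoding.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")
# The set of characters permitted in the safe argument (listed above).
RESERVED = frozenset("!#$&'()*+,/:;=?@[]")


def pct_encode_sketch(value, safe=""):
    """Percent-encode a str, UTF-8 encoding it first (illustrative only)."""
    unsupported = set(safe) - RESERVED
    if unsupported:
        raise ValueError(f"unsupported safe characters: {sorted(unsupported)!r}")
    keep = UNRESERVED | set(safe)
    out = []
    for byte in value.encode("utf-8"):
        char = chr(byte)
        # Only ASCII bytes can be kept; all others become %XX escapes.
        if byte < 128 and char in keep:
            out.append(char)
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


print(pct_encode_sketch("abc/def"))
print(pct_encode_sketch("abc/def", safe="/"))
print(pct_encode_sketch("20% of £125 is £25"))
```

An IRI-flavoured variant would differ mainly in the else branch, leaving permitted non-ASCII characters as they are instead of escaping their UTF-8 bytes.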

Advantages over the built-in urllib.parse module

Correct handling of character encodings

RFC 3986 specifies that extended characters (beyond the ASCII range) are not supported directly within URIs. When used, they should always be encoded with UTF-8 and then percent encoded. IRIs (defined in RFC 3987), however, do allow such characters.

urllib.parse does not enforce this behaviour, and does not accept UTF-8 encoded bytes containing non-ASCII characters as input values.

>>> from urllib.parse import urlparse
>>> urlparse("https://example.com/ä").path
'/ä'
>>> urlparse("https://example.com/ä".encode("utf-8")).path
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 20: ordinal not in range(128)

Conversely, xri handles these scenarios correctly according to the RFCs.

>>> URI("https://example.com/ä").path
URI.Path(b'/%C3%A4')
>>> URI("https://example.com/ä".encode("utf-8")).path
URI.Path(b'/%C3%A4')
>>> IRI("https://example.com/ä").path
IRI.Path('/ä')
>>> IRI("https://example.com/ä".encode("utf-8")).path
IRI.Path('/ä')
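
The RFC 3986 rule described in this section (UTF-8 first, then percent encoding) can be approximated with the standard library alone. The sketch below is an illustration under stated assumptions, not how xri works internally: to_uri_path is a hypothetical helper, and it relies on urllib.parse.quote, which would re-encode any "%" already present in the input.

```python
from urllib.parse import quote


def to_uri_path(path):
    """Return an ASCII-only bytes path from str or UTF-8 bytes input.

    Hypothetical helper for illustration; existing percent escapes in
    the input are not preserved, because quote() encodes '%' as '%25'.
    """
    if isinstance(path, (bytes, bytearray)):
        # Treat byte input as UTF-8, mirroring the behaviour described above.
        path = bytes(path).decode("utf-8")
    # quote() UTF-8 encodes the text and escapes every byte that is not an
    # ASCII letter, digit, one of "_.-~", or listed in `safe`.
    return quote(path, safe="/").encode("ascii")


print(to_uri_path("/ä"))
print(to_uri_path("/ä".encode("utf-8")))
```

Both calls produce the same ASCII-only result, which is the property the xri examples above demonstrate for str and bytes input.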

Optional components may be empty

Optional URI components, such as query and fragment, are allowed to be present but empty, according to RFC 3986. As such, there is a semantic difference between an empty component and a missing component. When the URI is composed, the difference is denoted by the presence or absence of the component's marker character ('?' in the case of the query component).

The urlparse function does not distinguish between empty and missing components; both are treated as "missing".

>>> urlparse("https://example.com/a").geturl()
'https://example.com/a'
>>> urlparse("https://example.com/a?").geturl()
'https://example.com/a'

xri, on the other hand, correctly distinguishes between these cases:

>>> str(URI("https://example.com/a"))
'https://example.com/a'
>>> str(URI("https://example.com/a?"))
'https://example.com/a?'
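
The distinction between empty and missing components can also be seen with the regular expression given in Appendix B of RFC 3986, together with the recomposition steps from section 5.3 of the same RFC. The sketch below uses only the standard library; split_uri and compose_uri are hypothetical names chosen for this example and are not part of xri.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(text):
    """Split a URI reference into its five components.

    A component that is absent is None; one that is present but
    empty is the empty string.
    """
    match = URI_REFERENCE.match(text)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


def compose_uri(parts):
    """Recompose components following RFC 3986, section 5.3."""
    result = ""
    if parts["scheme"] is not None:
        result += parts["scheme"] + ":"
    if parts["authority"] is not None:
        result += "//" + parts["authority"]
    result += parts["path"]
    if parts["query"] is not None:
        result += "?" + parts["query"]
    if parts["fragment"] is not None:
        result += "#" + parts["fragment"]
    return result


print(split_uri("https://example.com/a")["query"])
print(repr(split_uri("https://example.com/a?")["query"]))
print(compose_uri(split_uri("https://example.com/a?")))
```

Because the marker character is written whenever a component is not None, an empty query survives a split and compose round trip, whereas treating empty and missing alike would drop the trailing '?'.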

Project details

Release history

- 0.7.1 (Oct 31, 2023), this version
- 0.7.0 (Oct 31, 2023)
- 0.0.0 (Aug 16, 2023)

Download files

Source Distribution

xri-0.7.1.tar.gz (17.6 kB), uploaded Oct 31, 2023

Hashes for xri-0.7.1.tar.gz
Algorithm	Hash digest
SHA256	`3c78572bd9d1622af1d29beacc8bee597d51d8e03c93d21551392e045ebd2006`
MD5	`fd9cff155f716c9a8fb332a5a20283fc`
BLAKE2b-256	`6a282064400e7f5644cceadb07d7f1b2371b8c7c453190f1f0350dc474e88960`
