
Parsing and validation of URIs (RFC 3986) and IRIs (RFC 3987)

Project description

This module provides regular expressions according to RFC 3986 "Uniform
Resource Identifier (URI): Generic Syntax" and RFC 3987 "Internationalized
Resource Identifiers (IRIs)", and
utilities for composition and relative resolution of references.

Tested on python 2.7, 3.2 and 3.3. Some features require regex_.

Note for Python <= 3.2: characters beyond the Basic Multilingual Plane are not
supported on narrow builds (see issue12729).
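
As a rough illustration of the narrow-build limitation noted above, the sketch below checks how the running interpreter stores a character outside the Basic Multilingual Plane. The name `is_narrow_build` is an assumption made for this example and is not part of the module.

```python
import sys

# Narrow builds (possible before Python 3.3) cap sys.maxunicode at 0xFFFF
# and store non-BMP characters as surrogate pairs.
is_narrow_build = sys.maxunicode == 0xFFFF  # hypothetical name

old_italic_a = '\U00010300'  # OLD ITALIC LETTER A, outside the BMP

# One code point on wide builds, two code units on narrow builds.
expected_length = 2 if is_narrow_build else 1
assert len(old_italic_a) == expected_length
```

On Python 3.3 and later every build behaves like a wide build, so the check only matters for older interpreters.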


**match** (string, rule='IRI_reference')
Convenience function for checking if `string` matches a specific rule.

Returns a match object or None::

>>> assert match('%C7X', 'pct_encoded') is None
>>> assert match('%C7', 'pct_encoded')
>>> assert match('%c7', 'pct_encoded')
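
For reference, RFC 3986 defines `pct-encoded` as a percent sign followed by two hexadecimal digits. A minimal standard-library sketch of an equivalent check follows; `is_pct_encoded` is a hypothetical helper, not the module's `match`, and the module's own pattern may be written differently.

```python
import re

# pct-encoded = "%" HEXDIG HEXDIG (RFC 3986, section 2.1);
# both upper- and lower-case hex digits are accepted here.
PCT_ENCODED = re.compile(r'%[0-9A-Fa-f]{2}')


def is_pct_encoded(string):
    """Return True if the whole string is one percent-encoded octet."""
    return PCT_ENCODED.fullmatch(string) is not None


assert is_pct_encoded('%C7')
assert is_pct_encoded('%c7')
assert not is_pct_encoded('%C7X')
```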

**parse** (string, rule='IRI_reference')
Parses `string` according to `rule` into a dict of subcomponents.

If `rule` is None, parses an IRI_reference without validation.

If regex_ is available, any rule is supported; with re_, `rule` must be
'IRI_reference' or some special case thereof ('IRI', 'absolute_IRI',
'irelative_ref', 'irelative_part', 'URI_reference', 'URI', 'absolute_URI',
'relative_ref', 'relative_part'). ::

>>> d = parse('',
... rule='URI')
>>> assert all([ d['scheme'] == 'http',
... d['authority'] == '',
... d['path'] == '/html/rfc3986',
... d['query'] is None,
... d['fragment'] == 'appendix-A' ])
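
Splitting a reference without validating it can be approximated with the generic regular expression given in RFC 3986, Appendix B. The sketch below is not the module's implementation: `split_reference` is a hypothetical helper, and the URL is an illustrative placeholder.

```python
import re

# Regular expression from RFC 3986, Appendix B, with named groups added.
REFERENCE = re.compile(
    r'^(?:(?P<scheme>[^:/?#]+):)?'
    r'(?://(?P<authority>[^/?#]*))?'
    r'(?P<path>[^?#]*)'
    r'(?:\?(?P<query>[^#]*))?'
    r'(?:#(?P<fragment>.*))?'
)


def split_reference(string):
    """Split a URI reference into its five components, without validation."""
    return REFERENCE.match(string).groupdict()


d = split_reference('http://example.org/html/rfc3986#appendix-A')
assert d['scheme'] == 'http'
assert d['authority'] == 'example.org'
assert d['path'] == '/html/rfc3986'
assert d['query'] is None          # absent components come back as None
assert d['fragment'] == 'appendix-A'
```

Because every component is optional, this expression matches any string; it only separates components and says nothing about whether they are well formed.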

**compose** (\*\*parts)
Returns a URI composed_ from named parts.
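
Composition from parts can be pictured with the component recomposition steps of RFC 3986, section 5.3. The following is a sketch under that assumption, not the module's `compose`; the function name `compose_sketch` and its defaults are hypothetical.

```python
def compose_sketch(scheme=None, authority=None, path='',
                   query=None, fragment=None):
    """Recompose a URI reference from its components (RFC 3986, 5.3)."""
    parts = []
    if scheme is not None:
        parts.append(scheme + ':')
    if authority is not None:
        parts.append('//' + authority)
    parts.append(path)              # the path is always present, possibly empty
    if query is not None:
        parts.append('?' + query)
    if fragment is not None:
        parts.append('#' + fragment)
    return ''.join(parts)


assert compose_sketch(scheme='urn', path='name') == 'urn:name'
assert compose_sketch(scheme='http', authority='example.org',
                      path='/a', fragment='f') == 'http://example.org/a#f'
```

Testing against None rather than truthiness keeps an empty but present component, such as the query in `/a?`, distinct from an absent one.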

.. _composed:

**resolve** (base, uriref, strict=True, return_parts=False)
Resolves_ a `URI reference` relative to a `base` URI.

Test cases::

>>> base = resolve.test_cases_base
>>> for relative, resolved in resolve.test_cases.items():
... assert resolve(base, relative) == resolved

If `return_parts` is True, returns a dict of named parts instead of
a string.

Examples::


>>> assert resolve('urn:rootless', '../../name') == 'urn:name'
>>> assert resolve('urn:root/less', '../../name') == 'urn:/name'
>>> assert resolve('http://a/b', 'http:g') == 'http:g'
>>> assert resolve('http://a/b', 'http:g', strict=False) == 'http://a/g'
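
The two `urn:` examples above depend on how dot segments are removed once the reference path has been merged with the base path. The sketch below follows the `remove_dot_segments` routine of RFC 3986, section 5.2.4. It is an independent illustration, not the module's code.

```python
def remove_dot_segments(path):
    """Remove '.' and '..' segments from a path (RFC 3986, section 5.2.4)."""
    output = []
    while path:
        if path.startswith('../'):
            path = path[3:]
        elif path.startswith('./'):
            path = path[2:]
        elif path.startswith('/./'):
            path = path[2:]             # '/./' becomes '/'
        elif path == '/.':
            path = '/'
        elif path.startswith('/../'):
            path = path[3:]             # '/../' becomes '/'
            if output:
                output.pop()            # drop the previous segment
        elif path == '/..':
            path = '/'
            if output:
                output.pop()
        elif path in ('.', '..'):
            path = ''
        else:
            # Move the first segment (with its leading '/', if any) to output.
            end = path.find('/', 1 if path.startswith('/') else 0)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return ''.join(output)


# Merged paths for the two urn: examples above.
assert remove_dot_segments('../../name') == 'name'
assert remove_dot_segments('root/../../name') == '/name'
```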

.. _Resolves:

**patterns**
A dict of regular expressions with useful group names.
Compilable (with regex_ only) without need for any particular compilation
flag.

Alternative versions of `patterns`.
[u]nicode strings without group names for the re_ module.
BMP only for narrow builds.

**get_compiled_pattern** (rule, flags=0)
Returns a compiled pattern object for a rule name or template string.

Usage for validation::

>>> uri = get_compiled_pattern('^%(URI)s$')
>>> assert uri.match('')
>>> assert not get_compiled_pattern('^%(relative_ref)s$').match('#f#g')
>>> from unicodedata import lookup
>>> smp = 'urn:' + lookup('OLD ITALIC LETTER A') # U+00010300
>>> assert not uri.match(smp)
>>> m = get_compiled_pattern('^%(IRI)s$').match(smp)

On narrow builds, non-BMP characters are (incorrectly) excluded::

>>> assert NARROW_BUILD == (not m)
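
The template strings above use `%(rule)s` placeholders, which suggests ordinary %-formatting against a dict of pattern strings. The sketch below shows that mechanism on a miniature, hypothetical rule table; `rules` and `compile_template` are names invented for this example, and the module's internals may differ.

```python
import re

# Hypothetical miniature rule table; the real module defines many more rules.
rules = {'HEXDIG': r'[0-9A-Fa-f]'}
# '%%' yields a literal percent sign under %-formatting.
rules['pct_encoded'] = '%%%(HEXDIG)s%(HEXDIG)s' % rules


def compile_template(template, flags=0):
    """Fill '%(rule)s' placeholders from `rules`, then compile the result."""
    return re.compile(template % rules, flags)


anchored = compile_template('^%(pct_encoded)s$')
assert anchored.pattern == '^%[0-9A-Fa-f][0-9A-Fa-f]$'
assert anchored.match('%c7')
assert not anchored.match('%C7X')
```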

For parsing, some subcomponents are captured in named groups (*only if*
regex_ is available, otherwise see `parse`)::

>>> match = uri.match('')
>>> d = match.groupdict()
>>> if REGEX:
... assert all([ d['scheme'] == 'http',
... d['authority'] == '',
... d['path'] == '/html/rfc3986',
... d['query'] is None,
... d['fragment'] == 'appendix-A' ])

>>> for r in patterns.keys():
... assert get_compiled_pattern(r)

**format_patterns** (\*\*names)
Returns a dict of patterns (regular expressions) keyed by
`rule names for URIs`_ and `rule names for IRIs`_.

See also the module level dicts of patterns, and `get_compiled_pattern`.

To wrap a rule in a named capture group, pass it as keyword argument:
rule_name='group_name'. By default, the formatted patterns contain no
named groups.

Patterns are `str` instances (be it in python 2.x or 3.x) containing ASCII
characters only.
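
Wrapping a rule in a named capture group amounts to surrounding its pattern with `(?P<name>...)`. A minimal sketch of that idea follows; `wrap_in_group` is a hypothetical helper, not a function of this module, and the scheme pattern is written out by hand from the RFC 3986 grammar.

```python
import re


def wrap_in_group(pattern, group_name=None):
    """Wrap `pattern` in a named capture group, or a non-capturing one."""
    if group_name is None:
        return '(?:%s)' % pattern
    return '(?P<%s>%s)' % (group_name, pattern)


# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
scheme = r'[A-Za-z][A-Za-z0-9+.-]*'

named = wrap_in_group(scheme, 'scheme')
m = re.match(named + ':', 'urn:name')
assert m.group('scheme') == 'urn'

# Without a group name, the pattern captures nothing.
plain = re.compile(wrap_in_group(scheme))
assert dict(plain.groupindex) == {}
```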


Caveats:

- with re_, named capture groups cannot occur on multiple branches of an
  alternation

- with re_ before Python 3.3, ``\u`` and ``\U`` escapes must be
  preprocessed (see issue3665)

- on narrow builds, character ranges beyond BMP are not supported
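
The first limitation in the list above can be reproduced with the standard `re` module alone. In this sketch the group names are hypothetical, and the workaround shown (one name per branch, merged after matching) is only one possible approach.

```python
import re

# re refuses to compile a pattern that reuses a group name, even when the
# two groups sit on different branches of an alternation.
try:
    re.compile(r'(?P<host>[a-z]+)|(?P<host>[0-9]+)')
    duplicate_names_rejected = False
except re.error:
    duplicate_names_rejected = True

assert duplicate_names_rejected

# Workaround sketch: give each branch its own name and merge afterwards.
branches = re.compile(r'(?P<host_name>[a-z]+)|(?P<host_number>[0-9]+)')
m = branches.fullmatch('8080')
host = m.group('host_name') or m.group('host_number')
assert host == '8080'
```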

.. _rule names for URIs:
.. _rule names for IRIs:

What's new

version 1.3.4:

- allowed for lower case percent encoding

version 1.3.3:

- fixed a bug in `resolve` which left "../" at the beginning of some paths

version 1.3.2:

- convenience function `match`
- patterns restricted to the BMP for narrow builds
- adapted doctests for python 3.3
- compatibility with python 2.6 (thanks to Thijs Janssen)

version 1.3.1:

- some re_ compatibility: get_compiled_pattern, parse
- dropped regex_ from requirements

version 1.3.0:

- python 3.x compatibility
- format_patterns

version 1.2.1:

- compose, resolve

.. _re:
.. _regex:


Files for rfc3987, version 1.3.4: rfc3987-1.3.4.tar.gz (7.6 kB), source distribution.
