
convenience functions for working with URLs

Project description

URL related utility functions and classes.

Latest release 20260531:

  • Require html5lib and lxml and python 3 urllib modules.
  • URL.flush: clean out defined cached attributes.
  • URL: new session() context manager to make a requests.Session.
  • URL: hrefs, srcs: return a URLs collection.
  • URL: Promotable and Formatable.
  • URL: replace caching methods .GET() and .HEAD() with @cached_property .GET_response and .HEAD_response.
  • URL: new .url_parsed property being the namedtuple from urlparse, drop .parts.
  • URL: new .query_dict() method, returning the query parameters as a dict.
  • URL: new .cleanpath and .cleanrpath properties.
  • URL: new .urlto(other_url) to resolve other_url against self, use it in hrefs() and srcs().
  • URL: rename content_type to content_type_full, make content_type the plain text/html value.
  • URL: make .text a cached_property, get the soup using just lxml (the list-of-parsers approach seems unsupported).
  • URL: new .short attribute being a shortened URL for messages.
  • URL: new .ext property for the URL file extension.
  • URL: new isabs() method to test if a URL has a hostname and a path commencing with /
  • URL: support extending a URL with /

Short summary:

  • NetrcHTTPPasswordMgr: A subclass of HTTPPasswordMgrWithDefaultRealm that consults the .netrc file if no overriding credentials have been stored.
  • skip_url_errs: A version of cs.seq.skip_map which skips URLError and HTTPError.
  • strip_whitespace: Strip whitespace characters from a string, per HTML 4.01 section 1.6 and appendix E.
  • URL: Utility class to do simple stuff to URLs, subclasses str.
  • urljoin: This is urllib.parse.urljoin after coercing both arguments to str.
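
The .netrc fallback described for NetrcHTTPPasswordMgr can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the class name NetrcFallbackPasswordMgr and the netrc_path parameter are hypothetical, and it assumes the authuri passed in is a full URL.

```python
import netrc
from urllib.parse import urlparse
from urllib.request import HTTPPasswordMgrWithDefaultRealm


class NetrcFallbackPasswordMgr(HTTPPasswordMgrWithDefaultRealm):
    """Hypothetical stand-in: consult a netrc file when nothing is stored."""

    def __init__(self, netrc_path=None):
        super().__init__()
        # netrc_path=None makes netrc.netrc() read ~/.netrc
        self.netrc_path = netrc_path

    def find_user_password(self, realm, authuri):
        user, password = super().find_user_password(realm, authuri)
        if user is not None:
            # explicitly stored credentials take precedence
            return user, password
        # assumption: authuri is a full URL, so urlparse can find the host
        host = urlparse(authuri).hostname
        if host is None:
            return None, None
        try:
            entry = netrc.netrc(self.netrc_path).authenticators(host)
        except (OSError, netrc.NetrcParseError):
            # missing or unreadable netrc file: behave as "no credentials"
            return None, None
        if entry is None:
            return None, None
        login, _account, password = entry
        return login, password
```

Such a manager can be handed to urllib.request.HTTPBasicAuthHandler in the usual way; credentials added with add_password() still win over the netrc entry.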

Module contents:

  • class NetrcHTTPPasswordMgr(urllib.request.HTTPPasswordMgrWithDefaultRealm): A subclass of HTTPPasswordMgrWithDefaultRealm that consults the .netrc file if no overriding credentials have been stored.
  • skip_url_errs(func, *iterables, **skip_map_kw): A version of cs.seq.skip_map which skips URLError and HTTPError.
  • strip_whitespace(s): Strip whitespace characters from a string, per HTML 4.01 section 1.6 and appendix E.
  • class URL(cs.threads.HasThreadState, cs.lex.FormatableMixin, cs.deco.Promotable): Utility class to do simple stuff to URLs, subclasses str.

URL.__init__(self, url_s: str, referer=None, soup=None, text=None): Initialise the URL from the URL string url_s.

URL.__getattr__(self, attr): Ad hoc attributes. Upper case attributes named "FOO" parse the text and find the (sole) node named "foo". Upper case attributes named "FOOs" parse the text and find all the nodes named "foo".

URL.__truediv__(self, subpath): Return a new URL with subpath appended.
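
As an illustration of the / operator on a str subclass, here is a small sketch. MiniURL is a hypothetical class, and the joining rule (exactly one slash between the parts) is an assumption; the description above does not say how URL treats duplicate slashes or query strings.

```python
class MiniURL(str):
    """Hypothetical str subclass illustrating the `/` operator."""

    def __truediv__(self, subpath):
        # assumption: join with exactly one slash between the two parts
        return type(self)(self.rstrip("/") + "/" + str(subpath).lstrip("/"))


base = MiniURL("https://example.com/docs")
page = base / "api" / "index.html"
print(page)
```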

URL.basename: The URL basename.

URL.cleanrpath: The cleanpath with its leading slash stripped.

URL.content: The decoded URL content as a bytes object.

URL.content_length: The value of the Content-Length: header or None.

URL.content_transfer_encoding: The URL content transfer encoding.

URL.context

URL.default_limit(self): Default URLLimit for this URL: same host:port, any subpath.

URL.domain: The URL domain - the hostname with the first dotted component removed.
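
A sketch of the rule as described, using urllib.parse. The helper name domain_of is hypothetical, and returning None for a hostname without a dot is an assumption.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the hostname with its first dotted component removed."""
    hostname = urlparse(url).hostname
    if hostname is None or "." not in hostname:
        # assumption: the description does not cover this case
        return None
    return hostname.split(".", 1)[1]
```

Note that the rule is purely textual: for a hostname such as example.com it yields com.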

URL.exists(self) -> bool: Test if this URL exists via a HEAD request.

URL.ext: The URL basename file extension, as from os.path.splitext.
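
Because the extension comes from os.path.splitext, only the final suffix is returned. A minimal sketch with a hypothetical helper url_ext:

```python
import os.path
import posixpath
from urllib.parse import urlparse


def url_ext(url):
    """Return the extension of the URL path's basename, splitext-style."""
    # URL paths always use "/", so take the basename with posixpath
    basename = posixpath.basename(urlparse(url).path)
    return os.path.splitext(basename)[1]
```

With this sketch a path ending in archive.tar.gz gives ".gz", and a path ending in a slash gives the empty string.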

URL.feedparsed(self): A parse of the content via the feedparser module.

URL.find_all(self, *a, **kw): Convenience routine to call BeautifulSoup's .find_all() method.

URL.flush(self): Forget all cached content.

URL.format_kwargs(self): Return a dict for use with FormatableMixin.format_as().

URL.fragment: The URL fragment as returned by urlparse.urlparse.

URL.headers: A requests.Response headers mapping.

URL.hostname: The URL hostname as returned by urlparse.urlparse.

URL.hrefs(self, absolute=False) -> Iterable[ForwardRef('URL')]: All 'href=' values from the content HTML 'A' tags. If absolute, resolve the sources with respect to our URL.

URL.isabs(self): Test whether this URL is absolute, having a hostname and a path commencing with '/'.
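
The test can be sketched with urlparse. The function name is_absolute is hypothetical; it follows the two conditions stated above and nothing more.

```python
from urllib.parse import urlparse


def is_absolute(url):
    """True if the URL has a hostname and a path commencing with '/'."""
    parsed = urlparse(url)
    return bool(parsed.hostname) and parsed.path.startswith("/")
```

Under this reading, "https://example.com" with no path at all is not absolute, because its path is empty.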

URL.last_modified: The value of the Last-Modified: header as a UNIX timestamp, or None.
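
One way to turn a Last-Modified header into a UNIX timestamp with the standard library is shown below. The helper last_modified_timestamp is hypothetical and takes any mapping of headers; how the module itself parses the header is not stated here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def last_modified_timestamp(headers):
    """Return the Last-Modified header as a UNIX timestamp, or None."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    # HTTP dates use the RFC 5322 style understood by email.utils
    return parsedate_to_datetime(value).timestamp()


ts = last_modified_timestamp({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"})
expected = datetime(2015, 10, 21, 7, 28, tzinfo=timezone.utc).timestamp()
print(ts == expected)
```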

URL.netloc: The URL netloc as returned by urlparse.urlparse.

URL.normalised(self): Return a normalised URL where "." and ".." components have been processed.
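
A simplified sketch of processing "." and ".." components in the path. The function normalise_path_dots is hypothetical and its handling of trailing slashes and empty components is an assumption; it is not claimed to match the module's output in every case.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_path_dots(url):
    """Return url with "." and ".." path components processed."""
    parts = urlsplit(url)
    kept = []
    for component in parts.path.split("/"):
        if component == ".":
            # "." refers to the current directory: drop it
            continue
        if component == "..":
            # ".." removes the previous component, but never climbs
            # above the leading "" that represents the root
            if kept and kept[-1] != "":
                kept.pop()
            continue
        kept.append(component)
    return urlunsplit(parts._replace(path="/".join(kept)))
```

The query and fragment are passed through untouched; only the path changes.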

URL.params: The URL params as returned by urlparse.urlparse.

URL.password: The URL password as returned by urlparse.urlparse.

URL.path: The URL path as returned by urlparse.urlparse.

URL.path_elements: Return the non-empty path components; NB: a new list every time.

URL.port: The URL port as returned by urlparse.urlparse.

URL.promote(obj): Promote obj to an instance of cls. Instances of cls are passed through unchanged. str is promoted directly to cls(obj). (url,referer) is promoted to cls(url,referer=referer).

URL.query: The URL query as returned by urlparse.urlparse.

URL.query_dict(self): Return a new dict containing the parsed param=value pairs from self.query.
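
The same idea expressed with urllib.parse, as a sketch. The helper query_as_dict is hypothetical; with this approach a repeated parameter keeps only its last value and blank values are dropped, which may or may not match the module's behaviour.

```python
from urllib.parse import parse_qsl, urlparse


def query_as_dict(url):
    """Return a new dict of the param=value pairs in the URL query."""
    # parse_qsl decodes percent-escapes and "+" in names and values
    return dict(parse_qsl(urlparse(url).query))
```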

URL.resolve(self, base): Resolve this URL with respect to a base URL.

URL.rpath: The URL path as returned by urlparse.urlparse, after any leading slashes.

URL.savepath(self, rootdir): Compute a local filesystem save pathname for this URL. This scheme is designed to accommodate the fact that 'a', 'a/' and 'a/b' can all coexist. Extend any component ending in '.' with another '.'. Extend directory components with '.d.'.

URL.scheme: The URL scheme as returned by urlparse.urlparse.

URL.session(self, session=None): Context manager yielding a requests.Session.

URL.short: A shortened form of the URL for use in messages.

URL.srcs(self, *a, **kw): All 'src=' values from the content HTML. If absolute, resolve the sources with respect to our URL.

URL.unsavepath(savepath): Compute URL path component from a savepath as returned by URL.savepath. This should always round trip with URL.savepath.

URL.urlto(self, other: Union[ForwardRef('URL'), str]) -> 'URL': Return other resolved against self.baseurl. If other is an absolute URL it will not be changed.

URL.username: The URL username as returned by urlparse.urlparse.

URL.walk(self, limit=None, seen=None, follow_redirects=False): Walk a website from this URL, yielding this and all descendant URLs. Parameters:

  • limit: an object with a constraint test method "ok". If not supplied, limit URLs to the same host and port.
  • seen: a setlike object with a "contains" method and an "add" method. URLs already in the set will not be yielded or visited.
  • follow_redirects: whether to follow URL redirects.
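
The roles of limit and seen can be sketched without any network access. Everything here is illustrative: walk, SameHostLimit and the get_links callable are hypothetical names, the breadth-first order is a choice made for the sketch, the "contains" test is interpreted as Python's in operator, and follow_redirects is left out.

```python
from collections import deque
from urllib.parse import urlparse


class SameHostLimit:
    """Hypothetical default limit: accept URLs on the same host:port."""

    def __init__(self, start_url):
        self.netloc = urlparse(start_url).netloc

    def ok(self, url):
        return urlparse(url).netloc == self.netloc


def walk(start_url, get_links, limit=None, seen=None):
    """Yield start_url and every reachable URL accepted by limit."""
    if limit is None:
        limit = SameHostLimit(start_url)
    if seen is None:
        seen = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in seen or not limit.ok(url):
            # already visited, or rejected by the constraint
            continue
        seen.add(url)
        yield url
        # get_links(url) stands in for fetching the page and reading its hrefs
        queue.extend(get_links(url))


# A tiny in-memory "site" standing in for real pages.
site = {
    "https://example.com/": ["https://example.com/a", "https://other.example.org/"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": [],
}
visited = list(walk("https://example.com/", lambda u: site.get(u, [])))
```

Passing in a shared seen set lets several walks avoid revisiting each other's URLs.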

URL.xml_find_all(self, match): Convenience routine to call ElementTree.XML's .findall() method.

  • urljoin(url, other_url): This is urllib.parse.urljoin after coercing both arguments to str.
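
The coercion described for urljoin amounts to the following sketch. The names urljoin_str and Link are hypothetical; only urllib.parse.urljoin is a real library call.

```python
from urllib.parse import urljoin as _urljoin


def urljoin_str(url, other_url):
    """urllib.parse.urljoin after coercing both arguments to str."""
    return _urljoin(str(url), str(other_url))


class Link:
    """Hypothetical non-str object whose str() is a URL reference."""

    def __init__(self, href):
        self.href = href

    def __str__(self):
        return self.href


joined = urljoin_str("https://example.com/a/", Link("img.png"))
print(joined)
```

As with urllib.parse.urljoin itself, an absolute second argument is returned unchanged.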

Release Log

Release 20260531:

  • Require html5lib and lxml and python 3 urllib modules.
  • URL.flush: clean out defined cached attributes.
  • URL: new session() context manager to make a requests.Session.
  • URL: hrefs, srcs: return a URLs collection.
  • URL: Promotable and Formatable.
  • URL: replace caching methods .GET() and .HEAD() with @cached_property .GET_response and .HEAD_response.
  • URL: new .url_parsed property being the namedtuple from urlparse, drop .parts.
  • URL: new .query_dict() method, returning the query parameters as a dict.
  • URL: new .cleanpath and .cleanrpath properties.
  • URL: new .urlto(other_url) to resolve other_url against self, use it in hrefs() and srcs().
  • URL: rename content_type to content_type_full, make content_type the plain text/html value.
  • URL: make .text a cached_property, get the soup using just lxml (the list-of-parsers approach seems unsupported).
  • URL: new .short attribute being a shortened URL for messages.
  • URL: new .ext property for the URL file extension.
  • URL: new isabs() method to test if a URL has a hostname and a path commencing with /
  • URL: support extending a URL with /

Release 20231129:

  • Drop Python 2 support.
  • No longer use cs.xml, which is going away.
  • Make _URL type public as URL with a new promote() method, drop URL factory function, update URL constructors throughout.
  • URL.__init__: make parameters keyword only.

Release 20191004: Small updates for changes to other modules.

Release 20160828: Use "install_requires" instead of "requires" in DISTINFO.

Release 20160827:

  • Handle TimeoutError, reporting elapsed time.
  • URL: present ._fetch as .GET.
  • URL: add .resolve to resolve this URL against a base URL.
  • URL: add .savepath and .unsavepath methods to generate nonconflicting save pathnames for URLs and the reverse.
  • URL._fetch: record the post-redirection URL as final_url.
  • New URLLimit class for specifying simple tests for URL acceptance.
  • New walk() method to walk a website from a starting URL, yielding URLs.
  • URL.content_length property, returns int or None if header missing.
  • New URL.normalised method to return URL with . and .. processed in the path.
  • New URL.exists test function.
  • Assorted bugfixes and improvements.

Release 20150116: Initial PyPI release.
