convenience functions for working with URLs
Project description
URL related utility functions and classes.
- Cameron Simpson cs@cskk.id.au 26dec2011
Latest release 20260531:
- Require html5lib and lxml and python 3 urllib modules.
- URL.flush: clean out defined cached attributes.
- URL: new session() context manager to make a requests.Session.
- URL: hrefs, srcs: return a URLs collection.
- URL: Promotable and Formatable.
- URL: replace caching methods .GET() and .HEAD() with @cached_property .GET_response and .HEAD_response.
- URL: new .url_parsed property being the namedtuple from urlparse, drop .parts.
- URL: new .query_dict() method, returning the query parameters as a dict.
- URL: new .cleanpath and .cleanrpath properties.
- URL: new .urlto(other_url) to resolve other_url against self, use it in hrefs() and srcs().
- URL: rename content_type to content_type_full, make content_type the plain text/html value.
- URL: make .text a cached_property, get the soup using just lxml (the list-of-parsers approach seems unsupported).
- URL: new .short attribute being a shortened URL for messages.
- URL: new .ext property for the URL file extension.
- URL: new isabs() method to test if a URL has a hostname and a path commencing with /
- URL: support extending a URL with /
Short summary:
- NetrcHTTPPasswordMgr: A subclass of HTTPPasswordMgrWithDefaultRealm that consults the .netrc file if no overriding credentials have been stored.
- skip_url_errs: A version of cs.seq.skip_map which skips URLError and HTTPError.
- strip_whitespace: Strip whitespace characters from a string, per HTML 4.01 section 1.6 and appendix E.
- URL: Utility class to do simple stuff to URLs, subclasses str.
- urljoin: This is urllib.parse.urljoin after coercing both arguments to str.
Module contents:
class NetrcHTTPPasswordMgr(urllib.request.HTTPPasswordMgrWithDefaultRealm):
A subclass of HTTPPasswordMgrWithDefaultRealm that consults the .netrc file if no overriding credentials have been stored.
skip_url_errs(func, *iterables, **skip_map_kw):
A version of cs.seq.skip_map which skips URLError and HTTPError.
strip_whitespace(s):
Strip whitespace characters from a string, per HTML 4.01 section 1.6 and appendix E.
class URL(cs.threads.HasThreadState, cs.lex.FormatableMixin, cs.deco.Promotable):
Utility class to do simple stuff to URLs, subclasses str.
URL.__init__(self, url_s: str, referer=None, soup=None, text=None):
Initialise the URL from the URL string url_s.
URL.__getattr__(self, attr):
Ad hoc attributes.
Upper case attributes named "FOO" parse the text and find
the (sole) node named "foo".
Upper case attributes named "FOOs" parse the text and find
all the nodes named "foo".
URL.__truediv__(self, subpath):
Return a new URL with subpath appended.
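As a rough illustration of the `/` operator described above, here is a minimal sketch of a str subclass that appends a subpath. The class name `PathURL` and the slash handling are assumptions made for this example; the real URL class may join components differently.

```python
class PathURL(str):
    """Hypothetical str subclass supporting `url / subpath` (not the real URL class)."""

    def __truediv__(self, subpath):
        # Assumption: use exactly one "/" between the two parts.
        joined = self.rstrip("/") + "/" + str(subpath).lstrip("/")
        return type(self)(joined)


base = PathURL("https://example.com/docs/")
page = base / "intro.html"
print(page)  # https://example.com/docs/intro.html
```

Because the result is again a `PathURL`, the operator can be chained, as in `base / "a" / "b"`.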
URL.basename:
The URL basename.
URL.cleanrpath:
The cleanpath with its leading slash stripped.
URL.content:
The decoded URL content as bytes.
URL.content_length:
The value of the Content-Length: header or None.
URL.content_transfer_encoding:
The URL content transfer encoding.
URL.context
URL.default_limit(self):
Default URLLimit for this URL: same host:port, any subpath.
URL.domain:
The URL domain - the hostname with the first dotted component removed.
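The "first dotted component removed" rule can be sketched with the standard library alone. The function name `domain_of` is hypothetical, and the treatment of hostnames without a dot is an assumption of this sketch.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the hostname with its first dotted component removed."""
    hostname = urlparse(url).hostname
    if hostname is None:
        return None
    # partition splits at the first "." only; with no dot, the remainder is "".
    _, _, rest = hostname.partition(".")
    return rest


print(domain_of("https://www.example.com:8080/index.html"))  # example.com
```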
URL.exists(self) -> bool:
Test if this URL exists via a HEAD request.
URL.ext:
The URL basename file extension, as from os.path.splitext.
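A small sketch of the splitext behaviour, using only the standard library. The helper name `url_ext` is hypothetical, and `posixpath` is used here (rather than `os.path`) so that the result does not depend on the host platform. Note that only the final suffix is returned for names such as `report.tar.gz`.

```python
import posixpath
from urllib.parse import urlparse


def url_ext(url):
    """Return the extension of the URL path basename, including the dot."""
    basename = posixpath.basename(urlparse(url).path)
    return posixpath.splitext(basename)[1]


print(url_ext("https://example.com/files/report.tar.gz?dl=1"))  # .gz
```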
URL.feedparsed(self):
A parse of the content via the feedparser module.
URL.find_all(self, *a, **kw):
Convenience routine to call BeautifulSoup's .find_all() method.
URL.flush(self):
Forget all cached content.
URL.format_kwargs(self):
Return a dict for use with FormatableMixin.format_as().
URL.fragment:
The URL fragment as returned by urlparse.urlparse.
URL.headers:
A requests.Response headers mapping.
URL.hostname:
The URL hostname as returned by urlparse.urlparse.
URL.hrefs(self, absolute=False) -> Iterable[ForwardRef('URL')]:
All 'href=' values from the content HTML 'A' tags.
If absolute, resolve the hrefs with respect to our URL.
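The following sketch shows the general idea of collecting 'href=' values from 'A' tags and optionally resolving them. It deliberately swaps in the standard library `html.parser` instead of the BeautifulSoup/lxml parse the module uses, and the names `HrefCollector` and `collect_hrefs` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags (stdlib stand-in for a soup parse)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_hrefs(html, base_url, absolute=False):
    """Return the hrefs in `html`, resolved against `base_url` if `absolute`."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    if absolute:
        return [urljoin(base_url, href) for href in collector.hrefs]
    return list(collector.hrefs)


HTML = '<p><a href="b.html">B</a> <a href="/c">C</a> <a name="x">no href</a></p>'
print(collect_hrefs(HTML, "https://example.com/a/index.html", absolute=True))
```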
URL.isabs(self):
Test whether this URL is absolute, having a hostname and
a path commencing with '/'.
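The test described above can be approximated as follows. This is a sketch of the stated rule (a hostname plus a path commencing with '/'), not the module's own code; `is_absolute` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_absolute(url):
    """True if the URL has a hostname and a path starting with "/"."""
    parsed = urlparse(url)
    return bool(parsed.hostname) and parsed.path.startswith("/")


print(is_absolute("https://example.com/a/b"))  # True
print(is_absolute("/a/b"))                     # False: no hostname
print(is_absolute("https://example.com"))      # False: empty path
```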
URL.last_modified:
The value of the Last-Modified: header as a UNIX timestamp, or None.
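Converting a Last-Modified header to a UNIX timestamp can be sketched with `email.utils`. The function name and the plain dict standing in for a headers mapping are assumptions of this example; how the module itself parses the header is not stated here.

```python
from email.utils import parsedate_to_datetime


def last_modified_timestamp(headers):
    """Return the Last-Modified header as a UNIX timestamp, or None if absent."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    # HTTP dates use the same basic format as email dates, with a GMT zone.
    return parsedate_to_datetime(value).timestamp()


print(last_modified_timestamp({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}))
```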
URL.netloc:
The URL netloc as returned by urlparse.urlparse.
URL.normalised(self):
Return a normalised URL where "." and ".." components have been processed.
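One way to process "." and ".." components is sketched below. This is a simplified illustration, not the module's algorithm: the function name `normalise` is hypothetical, and a trailing ".." drops the trailing slash, which differs from the RFC 3986 dot-segment removal.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url):
    """Return `url` with "." and ".." path components processed."""
    parts = urlsplit(url)
    kept = []
    for component in parts.path.split("/"):
        if component == ".":
            continue
        if component == "..":
            # Never pop the leading empty component that marks the root.
            if kept and kept[-1] != "":
                kept.pop()
            continue
        kept.append(component)
    return urlunsplit(parts._replace(path="/".join(kept)))


print(normalise("https://example.com/a/./b/../c?q=1"))  # https://example.com/a/c?q=1
```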
URL.params:
The URL params as returned by urlparse.urlparse.
URL.password:
The URL password as returned by urlparse.urlparse.
URL.path:
The URL path as returned by urlparse.urlparse.
URL.path_elements:
Return the non-empty path components; NB: a new list every time.
URL.port:
The URL port as returned by urlparse.urlparse.
URL.promote(obj):
Promote obj to an instance of cls.
Instances of cls are passed through unchanged.
str is promoted directly to cls(obj).
(url,referer) is promoted to cls(url,referer=referer).
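The three promotion rules above can be sketched with a small str subclass. `RefURL` is a hypothetical stand-in for illustration only; the real URL class has more base classes and a different constructor.

```python
class RefURL(str):
    """Hypothetical str subclass carrying an optional referer."""

    def __new__(cls, url_s, referer=None):
        self = super().__new__(cls, url_s)
        self.referer = referer
        return self

    @classmethod
    def promote(cls, obj):
        # Instances pass through unchanged.
        if isinstance(obj, cls):
            return obj
        # A plain str is promoted directly.
        if isinstance(obj, str):
            return cls(obj)
        # Otherwise expect a (url, referer) pair.
        try:
            url, referer = obj
        except (TypeError, ValueError):
            raise TypeError(f"cannot promote {obj!r} to {cls.__name__}") from None
        return cls(url, referer=referer)


u = RefURL.promote("https://example.com/")
v = RefURL.promote(("https://example.com/a", "https://example.com/"))
print(v, v.referer)
```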
URL.query:
The URL query as returned by urlparse.urlparse.
URL.query_dict(self):
Return a new dict containing the parsed param=value pairs from self.query.
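A query string can be turned into a dict with `urllib.parse.parse_qsl`, as sketched below. The standalone function is a stand-in for the method, and keeping the last value for a repeated parameter is an assumption of this sketch; the module's behaviour for repeated keys is not described here.

```python
from urllib.parse import parse_qsl, urlparse


def query_dict(url):
    """Return a new dict of the param=value pairs in the URL query."""
    # dict() keeps the last value when a parameter is repeated.
    return dict(parse_qsl(urlparse(url).query))


print(query_dict("https://example.com/search?q=url+utils&page=2"))
# {'q': 'url utils', 'page': '2'}
```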
URL.resolve(self, base):
Resolve this URL with respect to a base URL.
URL.rpath:
The URL path as returned by urlparse.urlparse, after any leading slashes.
URL.savepath(self, rootdir):
Compute a local filesystem save pathname for this URL.
This scheme is designed to accommodate the fact that 'a',
'a/' and 'a/b' can all coexist.
Extend any component ending in '.' with another '.'.
Extend directory components with '.d.'.
URL.scheme:
The URL scheme as returned by urlparse.urlparse.
URL.session(self, session=None):
Context manager yielding a requests.Session.
URL.short:
A shortened form of the URL for use in messages.
URL.srcs(self, *a, **kw):
All 'src=' values from the content HTML.
If absolute, resolve the sources with respect to our URL.
URL.unsavepath(savepath):
Compute URL path component from a savepath as returned by URL.savepath.
This should always round trip with URL.savepath.
URL.urlto(self, other: Union[ForwardRef('URL'), str]) -> 'URL':
Return other resolved against self.baseurl.
If other is an absolute URL it will not be changed.
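The resolution behaviour, including the pass-through of absolute URLs, matches what `urllib.parse.urljoin` does. The sketch below uses a hypothetical free function `url_to` and assumes the base URL is simply the URL itself.

```python
from urllib.parse import urljoin


def url_to(base, other):
    """Resolve `other` against `base`; absolute URLs come back unchanged."""
    return urljoin(str(base), str(other))


base = "https://example.com/docs/index.html"
print(url_to(base, "img/logo.png"))             # https://example.com/docs/img/logo.png
print(url_to(base, "/top"))                     # https://example.com/top
print(url_to(base, "https://other.example/x"))  # https://other.example/x
```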
URL.username:
The URL username as returned by urlparse.urlparse.
URL.walk(self, limit=None, seen=None, follow_redirects=False):
Walk a website from this URL yielding this and all descendent URLs.
limit: an object with a constraint test method "ok".
If not supplied, limit URLs to the same host and port.
seen: a setlike object with a "contains" method and an "add" method.
URLs already in the set will not be yielded or visited.
follow_redirects: whether to follow URL redirects
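The interplay of limit and seen can be sketched without any network access by walking an in-memory link table. Everything here is an assumption made for illustration: the `LINKS` table, the `SameHostLimit` class, the breadth-first order, and the free function `walk`. Redirect handling is left out.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "website": page URL -> URLs that page links to.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.example/"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": [],
}


class SameHostLimit:
    """Hypothetical limit object: accept URLs with the same host:port."""

    def __init__(self, url):
        self.netloc = urlparse(url).netloc

    def ok(self, url):
        return urlparse(url).netloc == self.netloc


def walk(start, links, limit=None, seen=None):
    """Yield `start` and every reachable URL accepted by `limit`, once each."""
    if limit is None:
        limit = SameHostLimit(start)
    if seen is None:
        seen = set()
    queue = deque([start])
    while queue:
        url = queue.popleft()
        # Skip URLs already visited or rejected by the limit.
        if url in seen or not limit.ok(url):
            continue
        seen.add(url)
        yield url
        queue.extend(links.get(url, ()))


print(list(walk("https://example.com/", LINKS)))
```

Passing a shared `seen` set lets several walks avoid revisiting each other's URLs.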
URL.xml_find_all(self, match):
Convenience routine to call ElementTree.XML's .findall() method.
Release Log
Release 20260531:
- Require html5lib and lxml and python 3 urllib modules.
- URL.flush: clean out defined cached attributes.
- URL: new session() context manager to make a requests.Session.
- URL: hrefs, srcs: return a URLs collection.
- URL: Promotable and Formatable.
- URL: replace caching methods .GET() and .HEAD() with @cached_property .GET_response and .HEAD_response.
- URL: new .url_parsed property being the namedtuple from urlparse, drop .parts.
- URL: new .query_dict() method, returning the query parameters as a dict.
- URL: new .cleanpath and .cleanrpath properties.
- URL: new .urlto(other_url) to resolve other_url against self, use it in hrefs() and srcs().
- URL: rename content_type to content_type_full, make content_type the plain text/html value.
- URL: make .text a cached_property, get the soup using just lxml (the list-of-parsers approach seems unsupported).
- URL: new .short attribute being a shortened URL for messages.
- URL: new .ext property for the URL file extension.
- URL: new isabs() method to test if a URL has a hostname and a path commencing with /
- URL: support extending a URL with /
Release 20231129:
- Drop Python 2 support.
- No longer use cs.xml, which is going away.
- Make _URL type public as URL with a new promote() method, drop URL factory function, update URL constructors throughout.
- URL.__init__: make parameters keyword only.
Release 20191004: Small updates for changes to other modules.
Release 20160828: Use "install_requires" instead of "requires" in DISTINFO.
Release 20160827:
- Handle TimeoutError, reporting elapsed time.
- URL: present ._fetch as .GET.
- URL: add .resolve to resolve this URL against a base URL.
- URL: add .savepath and .unsavepath methods to generate nonconflicting save pathnames for URLs and the reverse.
- URL._fetch: record the post-redirection URL as final_url.
- New URLLimit class for specifying simple tests for URL acceptance.
- New walk() method to walk a website from a starting URL, yielding URLs.
- URL.content_length property, returns int or None if header missing.
- New URL.normalised method to return URL with . and .. processed in the path.
- New URL.exists test function.
- Assorted bugfixes and improvements.
Release 20150116: Initial PyPI release.
File details
Details for the file cs_urlutils-20260531.tar.gz.
File metadata
- Download URL: cs_urlutils-20260531.tar.gz
- Upload date:
- Size: 11.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1bac7fa0623ebeb25003afa3c1897a5abca7991dd5561bcb9c1085d42cfa49cb |
| MD5 | 10a7654e24421f159fee6a103a6eab79 |
| BLAKE2b-256 | f7bb11766130d3fe6573ac3663fab850eb51140f0b067bb3113d8d41d33a6f16 |
File details
Details for the file cs_urlutils-20260531-py2.py3-none-any.whl.
File metadata
- Download URL: cs_urlutils-20260531-py2.py3-none-any.whl
- Upload date:
- Size: 11.4 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6745c4da577ff22d4748417a9795e9560e4c590aad36793364e694ff96b91319 |
| MD5 | f1180666b8a610f22d13e4f03bf38927 |
| BLAKE2b-256 | d23287252720c2152f5f3fef49f95076345238dfa722db045dbea5d1f7ad39dc |