Ural

A helper library full of URL-related heuristics.

Installation

You can install ural with pip using the following command:

pip install ural

Usage


ensure_protocol

A function checking whether the url has a protocol and prepending the given one if it has none.

from ural import ensure_protocol

ensure_protocol('www2.lemonde.fr', protocol='https')
>>> 'https://www2.lemonde.fr'

Arguments

  • url string: URL to format.
  • protocol string: protocol to add if url has none. Defaults to 'http'.
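
The behavior described above can be approximated with the standard library alone. The following is a minimal sketch, not ural's actual implementation: the name ensure_protocol_sketch and the PROTOCOL_RE pattern are assumptions made for illustration.

```python
import re

# Matches a leading scheme such as "http://" or "ftp://".
# This pattern is an assumption for illustration, not ural's own regex.
PROTOCOL_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://')


def ensure_protocol_sketch(url, protocol='http'):
    """Prepend `protocol` only when `url` does not already start with one."""
    if PROTOCOL_RE.match(url):
        return url
    return protocol + '://' + url


print(ensure_protocol_sketch('www2.lemonde.fr', protocol='https'))
print(ensure_protocol_sketch('ftp://www2.lemonde.fr'))
```

A url that already carries a protocol is returned unchanged, whatever value is passed as protocol.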

force_protocol

A function force-replacing the protocol of the given url.

from ural import force_protocol

force_protocol('https://www2.lemonde.fr', protocol='ftp')
>>> 'ftp://www2.lemonde.fr'

Arguments

  • url string: URL to format.
  • protocol string: protocol wanted in the output url. Defaults to 'http'.
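
To make the contrast with a conditional prepend explicit, here is a rough standard-library sketch of the replacement logic. The name force_protocol_sketch and the PROTOCOL_RE pattern are hypothetical, and handling a url that has no protocol by simply prepending one is an assumption of this sketch rather than documented ural behavior.

```python
import re

# Leading scheme pattern; an assumption for illustration only.
PROTOCOL_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://')


def force_protocol_sketch(url, protocol='http'):
    """Drop any existing protocol, then prepend the wanted one."""
    bare = PROTOCOL_RE.sub('', url, count=1)
    return protocol + '://' + bare


print(force_protocol_sketch('https://www2.lemonde.fr', protocol='ftp'))
```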

is_url

A function returning True if its argument is a url.

from ural import is_url

is_url('https://www2.lemonde.fr')
>>> True

Arguments

  • string string: string to test.
  • require_protocol boolean: whether the argument must have a protocol to be considered a url. Defaults to True.
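
The effect of require_protocol can be illustrated with a simplified check built on urllib.parse. This is only a sketch under stated assumptions: the name looks_like_url is hypothetical, and the rule "a hostname containing a dot" is a deliberately crude stand-in for whatever heuristic ural applies.

```python
from urllib.parse import urlsplit


def looks_like_url(string, require_protocol=True):
    """Crude url test: no whitespace, and a hostname containing a dot."""
    candidate = string.strip()
    if not candidate or any(char.isspace() for char in candidate):
        return False

    if '://' not in candidate:
        if require_protocol:
            return False
        # Add a placeholder protocol so that urlsplit can find the hostname.
        candidate = 'http://' + candidate

    try:
        host = urlsplit(candidate).hostname or ''
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. broken IPv6 literals.
        return False

    return '.' in host.strip('.')


print(looks_like_url('https://www2.lemonde.fr'))
print(looks_like_url('www2.lemonde.fr'))
print(looks_like_url('www2.lemonde.fr', require_protocol=False))
```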

normalize_url

A function normalizing the given url by stripping parts that usually do not discriminate between resources, such as irrelevant query items or sub-domains.

This is useful for matching urls that point to the same resource but were written slightly differently, for instance when shared on social media.

from ural import normalize_url

normalize_url('https://www2.lemonde.fr/index.php?utm_source=google')
>>> 'lemonde.fr'

Arguments

  • url string: URL to normalize.
  • sort_query boolean [True]: whether to sort query items.
  • strip_authentication boolean [True]: whether to strip authentication.
  • strip_index boolean [True]: whether to strip trailing index.
  • strip_trailing_slash boolean [False]: whether to strip trailing slash.
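
To give an idea of what such a normalization involves, here is a simplified sketch using only urllib.parse. It is not ural's implementation: the name normalize_url_sketch, the list of tracking prefixes, and the index and www patterns are all assumptions chosen for illustration, and the real function most likely covers more cases.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumptions made for this sketch only:
TRACKING_PREFIXES = ('utm_',)                       # query items treated as irrelevant
INDEX_RE = re.compile(r'/index\.(?:html?|php)$')    # trailing index files
WWW_RE = re.compile(r'^www\d*\.')                   # "www.", "www2.", ...


def normalize_url_sketch(url, sort_query=True, strip_authentication=True,
                         strip_index=True, strip_trailing_slash=False):
    # urlsplit only finds the hostname when a protocol is present.
    if '://' not in url:
        url = 'http://' + url
    parts = urlsplit(url)

    # Hostname without the "www"-like sub-domain, keeping an explicit port.
    host = WWW_RE.sub('', parts.hostname or '')
    if parts.port:
        host += ':%d' % parts.port
    if not strip_authentication and parts.username:
        auth = parts.username
        if parts.password:
            auth += ':' + parts.password
        host = auth + '@' + host

    path = parts.path
    if strip_index:
        path = INDEX_RE.sub('', path)
    if strip_trailing_slash:
        path = path.rstrip('/')

    # Drop tracking items, then optionally sort the remaining ones.
    items = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES)
    ]
    if sort_query:
        items.sort()

    result = host + path
    if items:
        result += '?' + urlencode(items)
    return result


print(normalize_url_sketch('https://www2.lemonde.fr/index.php?utm_source=google'))
```

The protocol is dropped from the result in this sketch, which matches the 'lemonde.fr' output shown in the example above.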

strip_protocol

A function removing the protocol from the url.

from ural import strip_protocol

strip_protocol('https://www2.lemonde.fr/index.php')
>>> 'www2.lemonde.fr/index.php'

Arguments

  • url string: URL to format.

urls_from_html

A function returning an iterator over the urls present in the links of the given HTML text.

from ural import urls_from_html

html = """<p>Hey! Check this site: <a href="https://medialab.sciencespo.fr/">médialab</a></p>"""

for url in urls_from_html(html):
    print(url)
>>> 'https://medialab.sciencespo.fr/'

Arguments

  • string string: html string.
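
A comparable extraction can be sketched with the standard library's html.parser module. The names LinkCollector and urls_from_html_sketch are hypothetical, and restricting the search to the href attribute of a tags is an assumption about what "links" means here, not a description of ural's internals.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                self.urls.append(value)


def urls_from_html_sketch(string):
    # The whole document is parsed first, then the collected urls are yielded.
    collector = LinkCollector()
    collector.feed(string)
    collector.close()
    yield from collector.urls


html = """<p>Hey! Check this site: <a href="https://medialab.sciencespo.fr/">médialab</a></p>"""

for url in urls_from_html_sketch(html):
    print(url)
```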

urls_from_text

A function returning an iterator over the urls present in the string argument. Extracts only the urls with a protocol.

from ural import urls_from_text

text = "Hey! Check this site: https://medialab.sciencespo.fr/, it looks really cool. They're developing many tools on https://github.com/"

for url in urls_from_text(text):
    print(url)
>>> 'https://medialab.sciencespo.fr/'
>>> 'https://github.com/'

Arguments

  • string string: source string.
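
The idea of extracting only urls that carry a protocol can be sketched with a single regular expression. This is an illustrative approximation: the name urls_from_text_sketch, the URL_RE pattern (limited to http and https) and the set of trailing punctuation characters are assumptions, and ural's own pattern is likely more thorough.

```python
import re

# Assumed pattern: an http(s) protocol followed by non-space characters.
URL_RE = re.compile(r'https?://[^\s<>]+')

# Punctuation that usually belongs to the sentence rather than the url.
TRAILING_PUNCTUATION = '.,;:!?)'


def urls_from_text_sketch(string):
    """Lazily yield each protocol-prefixed url found in `string`."""
    for match in URL_RE.finditer(string):
        yield match.group(0).rstrip(TRAILING_PUNCTUATION)


text = ("Hey! Check this site: https://medialab.sciencespo.fr/, it looks really cool. "
        "They're developing many tools on https://github.com/")

for url in urls_from_text_sketch(text):
    print(url)
```

Stripping trailing punctuation is what keeps the comma after the first url out of the result.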

