
Clean, filter, normalize, and sample URLs


Features

Separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text.

  • URL validation and (basic) normalization
  • Filters targeting spam and unsuitable content-types
  • Sampling by domain name
  • Command-line interface (CLI) and Python tool

Let the coURLan fish out juicy bits for you!

Illustration: a courlan bird (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).

Installation

This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.5 and above. It is available on the package repository PyPI and can notably be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)

Usage

courlan is designed to work best on English, German and the most frequent European languages.

The current logic of detailed/strict URL filtering focuses on English and German; see settings.py for details. It can be overridden by cloning the repository and recompiling the package locally.

Python

All operations chained:

>>> from courlan import check_url
# returns url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
# optional argument targeting webpages in English or German
>>> url, domain_name = check_url(my_url, with_redirects=True, language='en')
>>> url, domain_name = check_url(my_url, with_redirects=True, language='de')

Helper function, scrub and normalize:

>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'

Basic scrubbing only:

>>> from courlan import scrub_url
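
For instance, a minimal sketch (the exact scrubbing steps, such as trimming surrounding whitespace, are assumed here rather than guaranteed):

>>> scrub_url('  https://www.dwds.de  ')
'https://www.dwds.de'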

Basic normalization only:

>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'

Basic URL validation only:

>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))

Sampling by domain name:

>>> from courlan import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
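
The optional arguments listed above can be combined in one call, e.g. to skip very small and very large domains (the values below are purely illustrative):

>>> my_sample = sample_urls(my_urls, 100, exclude_min=10, exclude_max=10000, strict=True, verbose=True)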

Determine if a link leads to another host:

>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# ignore_suffix=True is the default behavior
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True

Other useful functions:

  • fix_relative_urls(): prepend necessary information to relative links
  • get_base_url(): strip the URL of some of its parts
  • get_host_and_path(): decompose a URL into two parts: protocol + host/domain, and path
  • get_hostinfo(): extract domain and host info (protocol + host/domain)
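
A hedged sketch of these helpers in action (the example URL and the exact return values are assumptions, not guaranteed output):

>>> from courlan import fix_relative_urls, get_base_url, get_host_and_path, get_hostinfo
# assumed: the base URL is prepended to the relative link
>>> fix_relative_urls('https://www.example.org', 'page.html')
'https://www.example.org/page.html'
# assumed: only protocol and host are kept
>>> get_base_url('https://www.example.org/path/page.html')
'https://www.example.org'
# assumed: split into host part and path part
>>> get_host_and_path('https://www.example.org/path/page.html')
('https://www.example.org', '/path/page.html')
# assumed: domain name plus host info
>>> get_hostinfo('https://www.example.org/path/page.html')
('example.org', 'https://www.example.org')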

Other filters:

  • is_not_crawlable(url): check for deep web or pages generally not usable in a crawling context
  • is_navigation_page(url): check for navigation and overview pages
  • lang_filter(url, language): heuristics concerning internationalization in URLs
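
A short, hedged illustration (the example URLs and results are assumptions):

>>> from courlan import is_navigation_page, is_not_crawlable, lang_filter
# assumed: category and overview pages count as navigation
>>> is_navigation_page('https://www.example.org/category/economy')
True
# assumed: login pages are flagged as not crawlable
>>> is_not_crawlable('https://www.example.org/login')
True
# assumed: a language marker in the path matches the target language
>>> lang_filter('https://www.example.org/en/page.html', 'en')
True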

Command-line

The main functions are also available through a command-line utility.

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [--strict] [-l {de,en}] [-r] [--sample]
               [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
               [--exclude-min EXCLUDE_MIN]

optional arguments:
  -h, --help            show this help message and exit

I/O:
  Manage input and output

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file (required)
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file (required)
  -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                        name of file to store discarded URLs (optional)
  -v, --verbose         increase output verbosity

Filtering:
  Configure URL filters

  --strict              perform more restrictive tests
  -l {de,en}, --language {de,en}
                        use language filter
  -r, --redirects       check redirects

Sampling:
  Use sampling by host, configure sample size

  --sample              use sampling
  --samplesize SAMPLESIZE
                        size of sample per domain
  --exclude-max EXCLUDE_MAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDE_MIN
                        exclude domains with less than n URLs

Additional scripts

Scripts designed to handle URL lists can be found in the helpers directory of the repository.

License

coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm’s length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What’s in it for business?

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: Web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

Contact: see homepage or GitHub.

Similar work

These Python libraries perform similar normalization tasks but don’t entail language or content filters. They also don’t necessarily focus on crawl optimization:
