Clean, filter, normalize, and sample URLs

Features

Separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text.

  • URL validation and (basic) normalization

  • Filters targeting spam and unsuitable content-types

  • Sampling by domain name

  • Command-line interface (CLI) and Python tool

Let the coURLan fish out juicy bits for you!

Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).

Installation

This Python package is tested on Linux, macOS and Windows systems and is compatible with Python 3.5 upwards. It is available on the package repository PyPI and can be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)

Usage

courlan is designed to work best on English, German and the most frequent European languages.

The current logic of detailed/strict URL filtering is focused on English and German; for more details see settings.py. This can be overridden by cloning the repository and recompiling the package locally.

Python

All operations chained:

>>> from courlan import check_url
# returns url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
# optional argument targeting webpages in English or German
>>> url, domain_name = check_url(my_url, with_redirects=True, language='en')
>>> url, domain_name = check_url(my_url, with_redirects=True, language='de')

Helper function, scrub and normalize:

>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'

Basic scrubbing only:

>>> from courlan import scrub_url

Basic normalization only:

>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
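
The strict normalization shown above (dropping tracking parameters and the fragment, then re-ordering the query) can be approximated with the standard library alone. This is a minimal sketch of the idea, not courlan's actual implementation: the function name `strip_and_sort_query` and the `TRACKING_PREFIXES` tuple are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking-parameter prefixes, for illustration only
TRACKING_PREFIXES = ("utm_",)


def strip_and_sort_query(url):
    """Drop tracking parameters and the fragment, then sort the remaining query."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(kept))
    # passing an empty string as the last element removes the fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(strip_and_sort_query(
    "http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment"
))
```

Sorting the query elements means that two URLs differing only in parameter order collapse to the same string, which helps deduplication before a crawl.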

Basic URL validation only:

>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
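
To give a rough idea of what a basic validation step may involve, here is a standard-library sketch that returns the same kind of tuple as above. The function `basic_validate` and its rules (http/https only, a dot-separated host whose last label is not numeric) are assumptions for illustration; the checks performed by `validate_url` itself may differ.

```python
from urllib.parse import urlparse


def basic_validate(url):
    """Return (True, ParseResult) for plausible web URLs, else (False, None)."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname or ""
    except ValueError:
        # raised by urllib for malformed input such as broken IPv6 literals
        return False, None
    if parsed.scheme not in ("http", "https"):
        return False, None
    # simplification: require at least two non-empty labels and a
    # non-numeric last label, which also rejects bare IPv4 addresses
    labels = host.split(".")
    if len(labels) < 2 or not all(labels) or labels[-1].isdigit():
        return False, None
    return True, parsed


print(basic_validate("http://1234"))
print(basic_validate("http://www.example.org/")[0])
```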

Sampling by domain name:

>>> from courlan import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
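
The principle behind sampling can be sketched as follows: group the URLs, skip groups that are too small or too large, and draw at most a fixed number of URLs from each remaining group. This is an illustration under stated assumptions, not the code behind `sample_urls`: the function `sample_by_host` and its `seed` parameter are hypothetical, and it groups by network location rather than by registered domain name.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_by_host(urls, samplesize, exclude_min=None, exclude_max=None, seed=None):
    """Keep at most `samplesize` URLs per host, optionally skipping small or large hosts."""
    rng = random.Random(seed)
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)
    sample = []
    for host in sorted(by_host):
        group = by_host[host]
        # skip hosts with fewer than exclude_min or more than exclude_max URLs
        if exclude_min is not None and len(group) < exclude_min:
            continue
        if exclude_max is not None and len(group) > exclude_max:
            continue
        if len(group) > samplesize:
            group = rng.sample(group, samplesize)
        sample.extend(group)
    return sample


my_urls = [f"https://a.org/page{i}" for i in range(10)] + ["https://b.org/index"]
print(len(sample_by_host(my_urls, 3, seed=1)))
```

Capping the number of URLs per host keeps a few very large sites from dominating the crawl frontier.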

Determine if a link leads to another host:

>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default behavior: suffixes are ignored
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True

Other useful functions:

  • fix_relative_urls(): prepend necessary information to relative links

  • get_base_url(): strip the URL of some of its parts

  • get_host_and_path(): decompose URLs into two parts: protocol + host/domain and path

  • get_hostinfo(): extract domain and host info (protocol + host/domain)
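
The kind of work these helpers do can be illustrated with `urllib.parse` from the standard library. The two functions below, `absolutize` and `split_host_and_path`, are hypothetical names used for this sketch; they are not the courlan functions listed above, whose signatures and edge-case handling are not shown here.

```python
from urllib.parse import urljoin, urlsplit


def absolutize(base_url, link):
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(base_url, link)


def split_host_and_path(url):
    """Split a URL into 'scheme://host' and the path (query and fragment dropped)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}", parts.path or "/"


print(absolutize("https://example.org/dir/page.html", "../about.html"))
print(split_host_and_path("https://example.org/dir/page.html?x=1"))
```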

Other filters:

  • is_not_crawlable(url): check for deep web or pages generally not usable in a crawling context

  • is_navigation_page(url): check for navigation and overview pages

  • lang_filter(url, language): heuristics concerning internationalization in URLs

Command-line

The main functions are also available through a command-line utility.

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [--strict] [-l {de,en}] [-r] [--sample]
               [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
               [--exclude-min EXCLUDE_MIN]

optional arguments:
  -h, --help            show this help message and exit

I/O:
  Manage input and output

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file (required)
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file (required)
  -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                        name of file to store discarded URLs (optional)
  -v, --verbose         increase output verbosity

Filtering:
  Configure URL filters

  --strict              perform more restrictive tests
  -l {de,en}, --language {de,en}
                        use language filter
  -r, --redirects       check redirects

Sampling:
  Use sampling by host, configure sample size

  --sample              use sampling
  --samplesize SAMPLESIZE
                        size of sample per domain
  --exclude-max EXCLUDE_MAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDE_MIN
                        exclude domains with less than n URLs
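
For readers who want to mirror these options in their own scripts, the help text above corresponds roughly to the following `argparse` setup. This is a sketch reconstructed from the help output, not the package's source code: the integer types for the sampling options and the absence of default values are assumptions.

```python
import argparse


def build_parser():
    """Build a parser exposing the options listed in the help text."""
    parser = argparse.ArgumentParser(prog="courlan")

    group_io = parser.add_argument_group("I/O", "Manage input and output")
    group_io.add_argument("-i", "--inputfile", required=True,
                          help="name of input file (required)")
    group_io.add_argument("-o", "--outputfile", required=True,
                          help="name of output file (required)")
    group_io.add_argument("-d", "--discardedfile",
                          help="name of file to store discarded URLs (optional)")
    group_io.add_argument("-v", "--verbose", action="store_true",
                          help="increase output verbosity")

    group_filter = parser.add_argument_group("Filtering", "Configure URL filters")
    group_filter.add_argument("--strict", action="store_true",
                              help="perform more restrictive tests")
    group_filter.add_argument("-l", "--language", choices=["de", "en"],
                              help="use language filter")
    group_filter.add_argument("-r", "--redirects", action="store_true",
                              help="check redirects")

    group_sample = parser.add_argument_group(
        "Sampling", "Use sampling by host, configure sample size")
    group_sample.add_argument("--sample", action="store_true", help="use sampling")
    # assumption: the numeric options are parsed as integers
    group_sample.add_argument("--samplesize", type=int,
                              help="size of sample per domain")
    group_sample.add_argument("--exclude-max", type=int,
                              help="exclude domains with more than n URLs")
    group_sample.add_argument("--exclude-min", type=int,
                              help="exclude domains with less than n URLs")
    return parser


args = build_parser().parse_args(
    ["-i", "url-list.txt", "-o", "cleaned-urls.txt", "--strict", "-l", "de"]
)
print(args.inputfile, args.outputfile, args.strict, args.language)
```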

Additional scripts

Scripts designed to handle URL lists are found under helpers.

License

coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What’s in it for business?

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: Web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

Contact: see homepage or GitHub.

Similar work

These Python libraries perform similar normalization tasks but don’t entail language or content filters. They also don’t necessarily focus on crawl optimization:
