Clean, filter, normalize, and sample URLs

Features

Separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text.

  • URL validation and (basic) normalization

  • Filters targeting spam and unsuitable content-types

  • Sampling by domain name

  • Command-line interface (CLI) and Python tool

Let the coURLan fish out juicy bits for you!

Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).

Installation

This Python package is tested on Linux, macOS and Windows systems and is compatible with Python 3.4 upwards. It is available on the package repository PyPI and can be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)

Usage

courlan is designed to work best on English, German and the most frequent European languages.

The current logic of detailed/strict URL filtering is focused on English and German; for more, see settings.py. This can be overridden by cloning the repository and recompiling the package locally.

Python

All operations chained:

>>> from courlan import check_url
# returns url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# check for redirects (HEAD request); my_url stands for any URL string, the value below is only a placeholder
>>> my_url = 'https://www.example.org/'
>>> url, domain_name = check_url(my_url, with_redirects=True)
# optional argument targeting webpages in English or German
>>> url, domain_name = check_url(my_url, with_redirects=True, language='en')
>>> url, domain_name = check_url(my_url, with_redirects=True, language='de')

Helper function, scrub and normalize:

>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'

Basic scrubbing only:

>>> from courlan import scrub_url

Basic normalization only:

>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
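
To make the strict normalization step more concrete, here is a minimal sketch using only the standard library. It is not courlan's implementation: the function name `sketch_normalize` and the `TRACKING_PREFIXES` tuple are hypothetical, and the real filter rules are likely more extensive.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking-parameter prefixes (assumption, not courlan's rules)
TRACKING_PREFIXES = ("utm_",)


def sketch_normalize(url):
    """Drop tracking parameters and the fragment, then sort the query elements."""
    parts = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(pairs))
    # An empty string as last element removes the fragment
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))


print(sketch_normalize("http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment"))
# prints: http://test.net/foo.html?page=2&post=abc
```

Sorting the query elements means that two URLs differing only in parameter order end up as the same string, which helps with deduplication.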

Basic URL validation only:

>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
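
The kind of check involved can be approximated with the standard library. The sketch below is an assumption about what "basic validation" may cover, not courlan's actual logic: `sketch_validate` is a hypothetical name, it returns a `SplitResult` rather than a `ParseResult`, and it deliberately rejects bare IP addresses and internationalized top-level domains for the sake of brevity.

```python
from urllib.parse import urlsplit


def sketch_validate(url):
    """Return (True, parsed) for plausible http(s) URLs, else (False, None)."""
    try:
        parsed = urlsplit(url)
    except ValueError:
        return False, None
    if parsed.scheme not in ("http", "https"):
        return False, None
    host = parsed.hostname or ""
    labels = host.split(".")
    # Assumption: require at least two non-empty labels and an alphabetic last label
    if len(labels) < 2 or not all(labels) or not labels[-1].isalpha():
        return False, None
    return True, parsed


print(sketch_validate("http://1234"))
# prints: (False, None)
print(sketch_validate("http://www.example.org/")[0])
# prints: True
```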

Sampling by domain name:

>>> from courlan import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
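
The idea behind sampling by domain name can be sketched as follows with the standard library. This is an illustration under stated assumptions rather than the package's code: `sketch_sample` and its `seed` parameter are hypothetical, and URLs are grouped by host (netloc) instead of by registered domain.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sketch_sample(urls, samplesize, exclude_min=None, exclude_max=None, seed=None):
    """Keep at most `samplesize` URLs per host, optionally skipping small or large hosts."""
    rng = random.Random(seed)
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc.lower()].append(url)
    sample = []
    for host in sorted(by_host):
        group = by_host[host]
        if exclude_min is not None and len(group) < exclude_min:
            continue  # too few URLs for this host
        if exclude_max is not None and len(group) > exclude_max:
            continue  # too many URLs for this host
        if len(group) > samplesize:
            group = rng.sample(group, samplesize)
        sample.extend(group)
    return sample


urls = ["https://a.org/%d" % i for i in range(5)] + ["https://b.org/1"]
print(len(sketch_sample(urls, 2)))
# prints: 3
```

Capping the number of URLs per host keeps a few very large sites from dominating a crawl list.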

Determine if a link leads to another host:

>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
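
For illustration, a simplified host comparison can be written with the standard library alone. The names `sketch_host` and `sketch_is_external` are hypothetical, and the sketch only compares hostnames after stripping a leading "www."; ignoring suffixes such as .com versus .co.uk would require a public suffix list, which is left out here.

```python
from urllib.parse import urlsplit


def sketch_host(url):
    """Return the lowercased hostname without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def sketch_is_external(url, reference):
    """True if both URLs point to different hosts (suffixes are taken into account)."""
    return sketch_host(url) != sketch_host(reference)


print(sketch_is_external("https://github.com/", "https://www.microsoft.com/"))
# prints: True
print(sketch_is_external("https://www.github.com/adbar", "https://github.com/"))
# prints: False
```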

Command-line

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [--strict] [-l {de,en}] [-r] [--sample]
               [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
               [--exclude-min EXCLUDE_MIN]

optional arguments:
  -h, --help            show this help message and exit

I/O:
  Manage input and output

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file (required)
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file (required)
  -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                        name of file to store discarded URLs (optional)
  -v, --verbose         increase output verbosity

Filtering:
  Configure URL filters

  --strict              perform more restrictive tests
  -l, --language        use language filter {de,en}
  -r, --redirects       check redirects

Sampling:
  Use sampling by host, configure sample size

  --sample              use sampling
  --samplesize SAMPLESIZE
                        size of sample per domain
  --exclude-max EXCLUDE_MAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDE_MIN
                        exclude domains with less than n URLs

Additional scripts

Scripts designed to handle URL lists are found under helpers.

License

coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What’s in it for business?

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: Web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

Contact: see homepage or GitHub.

Similar work

These Python libraries perform similar normalization tasks but don’t entail language or content filters:
