Clean, filter, normalize, and sample URLs

Features

Separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text.

  • URL validation and (basic) normalization

  • Filters targeting spam and unsuitable content-types

  • Sampling by domain name

  • Command-line interface (CLI) and Python tool

Let the coURLan fish out juicy bits for you!

Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).

Installation

This Python package is tested on Linux, macOS and Windows systems and is compatible with Python 3.4 upwards. It is available on the package repository PyPI and can be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)

Usage

courlan is designed to work best on English, German and the most frequent European languages.

The current logic of detailed/strict URL filtering is focused on English and German; for more details, see settings.py. This can be overridden by cloning the repository and recompiling the package locally.

Python

All operations chained:

>>> from courlan import check_url
# returns url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
# optional argument targeting webpages in English or German
>>> url, domain_name = check_url(my_url, with_redirects=True, language='en')
>>> url, domain_name = check_url(my_url, with_redirects=True, language='de')
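
When a whole list of URLs has to be checked, a small wrapper can separate accepted from rejected entries. The following is only a sketch: `filter_urls` and `demo_checker` are hypothetical names, and it assumes that the checker (for instance `check_url`) returns a `(url, domain)` tuple for accepted URLs and `None` for rejected ones.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Assumed shape of a checker such as courlan.check_url:
# (url, domain) for accepted URLs, None for rejected ones.
Checker = Callable[[str], Optional[Tuple[str, str]]]


def filter_urls(
    urls: Iterable[str], checker: Checker
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split URLs into accepted (url, domain) pairs and rejected inputs."""
    kept: List[Tuple[str, str]] = []
    discarded: List[str] = []
    seen = set()
    for raw in urls:
        result = checker(raw.strip())
        if not result:
            discarded.append(raw)
            continue
        url, domain = result
        if url not in seen:  # drop duplicates that only show up after cleaning
            seen.add(url)
            kept.append((url, domain))
    return kept, discarded


# Stand-in checker so that the sketch runs without courlan installed;
# in practice one would pass courlan.check_url instead.
def demo_checker(url: str) -> Optional[Tuple[str, str]]:
    if not url.startswith(("http://", "https://")):
        return None
    return url.rstrip("/"), url.split("/")[2]


kept, discarded = filter_urls(
    [
        "https://github.com/adbar/courlan",
        "https://github.com/adbar/courlan/",
        "ftp://example.org",
    ],
    demo_checker,
)
print(kept)
print(discarded)
```

Passing the checker as an argument keeps the loop independent of the options used (`strict`, `language`, `with_redirects`), which can be bound beforehand with `functools.partial`.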

Helper function, scrub and normalize:

>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'

Basic scrubbing only:

>>> from courlan import scrub_url

Basic normalization only:

>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
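
To make the idea behind this kind of normalization concrete, here is a minimal sketch based on the standard library only. It is not the code used by courlan: `normalize_sketch` and the `TRACKING_PREFIXES` tuple are hypothetical, and the sketch ignores user information and IPv6 hosts.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameter prefixes, for illustration only.
TRACKING_PREFIXES = ("utm_",)
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_sketch(url: str, strict: bool = False) -> str:
    """Lowercase scheme and host, drop default port and fragment, sort the query."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # keep the port only if it is not the default one for the scheme
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if strict:
        # remove parameters that only serve tracking purposes
        pairs = [(k, v) for k, v in pairs if not k.startswith(TRACKING_PREFIXES)]
    query = urlencode(sorted(pairs))
    # the fragment is dropped by passing an empty string
    return urlunsplit((scheme, host, parts.path, query, ""))


print(normalize_sketch(
    "http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment",
    strict=True,
))
```

Sorting the query elements means that two URLs differing only in parameter order end up identical, which makes deduplication straightforward.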

Basic URL validation only:

>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
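
The return value is a pair: a boolean and, if the URL is valid, the parsed result. A rough standard-library approximation of such a check could look as follows. The function `validate_sketch` and its rules are assumptions made for illustration (they also reject bare IP addresses) and do not reproduce the checks performed by courlan.

```python
from typing import Optional, Tuple
from urllib.parse import ParseResult, urlparse


def validate_sketch(url: str) -> Tuple[bool, Optional[ParseResult]]:
    """Return (True, parsed) for plausible web URLs, (False, None) otherwise."""
    try:
        parsed = urlparse(url)
    except ValueError:  # e.g. malformed IPv6 brackets
        return False, None
    if parsed.scheme not in ("http", "https"):
        return False, None
    labels = (parsed.hostname or "").split(".")
    # require at least two non-empty labels and a non-numeric last label
    if len(labels) < 2 or not all(labels) or labels[-1].isdigit():
        return False, None
    return True, parsed


print(validate_sketch("http://1234"))
print(validate_sketch("http://www.example.org/"))
```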

Sampling by domain name:

>>> from courlan import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
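
The principle of sampling by domain can be sketched with the standard library: group the URLs, skip groups that are too small or too large, and draw at most a fixed number of URLs from each remaining group. The function `sample_by_host` and its `seed` parameter are hypothetical, and the sketch groups by full host name rather than by registered domain, so it is an approximation and not the package's implementation.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Optional
from urllib.parse import urlsplit


def sample_by_host(
    urls: Iterable[str],
    samplesize: int,
    exclude_min: Optional[int] = None,
    exclude_max: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Draw at most `samplesize` URLs per host."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    by_host: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            by_host[host].append(url)
    sample: List[str] = []
    for host in sorted(by_host):
        group = by_host[host]
        if exclude_min is not None and len(group) < exclude_min:
            continue  # too few URLs for this host
        if exclude_max is not None and len(group) > exclude_max:
            continue  # too many URLs for this host
        if len(group) > samplesize:
            group = rng.sample(group, samplesize)
        sample.extend(group)
    return sample


urls = [f"https://a.example.org/{i}" for i in range(10)] + ["https://b.example.net/1"]
print(len(sample_by_host(urls, 3, seed=0)))
```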

Determine if a link leads to another host:

>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
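
The effect of `ignore_suffix` can be illustrated with a simplified comparison of host names. This sketch is not the package's implementation: `split_host`, `is_external_sketch` and the tiny `KNOWN_SUFFIXES` tuple are hypothetical, and a realistic version would rely on the Public Suffix List instead of a hard-coded list.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Tiny, hypothetical suffix list for illustration only.
KNOWN_SUFFIXES = ("co.uk", "com", "org", "net", "de")


def split_host(url: str) -> Tuple[str, str]:
    """Return (name, suffix) for the URL's host, e.g. ('google', 'co.uk')."""
    host = (urlsplit(url).hostname or "").lower()
    # try longer suffixes first so that 'co.uk' wins over shorter matches
    for suffix in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if host.endswith("." + suffix):
            name = host[: -len(suffix) - 1].split(".")[-1]
            return name, suffix
    return host, ""


def is_external_sketch(url: str, reference: str, ignore_suffix: bool = True) -> bool:
    """Compare domain names, optionally taking the suffix into account."""
    name, suffix = split_host(url)
    ref_name, ref_suffix = split_host(reference)
    if ignore_suffix:
        return name != ref_name
    return (name, suffix) != (ref_name, ref_suffix)


print(is_external_sketch("https://google.com/", "https://www.google.co.uk/"))
print(is_external_sketch("https://google.com/", "https://www.google.co.uk/", ignore_suffix=False))
```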

Command-line

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [--strict] [-l {de,en}] [-r] [--sample]
               [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
               [--exclude-min EXCLUDE_MIN]

optional arguments:
  -h, --help            show this help message and exit

I/O:
  Manage input and output

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file (required)
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file (required)
  -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                        name of file to store discarded URLs (optional)
  -v, --verbose         increase output verbosity

Filtering:
  Configure URL filters

  --strict              perform more restrictive tests
  -l, --language        use language filter {de,en}
  -r, --redirects       check redirects

Sampling:
  Use sampling by host, configure sample size

  --sample              use sampling
  --samplesize SAMPLESIZE
                        size of sample per domain
  --exclude-max EXCLUDE_MAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDE_MIN
                        exclude domains with less than n URLs
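
For readers who prefer to stay in Python, the I/O options above (input file, output file, optional file for discarded URLs) can be mirrored with a few lines. This is a sketch under assumptions: `process_file` is a hypothetical helper, not part of the package, and the checker passed to it is assumed to return a `(url, domain)` tuple for accepted URLs and `None` otherwise.

```python
from pathlib import Path
from typing import Callable, List, Optional, Tuple, Union

PathLike = Union[str, Path]


def process_file(
    inputfile: PathLike,
    outputfile: PathLike,
    checker: Callable[[str], Optional[Tuple[str, str]]],
    discardedfile: Optional[PathLike] = None,
) -> Tuple[int, int]:
    """Read one URL per line, write accepted and (optionally) discarded URLs."""
    kept: List[str] = []
    discarded: List[str] = []
    for line in Path(inputfile).read_text(encoding="utf-8").splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip empty lines
        result = checker(candidate)
        if result:
            kept.append(result[0])
        else:
            discarded.append(candidate)
    Path(outputfile).write_text("".join(u + "\n" for u in kept), encoding="utf-8")
    if discardedfile is not None:
        Path(discardedfile).write_text(
            "".join(u + "\n" for u in discarded), encoding="utf-8"
        )
    return len(kept), len(discarded)
```

With courlan installed, a call could look like `process_file('url-list.txt', 'cleaned-urls.txt', check_url)`, provided that `check_url` behaves as assumed above.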

Additional scripts

Scripts designed to handle URL lists are found under helpers.

License

coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What’s in it for business?

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: Web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

Contact: see homepage or GitHub.

Download files

Source distribution: courlan-0.2.3.tar.gz (188.3 kB)

Built distribution: courlan-0.2.3-py3-none-any.whl (17.7 kB, Python 3)
