Skip to main content

Clean, filter, normalize, and sample URLs

Project description

Python package Python versions Travis build status Code Coverage

Features

Separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text.

  • URL validation and (basic) normalization

  • Filters targeting spam and unsuitable content-types

  • Sampling by domain name

  • Command-line interface (CLI) and Python tool

Let the coURLan fish out juicy bits for you!

Courlan

Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).

Installation

This Python package is tested on Linux, macOS and Windows systems, it is compatible with Python 3.4 upwards. It is available on the package repository PyPI and can notably be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)

Usage

courlan is designed to work best on English, German and most frequent European languages.

The current logic of detailed/strict URL filtering is on German, for more see settings.py. This can be overriden by cloning the repository and recompiling the package locally.

Python

All operations chained:

>>> from courlan.core import check_url
# returns url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
# optional argument targeting webpages in German: with_language=False

Helper function, scrub and normalize:

>>> from courlan.clean import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'

Basic scrubbing only:

>>> from courlan.clean import scrub_url

Basic normalization only:

>>> from urllib.parse import urlparse
>>> from courlan.clean import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'

Basic URL validation only:

>>> from courlan.filters import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))

Sampling by domain name:

>>> from courlan.core import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False

Command-line

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]

[–strict] [-l] [-r] [–sample] [–samplesize SAMPLESIZE] [–exclude-max EXCLUDE_MAX] [–exclude-min EXCLUDE_MIN]

optional arguments:
-h, --help

show this help message and exit

I/O:

Manage input and output

-i INPUTFILE, --inputfile INPUTFILE

name of input file (required)

-o OUTPUTFILE, --outputfile OUTPUTFILE

name of output file (required)

-d DISCARDEDFILE, --discardedfile DISCARDEDFILE

name of file to store discarded URLs (optional)

-v, --verbose

increase output verbosity

Filtering:

Configure URL filters

--strict

perform more restrictive tests

-l, --language

use language filter

-r, --redirects

check redirects

Sampling:

Use sampling by host, configure sample size

--sample

use sampling

--samplesize SAMPLESIZE

size of sample per domain

--exclude-max EXCLUDE_MAX

exclude domains with more than n URLs

--exclude-min EXCLUDE_MIN

exclude domains with less than n URLs

Additional scripts

Scripts designed to handle URL lists are found under helpers.

License

coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bounded by the license conditions please try interacting at arms length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What’s in it for business?

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: Web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

Contact: see homepage or GitHub.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

courlan-0.2.1.tar.gz (186.4 kB view details)

Uploaded Source

Built Distribution

courlan-0.2.1-py3-none-any.whl (16.5 kB view details)

Uploaded Python 3

File details

Details for the file courlan-0.2.1.tar.gz.

File metadata

  • Download URL: courlan-0.2.1.tar.gz
  • Upload date:
  • Size: 186.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for courlan-0.2.1.tar.gz
Algorithm Hash digest
SHA256 a32860e72006d66984dcf3f4a9e5ff8ff9926f4bb51023d2d9f0aea3178543fd
MD5 e7d5682b096eef78dfd78c38fd681d23
BLAKE2b-256 fa758a540270c3f486d58a4c8b8c63477133b6f317c328d25c6322b25dbf1e0d

See more details on using hashes here.

File details

Details for the file courlan-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: courlan-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 16.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for courlan-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 e650345eed6668bb7db870a6da452ef3954eb7a67bc648bd950b73d12c69d0d1
MD5 c64e22f94fb6275fe0879ee459f60006
BLAKE2b-256 5576763b014dbb5c261a0b45117bf32d36325529277c0c42ba58e156f6213309

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page