Clean, filter, normalize, and sample URLs
Features
Separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text.
URL validation and (basic) normalization
Filters targeting spam and unsuitable content-types
Sampling by domain name
Command-line interface (CLI) and Python tool
Let the coURLan fish out juicy bits for you!
Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).
Installation
This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.4 upwards. It is available from the PyPI package repository and can be installed with the Python package managers pip and pipenv:
$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)
Usage
courlan is designed to work best on English, German and the most frequent European languages.
The current logic of detailed/strict URL filtering is focused on English and German; for more information see settings.py. This can be overridden by cloning the repository and recompiling the package locally.
Python
All operations chained:
>>> from courlan import check_url
# returns url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
# optional argument targeting webpages in English or German
>>> url, domain_name = check_url(my_url, with_redirects=True, language='en')
>>> url, domain_name = check_url(my_url, with_redirects=True, language='de')
Helper function, scrub and normalize:
>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'
Basic scrubbing only:
>>> from courlan import scrub_url
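For instance, surrounding whitespace is removed (a minimal illustration; the exact scrubbing rules, e.g. for markup remnants, depend on the package version):
>>> scrub_url('  https://www.dwds.de  ')
'https://www.dwds.de'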
Basic normalization only:
>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
Basic URL validation only:
>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
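The returned tuple can be unpacked to gate further processing (a minimal sketch reusing the result above):
>>> valid, parsed = validate_url('http://www.example.org/')
>>> if valid:
...     print(parsed.netloc)
...
www.example.org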
Sampling by domain name:
>>> from courlan import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
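A minimal sketch of a typical call (my_urls here is a hypothetical list of URL strings; the second argument caps the number of URLs kept per domain):
>>> my_urls = ['https://example.org/page' + str(i) for i in range(1000)]
>>> my_sample = sample_urls(my_urls, 100)
# my_sample contains at most 100 URLs for example.org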
Determine if a link leads to another host:
>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
Command-line
$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [--strict] [-l {de,en}] [-r] [--sample]
               [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
               [--exclude-min EXCLUDE_MIN]

optional arguments:
  -h, --help            show this help message and exit

I/O:
  Manage input and output

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file (required)
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file (required)
  -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                        name of file to store discarded URLs (optional)
  -v, --verbose         increase output verbosity

Filtering:
  Configure URL filters

  --strict              perform more restrictive tests
  -l {de,en}, --language {de,en}
                        use language filter
  -r, --redirects       check redirects

Sampling:
  Use sampling by host, configure sample size

  --sample              use sampling
  --samplesize SAMPLESIZE
                        size of sample per domain
  --exclude-max EXCLUDE_MAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDE_MIN
                        exclude domains with less than n URLs
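The options can be combined; for example, to apply strict filtering with a German language filter and keep a sample of 10 URLs per domain (hypothetical file names):
$ courlan --inputfile raw-urls.txt --outputfile filtered-urls.txt --strict --language de --sample --samplesize 10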
Additional scripts
Scripts designed to handle URL lists are found under helpers.
License
coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.
See also GPL and free software licensing: What’s in it for business?
Contributing
Contributions are welcome!
Feel free to file issues on the dedicated page.