Clean, filter, normalize, and sample URLs
Project description
Why coURLan?
Avoid losing bandwidth capacity and processing time on webpages which are probably not worth the effort. This library provides an additional brain for web crawling, scraping and management of Internet archives. Specific functionality for crawlers: stay away from pages with little text content, or explicitly target synoptic pages to gather links.
This navigation aid targets text-based documents (currently web pages expected to be in HTML format) and tries to guess the language of pages to allow for language-focused collection. Additional functions include straightforward domain name extraction and URL sampling.
Features
Separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text. Most helpers revolve around the strict and language arguments:
- Heuristics for triage of links
Targeting spam and unsuitable content-types
Language-aware filtering
Crawl management
- URL handling
Validation
Canonicalization/Normalization
Sampling
Command-line interface (CLI) and Python tool
Let the coURLan fish out juicy bits for you!
Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).
Installation
This Python package is tested on Linux, macOS and Windows systems; it is compatible with Python 3.5 upwards. It is available on the package repository PyPI and can notably be installed with the Python package managers pip and pipenv:
$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)
Python
check_url()
All useful operations chained in check_url(url):
>>> from courlan import check_url
# returns url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
Language-aware heuristics, notably internationalization in URLs, are available in lang_filter(url, language):
# optional argument targeting webpages in English or German
>>> url = 'https://www.un.org/en/about-us'
# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')
# failure: doesn't return anything
>>> check_url(url, language='de')
>>>
# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
('https://en.wikipedia.org', 'wikipedia.org')
>>> check_url(url, language='de', strict=True)
>>>
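The underlying language filter can also be called on its own. Below is a minimal sketch, assuming lang_filter is importable from the package (it may live in courlan.filters depending on the version) and returns a boolean:
>>> from courlan import lang_filter
# True if the URL appears to match the target language
>>> lang_filter('https://www.un.org/en/about-us', 'en')
True
# False otherwise, mirroring the check_url behavior above
>>> lang_filter('https://www.un.org/en/about-us', 'de')
False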
Impose stricter restrictions on the expected content type with strict=True. This setting also blocks certain platforms and page types which crawlers should stay away from unless they target them explicitly, as well as other black holes where machines get lost.
# strict filtering
>>> check_url('https://www.twitch.com/', strict=True)
# blocked as it is a major platform
Sampling by domain name
>>> from courlan import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
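A slightly more explicit sketch, using a hypothetical URL list and the optional arguments listed above:
>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/1', 'https://example.org/2', 'https://www.un.org/en/about-us']
# draw up to 1 URL per domain, skipping domains with more than 100 URLs
>>> my_sample = sample_urls(my_urls, 1, exclude_max=100)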
Web crawling and URL handling
Determine if a link leads to another host:
>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
Other useful functions dedicated to URL handling:
get_base_url(url): strip the URL down to its base (protocol + host/domain)
get_host_and_path(url): decompose URLs in two parts: protocol + host/domain and path
get_hostinfo(url): extract domain and host info (protocol + host/domain)
fix_relative_urls(baseurl, url): prepend necessary information to relative links
>>> from courlan import *
>>> url = 'https://www.un.org/en/about-us'
>>> get_base_url(url)
'https://www.un.org'
>>> get_host_and_path(url)
('https://www.un.org', '/en/about-us')
>>> get_hostinfo(url)
('un.org', 'https://www.un.org')
>>> fix_relative_urls('https://www.un.org', 'en/about-us')
'https://www.un.org/en/about-us'
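The straightforward domain name extraction mentioned in the introduction is also available. A small sketch, assuming extract_domain is part of the public API in this version:
>>> from courlan import extract_domain
# returns the domain name only, i.e. the second element returned by check_url above
>>> extract_domain('https://github.com/adbar/courlan')
'github.com'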
Other filters dedicated to crawl frontier management:
is_not_crawlable(url): check for deep web or pages generally not usable in a crawling context
is_navigation_page(url): check for navigation and overview pages
>>> from courlan import is_navigation_page, is_not_crawlable
>>> is_navigation_page('https://www.randomblog.net/category/myposts')
True
>>> is_not_crawlable('https://www.randomblog.net/login')
True
Python helpers
Helper function to scrub and normalize in one step:
>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'
Basic scrubbing only:
>>> from courlan import scrub_url
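A minimal sketch of the expected behavior, assuming scrubbing removes surrounding whitespace and trailing slashes (the exact output may differ by version):
>>> scrub_url(' https://www.dwds.de/ ')
'https://www.dwds.de'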
Basic canonicalization/normalization only:
>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
Basic URL validation only:
>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
Command-line
The main functions are also available through a command-line utility.
$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
[--strict] [-l LANGUAGE] [-r] [--sample]
[--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
[--exclude-min EXCLUDE_MIN]
optional arguments:
  -h, --help            show this help message and exit

I/O:
  Manage input and output

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file (required)
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file (required)
  -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                        name of file to store discarded URLs (optional)
  -v, --verbose         increase output verbosity

Filtering:
  Configure URL filters

  --strict              perform more restrictive tests
  -l LANGUAGE, --language LANGUAGE
                        use language filter (ISO 639-1 code)
  -r, --redirects       check redirects

Sampling:
  Use sampling by host, configure sample size

  --sample              use sampling
  --samplesize SAMPLESIZE
                        size of sample per domain
  --exclude-max EXCLUDE_MAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDE_MIN
                        exclude domains with less than n URLs
License
coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.
See also GPL and free software licensing: What’s in it for business?
Settings
courlan is optimized for English and German but its generic approach is also usable in other contexts.
To review the details of strict URL filtering, see settings.py. These settings can be overridden by cloning the repository and recompiling the package locally.
Contributing
Contributions are welcome!
Feel free to file issues on the dedicated page.
Similar work
These Python libraries perform similar normalization tasks but don’t entail language or content filters. They also don’t necessarily focus on crawl optimization: