Skip to main content

Clean, filter and sample URLs to optimize data collection. Includes spam, content type and language filters.

Project description

Python package Python versions Code Coverage

Why coURLan?

“Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained.” (Edwards et al. 2001)

Avoid loosing bandwidth capacity and processing time for webpages which are probably not worth the effort. This library provides an additional brain for web crawling, scraping and management of Internet archives. Specific fonctionality for crawlers: stay away from pages with little text content or target synoptic pages explicitly to gather links.

This navigation help targets text-based documents (i.e. currently web pages expected to be in HTML format) and tries to guess the language of pages to allow for language-focused collection. Additional functions include straightforward domain name extraction and URL sampling.

Features

Separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text. Most helpers revolve around the strict and language arguments:

  • Heuristics for triage of links
    • Targeting spam and unsuitable content-types

    • Language-aware filtering

    • Crawl management

  • URL handling
    • Validation

    • Canonicalization/Normalization

    • Sampling

  • Command-line interface (CLI) and Python tool

Let the coURLan fish out juicy bits for you!

Courlan

Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).

Installation

This Python package is tested on Linux, macOS and Windows systems, it is compatible with Python 3.5 upwards. It is available on the package repository PyPI and can notably be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)

Python

check_url()

All useful operations chained in check_url(url):

>>> from courlan import check_url
# returns url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)

Language-aware heuristics, notably internationalization in URLs, are available in lang_filter(url, language):

# optional argument targeting webpages in English or German
>>> url = 'https://www.un.org/en/about-us'
# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')
# failure: doesn't return anything
>>> check_url(url, language='de')
>>>
# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
('https://en.wikipedia.org', 'wikipedia.org')
>>> check_url(url, language='de', strict=True)
>>>

Define stricter restrictions on the expected content type with strict=True. Also blocks certain platforms and pages types crawlers should stay away from if they don’t target them explicitly and other black holes where machines get lost.

# strict filtering
>>> check_url('https://www.twitch.com/', strict=True)
# blocked as it is a major platform

Sampling by domain name

>>> from courlan import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False

Web crawling and URL handling

Determine if a link leads to another host:

>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True

Other useful functions dedicated to URL handling:

  • get_base_url(url): strip the URL of some of its parts

  • get_host_and_path(url): decompose URLs in two parts: protocol + host/domain and path

  • get_hostinfo(url): extract domain and host info (protocol + host/domain)

  • fix_relative_urls(baseurl, url): prepend necessary information to relative links

>>> from courlan import *
>>> url = 'https://www.un.org/en/about-us'
>>> get_base_url(url)
'https://www.un.org'
>>> get_host_and_path(url)
('https://www.un.org', '/en/about-us')
>>> get_hostinfo(url)
('un.org', 'https://www.un.org')
>>> fix_relative_urls('https://www.un.org', 'en/about-us')
'https://www.un.org/en/about-us'

Other filters dedicated to crawl frontier management:

  • is_not_crawlable(url): check for deep web or pages generally not usable in a crawling context

  • is_navigation_page(url): check for navigation and overview pages

>>> from courlan import is_navigation_page, is_not_crawlable
>>> is_navigation_page('https://www.randomblog.net/category/myposts')
True
>>> is_not_crawlable('https://www.randomblog.net/login')
True

Python helpers

Helper function, scrub and normalize:

>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'

Basic scrubbing only:

>>> from courlan import scrub_url

Basic canonicalization/normalization only, i.e. modifying and standardizing URLs in a consistent manner:

>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'

Basic URL validation only:

>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))

Command-line

The main fonctions are also available through a command-line utility.

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [--strict] [-l LANGUAGE] [-r] [--sample]
               [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
               [--exclude-min EXCLUDE_MIN]
optional arguments:
-h, --help

show this help message and exit

I/O:

Manage input and output

-i INPUTFILE, --inputfile INPUTFILE

name of input file (required)

-o OUTPUTFILE, --outputfile OUTPUTFILE

name of output file (required)

-d DISCARDEDFILE, --discardedfile DISCARDEDFILE

name of file to store discarded URLs (optional)

-v, --verbose

increase output verbosity

Filtering:

Configure URL filters

--strict

perform more restrictive tests

-l LANGUAGE, --language LANGUAGE

use language filter (ISO 639-1 code)

-r, --redirects

check redirects

Sampling:

Use sampling by host, configure sample size

--sample

use sampling

--samplesize SAMPLESIZE

size of sample per domain

--exclude-max EXCLUDE_MAX

exclude domains with more than n URLs

--exclude-min EXCLUDE_MIN

exclude domains with less than n URLs

License

coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bounded by the license conditions please try interacting at arms length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What’s in it for business?

Settings

courlan is optimized for English and German but its generic approach is also usable in other contexts.

To review details of strict URL filtering see settings.py. This can be overriden by cloning the repository and recompiling the package locally.

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.

Contact: see homepage or GitHub.

Software ecosystem: see this graphic.

Similar work

These Python libraries perform similar normalization tasks but don’t entail language or content filters. They also don’t necessarily focus on crawl optimization:

References

  • Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer networks and ISDN systems, 30(1-7), 161–172.

  • Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). “An adaptive model for optimizing performance of an incremental web crawler”. In Proceedings of the 10th international conference on World Wide Web - WWW ‘01. pp. 106–113.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

courlan-0.6.0.tar.gz (202.3 kB view details)

Uploaded Source

Built Distribution

courlan-0.6.0-py3-none-any.whl (26.8 kB view details)

Uploaded Python 3

File details

Details for the file courlan-0.6.0.tar.gz.

File metadata

  • Download URL: courlan-0.6.0.tar.gz
  • Upload date:
  • Size: 202.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.25.1 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for courlan-0.6.0.tar.gz
Algorithm Hash digest
SHA256 30dd02243951688275768b1025bccfd88396669210d7e435fa89960fe6c62faf
MD5 7e05f053dfcf380039a2aa87b1e94f6d
BLAKE2b-256 df0c8422d5a8f7966b28f35dce1e44a9b4407af2c15629ecf2301271b9d13262

See more details on using hashes here.

File details

Details for the file courlan-0.6.0-py3-none-any.whl.

File metadata

  • Download URL: courlan-0.6.0-py3-none-any.whl
  • Upload date:
  • Size: 26.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.25.1 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for courlan-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 2922aea7635d6a177d42ac93a3087d254c81fdc3b56178164bd933c8e3f061ab
MD5 e759ada40f3bfb95360b291268d77ca0
BLAKE2b-256 c8e6a42807923645ffaf6d15f1bb46003ac0fafa8c369647761de3b42de6c465

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page