Skip to main content

Clean, filter and sample URLs

Project description

Python package Python versions Travis build status Code Coverage

Features

  • Cleaning and filtering targeting non-spam HTML pages with primarily text

  • URL validation

  • Sampling by domain name

  • Command-line interface (CLI) and Python tool

Let the coURLan fish out juicy bits for you!

Courlan

Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).

Installation

This Python package is tested on Linux, macOS and Windows systems, it is compatible with Python 3.4 upwards. It is available on the package repository PyPI and can notably be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)

Usage

Current focus is on German, for more see settings.py. This can be overriden by cloning the repository and recompiling the package locally.

Command-line

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-v] [-l] [-r] [-s]

[–samplesize SAMPLESIZE] [–exclude-max EXCLUDE_MAX] [–exclude-min EXCLUDE_MIN]

optional arguments:
-h, --help

show this help message and exit

-i INPUTFILE, --inputfile INPUTFILE

name of input file

-o OUTPUTFILE, --outputfile OUTPUTFILE

name of input file

-v, --verbose

increase output verbosity

-l, --language

use language filter

-r, --redirects

check redirects

-s, --sample

use sampling

--samplesize SAMPLESIZE

size of sample per domain

--exclude-max EXCLUDE_MAX

exclude domains with more than n URLs

--exclude-min EXCLUDE_MIN

exclude domains with less than n URLs

Python

All operations chained:

>>> from courlan.core import check_url
>>> url, domain_name = check_url(my_url)
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)

Cleaning only:

>>> from courlan.clean import clean_url
>>> my_url = clean_url(my_url)

URL validation:

>>> from courlan.filters import validate_url
>>> result, parsed_url = validate_url(my_url)

Sampling by domain name:

>>> from courlan.core import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, verbose=False

Additional scripts

Scripts designed to handle URL lists are found under helpers.

License

coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bounded by the license conditions please try interacting at arms length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What’s in it for business?

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: Web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

Contact: see homepage or GitHub.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

courlan-0.1.0.tar.gz (182.6 kB view hashes)

Uploaded Source

Built Distribution

courlan-0.1.0-py3-none-any.whl (14.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page