
Clean, filter and sample URLs

Project description


Features

  • Cleaning and filtering targeting non-spam HTML pages with primarily text

  • URL validation

  • Sampling by domain name

  • Command-line interface (CLI) and Python tool

Let the coURLan fish out juicy bits for you!

Courlan

Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).

Installation

This Python package is tested on Linux, macOS and Windows systems and is compatible with Python 3.4 upwards. It is available on the package repository PyPI and can be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)

Usage

The current focus is on German; for more, see settings.py. This can be overridden by cloning the repository and rebuilding the package locally.

Command-line

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-v] [-l] [-r] [-s]
               [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
               [--exclude-min EXCLUDE_MIN]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file
  -v, --verbose         increase output verbosity
  -l, --language        use language filter
  -r, --redirects       check redirects
  -s, --sample          use sampling
  --samplesize SAMPLESIZE
                        size of sample per domain
  --exclude-max EXCLUDE_MAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDE_MIN
                        exclude domains with less than n URLs
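To see how the options listed above combine, here is a minimal argparse sketch that mirrors the help text. It is a reconstruction for illustration only, not courlan's actual command-line code: the `build_parser` name, the integer types and the absence of default values are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options in the help text."""
    parser = argparse.ArgumentParser(prog="courlan")
    parser.add_argument("-i", "--inputfile", required=True,
                        help="name of input file")
    parser.add_argument("-o", "--outputfile", required=True,
                        help="name of output file")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="increase output verbosity")
    parser.add_argument("-l", "--language", action="store_true",
                        help="use language filter")
    parser.add_argument("-r", "--redirects", action="store_true",
                        help="check redirects")
    parser.add_argument("-s", "--sample", action="store_true",
                        help="use sampling")
    # Integer types are an assumption; the help text does not state them.
    parser.add_argument("--samplesize", type=int,
                        help="size of sample per domain")
    parser.add_argument("--exclude-max", type=int,
                        help="exclude domains with more than n URLs")
    parser.add_argument("--exclude-min", type=int,
                        help="exclude domains with less than n URLs")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-i", "url-list.txt", "-o", "cleaned-urls.txt",
     "-s", "--samplesize", "100", "--exclude-min", "5"]
)
print(args.sample, args.samplesize, args.exclude_min)
```

In this sketch the sampling options only take effect together with `-s`; whether the real tool enforces that is not stated in the help output.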

Python

All operations chained:

>>> from courlan.core import check_url
>>> url, domain_name = check_url(my_url)
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)

Cleaning only:

>>> from courlan.clean import clean_url
>>> my_url = clean_url(my_url)
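As a rough idea of what URL cleaning can involve, the following standard-library sketch trims whitespace, lowercases the host, drops the fragment and removes a few tracking parameters. It does not reproduce the rules used by `clean_url`; the `simple_clean` function and the `TRACKING_PARAMS` list are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters; courlan's own rules may differ.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def simple_clean(url):
    """Trim whitespace, drop the fragment and remove tracking parameters."""
    parts = urlsplit(url.strip())
    # Keep every query parameter that is not on the tracking list.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    # Lowercasing the whole netloc is a simplification (it would also
    # lowercase any user:password part).
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )


print(simple_clean("  HTTPS://Example.org/page?id=3&utm_source=feed#top "))
# prints: https://example.org/page?id=3
```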

URL validation:

>>> from courlan.filters import validate_url
>>> result, parsed_url = validate_url(my_url)
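The call above yields a boolean together with a parsed URL. A minimal standard-library sketch returning the same tuple shape could look like this; the checks (HTTP or HTTPS scheme, a host name containing a dot) are assumptions chosen for illustration and are not taken from `validate_url` itself.

```python
from urllib.parse import urlsplit


def simple_validate(url):
    """Return (is_valid, parsed_url); parsed_url is None when invalid."""
    try:
        parsed = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError on malformed input such as "http://[bad".
        return False, None
    if parsed.scheme not in ("http", "https"):
        return False, None
    # Assumption: require a host name with at least one dot.
    if not parsed.hostname or "." not in parsed.hostname:
        return False, None
    return True, parsed


result, parsed_url = simple_validate("https://example.org/page")
print(result, parsed_url.hostname)
# prints: True example.org
```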

Sampling by domain name:

>>> from courlan.core import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, verbose=False
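The idea behind sampling by domain can be sketched with the standard library: group URLs by host, optionally skip groups that are too small or too large, and keep at most a fixed number of URLs per group. This is an illustration under stated assumptions, not the code behind `sample_urls`; the `simple_sample` function and its `seed` parameter are hypothetical, and it groups by full host name rather than by registered domain.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def simple_sample(urls, samplesize, exclude_min=None, exclude_max=None, seed=None):
    """Keep at most `samplesize` URLs per host name."""
    rng = random.Random(seed)  # seed is only there to make runs repeatable
    by_host = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            by_host[host].append(url)
    sample = []
    for host in sorted(by_host):
        group = by_host[host]
        # Skip hosts with fewer URLs than exclude_min or more than exclude_max.
        if exclude_min is not None and len(group) < exclude_min:
            continue
        if exclude_max is not None and len(group) > exclude_max:
            continue
        if len(group) > samplesize:
            group = rng.sample(group, samplesize)
        sample.extend(group)
    return sample


urls = ["https://a.example/%d" % i for i in range(10)] + ["https://b.example/1"]
print(len(simple_sample(urls, 3, seed=0)))
# prints: 4
```

Here the ten URLs on the first host are reduced to three, while the single URL on the second host is kept as it is.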

Additional scripts

Scripts designed to handle URL lists are found under helpers.

License

coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What’s in it for business?

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page.

Author

This effort is part of a set of methods for deriving information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge lies in extracting and pre-processing web texts so that they meet scientific expectations: web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

Contact: see homepage or GitHub.
