Clean, filter and sample URLs
Features
Cleaning and filtering targeting non-spam HTML pages with primarily text
URL validation
Sampling by domain name
Command-line interface (CLI) and Python tool
Let the coURLan fish out juicy bits for you!
Here is a courlan (source: Limpkin at Harn’s Marsh by Russ, CC BY 2.0).
Installation
This Python package is tested on Linux, macOS and Windows systems and is compatible with Python 3.4 upwards. It is available on the package repository PyPI and can be installed with the Python package managers pip and pipenv:
$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)
Usage
The current focus is on German; for more, see settings.py. This can be overridden by cloning the repository and recompiling the package locally.
Command-line
$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
- usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-v] [-l] [-r] [-s]
[--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX] [--exclude-min EXCLUDE_MIN]
- optional arguments:
- -h, --help
show this help message and exit
- -i INPUTFILE, --inputfile INPUTFILE
name of input file
- -o OUTPUTFILE, --outputfile OUTPUTFILE
name of output file
- -v, --verbose
increase output verbosity
- -l, --language
use language filter
- -r, --redirects
check redirects
- -s, --sample
use sampling
- --samplesize SAMPLESIZE
size of sample per domain
- --exclude-max EXCLUDE_MAX
exclude domains with more than n URLs
- --exclude-min EXCLUDE_MIN
exclude domains with less than n URLs
Python
All operations chained:
>>> from courlan.core import check_url
>>> url, domain_name = check_url(my_url)
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)
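When a whole list of URLs is processed, it may help to wrap the call so that a rejected entry does not stop the run. The following is a minimal sketch, not part of courlan: `filter_urls` and its `checker` parameter are hypothetical names, and treating a `None` return value or a raised `ValueError`/`TypeError` as "rejected" is an assumption about failure behaviour rather than documented API.

```python
def filter_urls(urls, checker):
    """Apply `checker` to each URL and keep the (url, domain_name) pairs.

    `checker` is expected to return a (url, domain_name) pair, as
    check_url does in the examples above. Treating None or an exception
    as a rejection is an assumption, not documented behaviour.
    """
    kept = []
    for raw in urls:
        candidate = raw.strip()
        if not candidate:
            continue  # skip blank lines
        try:
            result = checker(candidate)
        except (ValueError, TypeError):
            continue  # assumed failure mode: treat as rejected
        if result is None:
            continue  # assumed failure mode: treat as rejected
        url, domain_name = result
        kept.append((url, domain_name))
    return kept
```

With courlan installed, this could be called as `filter_urls(lines, check_url)`, where `lines` is any iterable of strings such as an open text file.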
Cleaning only:
>>> from courlan.clean import clean_url
>>> my_url = clean_url(my_url)
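To give a rough idea of what URL cleaning can involve, here is a standard-library sketch. It is an illustration only and does not reproduce what `clean_url` does internally; the function name `basic_clean` and the choice of steps (trimming whitespace, dropping the fragment, lowercasing scheme and host) are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def basic_clean(url):
    """Trim whitespace, drop the fragment, lowercase scheme and host."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # note: also lowercases any user:password part
        parts.path,            # path and query are case-sensitive, left as-is
        parts.query,
        "",                    # empty fragment
    ))
```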
URL validation:
>>> from courlan.filters import validate_url
>>> result, parsed_url = validate_url(my_url)
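The two return values can be read as a success flag and the parsed URL. As a hedged illustration of that shape, the sketch below performs a much simpler check with the standard library; `basic_validate` is a hypothetical name, and its rules (HTTP or HTTPS scheme, host name containing a dot) are assumptions that are likely less thorough than courlan's own checks.

```python
from urllib.parse import urlsplit


def basic_validate(url):
    """Return (is_valid, parsed_url); parsed_url is None when invalid."""
    try:
        parsed = urlsplit(url)
        host = parsed.hostname
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs
        return False, None
    if parsed.scheme not in ("http", "https"):
        return False, None
    if not host or "." not in host:
        return False, None
    return True, parsed
```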
Sampling by domain name:
>>> from courlan.core import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, verbose=False
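The idea behind sampling by domain can be sketched with the standard library alone. This is not courlan's implementation: `sample_by_domain` and its `seed` parameter are hypothetical, the grouping uses the full host name rather than the registered domain, and the meaning given to `exclude_min` and `exclude_max` follows the command-line help text above.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_by_domain(urls, samplesize, exclude_min=None, exclude_max=None, seed=None):
    """Keep at most `samplesize` randomly chosen URLs per host name."""
    rng = random.Random(seed)
    by_host = defaultdict(list)
    for url in urls:
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # skip malformed URLs
        if host:
            by_host[host].append(url)

    sample = []
    for host in sorted(by_host):
        group = by_host[host]
        # Skip hosts with fewer than exclude_min or more than exclude_max URLs
        if exclude_min is not None and len(group) < exclude_min:
            continue
        if exclude_max is not None and len(group) > exclude_max:
            continue
        if len(group) > samplesize:
            group = rng.sample(group, samplesize)
        sample.extend(group)
    return sample
```

Passing a fixed `seed` makes the sample reproducible, which can be convenient when comparing runs.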
Additional scripts
Scripts designed to handle URL lists are found under helpers.
License
coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.
See also GPL and free software licensing: What’s in it for business?
Contributing
Contributions are welcome!
Feel free to file issues on the dedicated page.