Simple Python 3 module to crawl a website and extract URLs
Project description
Simple Python module to crawl a website and extract URLs.
Installation
Using pip:
pip3 install sitecrawl
sitecrawl --help
Or build from sources:
# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl
# Installation
pip3 install .
Usage
CLI
sitecrawl --url http://www.gab.lc --depth 3
# Add --verbose for verbose mode
->
* Found 4 internal URLs
  http://www.gab.lc
  http://www.gab.lc/articles
  http://www.gab.lc/contact
  http://www.gab.lc/about
* Found 8 external URLs
  https://gpgtools.org/
  http://en.wikipedia.org/wiki/GNU_General_Public_License
  http://en.wikipedia.org/wiki/Pretty_Good_Privacy
  http://en.wikipedia.org/wiki/GNU_Privacy_Guard
  https://www.gpgtools.org
  https://www.google.com/#hl=en&q=install+gpg+windows
  http://www.gnupg.org/gph/en/manual/x135.html
  http://keys.gnupg.net
* Skipped 0 URLs
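To illustrate what a depth limit such as --depth 3 generally means, here is a minimal sketch of a breadth-first, depth-limited traversal over an in-memory link graph. The LINKS graph, the crawl_to_depth function, and the convention that the start page sits at depth 0 are assumptions made for illustration only; sitecrawl's own implementation and its exact depth semantics may differ.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
# It stands in for real HTTP fetches so the sketch runs offline.
LINKS = {
    "/": ["/articles", "/about"],
    "/articles": ["/articles/1", "/"],
    "/articles/1": ["/articles/1/comments"],
    "/about": ["/contact"],
}


def crawl_to_depth(start, depth):
    """Return pages reachable from `start` by following at most `depth` links."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, level = queue.popleft()
        if level == depth:
            continue  # depth limit reached: do not follow links from this page
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return sorted(seen)


print(crawl_to_depth("/", 1))
print(crawl_to_depth("/", 2))
```

Under these assumptions, a larger depth only ever adds pages: each extra level follows the links found on the pages discovered at the previous level, and the `seen` set keeps a page from being visited twice.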
As a module
Basic example:
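from sitecrawl import crawl
# Set the site to crawl
crawl.base_url = 'https://www.github.com'
# Crawl the site, following links up to the given depth
crawl.deep_crawl(depth=2)
# Print the URLs collected during the crawl
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())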
A more detailed example is available in example.py.
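The distinction between internal and external URLs can be sketched with the standard library alone. The example below assumes that "internal" means "same host as the base URL"; the classify_links helper is hypothetical and is not part of sitecrawl, whose actual rules (for example around subdomains or schemes) may differ.

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url, hrefs):
    """Split hrefs into (internal, external) lists relative to base_url's host."""
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links such as "/about"
        host = urlparse(absolute).netloc.lower()
        (internal if host == base_host else external).append(absolute)
    return internal, external


internal, external = classify_links(
    "http://www.gab.lc",
    ["/articles", "/contact", "https://gpgtools.org/", "http://www.gab.lc/about"],
)
print("Internal URLs:", internal)
print("External URLs:", external)
```

Comparing only the host keeps the sketch short; a stricter variant might also normalize a leading "www." or compare schemes, depending on what should count as the same site.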
Project details
File details
Details for the file sitecrawl-1.0.2.tar.gz.
File metadata
- Download URL: sitecrawl-1.0.2.tar.gz
- Upload date:
- Size: 5.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 65b75e9304fab6930ca0e0b3221d28c3351c08f00cefa7d99311fa56f3ebd5db
MD5 | a8baea517ba4b0ce873231aa115d2403
BLAKE2b-256 | 5c827ee5e2b70cd5c044f5ecafbb21d8209155f910a39c2c08052e22cd9ead7a
File details
Details for the file sitecrawl-1.0.2-py2.py3-none-any.whl.
File metadata
- Download URL: sitecrawl-1.0.2-py2.py3-none-any.whl
- Upload date:
- Size: 5.7 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | ad6a7ba46da29d73c21341529a1694d6f1c6c169c215e76d8d72caf0a0ee42f5
MD5 | 85ef2fe8094e562085a15edce9c24c9f
BLAKE2b-256 | c21db5679d4834db473e60c46cf7ad85c2be369b8845f9581bdb4326a016cdf5