Simple Python3 module to crawl a website and extract URLs
Project description
Simple Python module to crawl a website and extract URLs.
Installation
Using pip:
pip3 install sitecrawl
sitecrawl --help
Or build from sources:
# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl
# Installation
pip3 install .
Usage
CLI
sitecrawl --url https://www.yahoo.com/ --depth 2 --max 4 --verbose
->
* Found 4 internal URLs
  https://www.yahoo.com
  https://www.yahoo.com/entertainment
  https://www.yahoo.com/lifestyle
  https://www.yahoo.com/plus
* Found 5 external URLs
  https://mail.yahoo.com/
  https://news.yahoo.com/
  https://finance.yahoo.com/
  https://sports.yahoo.com/
  https://shopping.yahoo.com/
* Skipped 0 URLs
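In the output above, URLs on www.yahoo.com are reported as internal, while other yahoo.com subdomains such as mail.yahoo.com are reported as external. The following is a minimal sketch of one way such a split could be made, using only the standard library. It is not sitecrawl's actual implementation; the function name `classify_urls` and the exact-host comparison rule are assumptions inferred from the sample output.

```python
from urllib.parse import urljoin, urlparse


def classify_urls(base_url, links):
    """Split links into (internal, external) lists by comparing hosts.

    Hypothetical helper, not part of sitecrawl: a link counts as internal
    only when its host matches the host of base_url exactly.
    """
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for link in links:
        absolute = urljoin(base_url, link)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # ignore mailto:, javascript:, and similar schemes
        target = internal if parsed.netloc.lower() == base_host else external
        if absolute not in target:  # keep each URL once, in first-seen order
            target.append(absolute)
    return internal, external


internal, external = classify_urls(
    "https://www.yahoo.com/",
    ["/entertainment", "https://mail.yahoo.com/", "mailto:someone@example.com"],
)
print("Internal URLs:", internal)
print("External URLs:", external)
```

Comparing full host names is the simplest rule that agrees with the sample output; a crawler that treats every subdomain of the same registered domain as internal would need a different comparison.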
As a module
Basic example:
from sitecrawl import crawl
crawl.base_url = 'https://www.yahoo.com'
crawl.deep_crawl(depth=2)
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())
A more detailed example is available in example.py.
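To illustrate what a depth-limited crawl with a URL cap might look like, here is a small sketch that walks an in-memory link map instead of fetching real pages. It is not sitecrawl's code: `crawl_links`, `get_links`, `max_urls`, and the `SITE` dictionary are hypothetical names, and the reading of `depth` as "number of link levels followed from the start page" and of the cap as "stop once this many URLs are collected" is an assumption.

```python
from collections import deque


def crawl_links(start, get_links, depth=2, max_urls=None):
    """Breadth-first walk from start, following links up to `depth` levels.

    get_links is any callable returning the links found on a page.
    Returns the discovered URLs in the order they were first seen.
    """
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue  # pages at the depth limit are recorded but not expanded
        for link in get_links(url):
            if link in seen:
                continue  # already discovered
            if max_urls is not None and len(seen) >= max_urls:
                return seen  # cap reached, stop crawling
            seen.append(link)
            queue.append((link, level + 1))
    return seen


# Hypothetical in-memory "site": page -> links found on that page
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
}

found = crawl_links("/", lambda url: SITE.get(url, []), depth=2)
print(found)  # ['/', '/a', '/b', '/a/1', '/b/1']
```

Passing the link extractor as a callable keeps the traversal logic separate from page fetching, so the same loop can be exercised against a dictionary as shown here or against a real HTTP client.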
Download files
Source Distribution
sitecrawl-1.0.5.tar.gz
(5.4 kB)
Built Distribution
Hashes for sitecrawl-1.0.5-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 39d8e219c7e75252395ef91666626c8c4fb72657f632481af3f4e6db019f1e60
MD5 | 3afd6f6ebf96483946368b2bac7ca4f3
BLAKE2b-256 | aaf556d45dffb05dd630a0dc062859088e99f0178e3c39bcd3332814d9ff6523