Simple Python3 module to crawl a website and extract URLs
Installation
Using pip:
pip3 install sitecrawl
sitecrawl --help
Or build from sources:
# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl
# Installation
pip3 install .
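After either installation route, a quick check from Python can confirm that the package is visible to the interpreter. The sketch below relies only on the standard library's importlib.metadata (Python 3.8+). The helper name installed_version is hypothetical, and the sketch assumes the distribution is registered under the name sitecrawl, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "sitecrawl" is the distribution name assumed from the pip command.
    found = installed_version("sitecrawl")
    print("sitecrawl version:", found if found else "not installed")
```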
Usage
CLI
sitecrawl --url https://www.yahoo.com/ --depth 2 --max 4 --verbose
->
* Found 4 internal URLs
    https://www.yahoo.com
    https://www.yahoo.com/entertainment
    https://www.yahoo.com/lifestyle
    https://www.yahoo.com/plus
* Found 5 external URLs
    https://mail.yahoo.com/
    https://news.yahoo.com/
    https://finance.yahoo.com/
    https://sports.yahoo.com/
    https://shopping.yahoo.com/
* Skipped 0 URLs
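If the CLI report needs to be consumed by another script, it can be split back into lists. The following is a minimal sketch that assumes the report keeps the shape shown above ("* Found N internal URLs", "* Found N external URLs", "* Skipped N URLs", each followed by whitespace-separated URLs). The function parse_report is hypothetical and not part of sitecrawl.

```python
import re
from typing import Dict, List

# Section headers as they appear in the sample report above (assumed format).
HEADER_RE = re.compile(
    r"\*\s*(?:Found\s+\d+\s+(?P<kind>internal|external)|Skipped\s+\d+)\s+URLs"
)
URL_RE = re.compile(r"https?://\S+")


def parse_report(text: str) -> Dict[str, List[str]]:
    """Group the URLs of a sitecrawl-style report by section."""
    report: Dict[str, List[str]] = {"internal": [], "external": [], "skipped": []}
    headers = list(HEADER_RE.finditer(text))
    for index, match in enumerate(headers):
        # A header without "internal"/"external" is the "Skipped" section.
        kind = match.group("kind") or "skipped"
        # The section body runs up to the next header (or the end of the text).
        end = headers[index + 1].start() if index + 1 < len(headers) else len(text)
        report[kind].extend(URL_RE.findall(text[match.end():end]))
    return report


SAMPLE = (
    "* Found 4 internal URLs https://www.yahoo.com "
    "https://www.yahoo.com/entertainment https://www.yahoo.com/lifestyle "
    "https://www.yahoo.com/plus * Found 5 external URLs https://mail.yahoo.com/ "
    "https://news.yahoo.com/ https://finance.yahoo.com/ https://sports.yahoo.com/ "
    "https://shopping.yahoo.com/ * Skipped 0 URLs"
)

if __name__ == "__main__":
    for section, urls in parse_report(SAMPLE).items():
        print(section, len(urls))
```

Because the parser only looks for headers and URLs, it works whether the report arrives on one line or with one URL per line.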
As a module
Basic example:
from sitecrawl import crawl
crawl.base_url = 'https://www.yahoo.com'
crawl.deep_crawl(depth=2)
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())
A more detailed example is available in example.py.
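To illustrate what a depth-limited crawl with an internal/external split involves, here is a self-contained sketch using only the standard library. It is not sitecrawl's implementation: the names is_internal and crawl_links are hypothetical, pages are supplied by a caller-provided get_links function instead of real HTTP requests, a URL counts as internal when its host equals the base URL's host, and depth is taken to mean the number of page levels whose links are read. sitecrawl's own rules may differ.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple
from urllib.parse import urljoin, urlparse


def is_internal(url: str, base_url: str) -> bool:
    """Assumed rule: a URL is internal when its host matches the base URL's host."""
    return urlparse(url).netloc.lower() == urlparse(base_url).netloc.lower()


def crawl_links(
    base_url: str,
    get_links: Callable[[str], Iterable[str]],
    depth: int,
) -> Tuple[List[str], List[str]]:
    """Breadth-first walk that reads links from pages on the first `depth` levels."""
    internal: Set[str] = {base_url}
    external: Set[str] = set()
    queue = deque([(base_url, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue  # Page is known, but too deep to read its links.
        for href in get_links(url):
            link = urljoin(url, href)  # Resolve relative links against the page.
            if not is_internal(link, base_url):
                external.add(link)
            elif link not in internal:
                internal.add(link)
                queue.append((link, level + 1))
    return sorted(internal), sorted(external)


# Hypothetical in-memory site used in place of real HTTP requests.
PAGES = {
    "https://example.com/": ["/a", "https://other.example.org/x"],
    "https://example.com/a": ["/b", "/"],
    "https://example.com/b": ["/c"],
}

if __name__ == "__main__":
    found_internal, found_external = crawl_links(
        "https://example.com/", lambda u: PAGES.get(u, []), depth=2
    )
    print("Internal URLs:", found_internal)
    print("External URLs:", found_external)
```

Passing the link source as a function keeps the traversal logic testable without network access; a real fetcher would download each page and extract its anchors.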
Download files
Source Distribution
sitecrawl-1.0.5.tar.gz (5.4 kB)
Built Distribution
sitecrawl-1.0.5-py2.py3-none-any.whl (6.0 kB)
File details
Details for the file sitecrawl-1.0.5.tar.gz.
File metadata
- Download URL: sitecrawl-1.0.5.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 203417c73038f3beb7ad185985ffe7fe0aef2c6d0b88a6bfbcab886e4350eb07
MD5 | 7b46609d564f7aafe48d19210b407364
BLAKE2b-256 | 99d6cf003181dc0a933c51e8274ca964adf7d6a63b2cbe1c2c42819e4aaf0d5e
File details
Details for the file sitecrawl-1.0.5-py2.py3-none-any.whl.
File metadata
- Download URL: sitecrawl-1.0.5-py2.py3-none-any.whl
- Upload date:
- Size: 6.0 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 39d8e219c7e75252395ef91666626c8c4fb72657f632481af3f4e6db019f1e60
MD5 | 3afd6f6ebf96483946368b2bac7ca4f3
BLAKE2b-256 | aaf556d45dffb05dd630a0dc062859088e99f0178e3c39bcd3332814d9ff6523
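The SHA256 digests listed above can be used to check a downloaded file before installing it. A minimal sketch with the standard library's hashlib follows. The helper names sha256_of and matches_sha256 are hypothetical, and the example assumes the source archive has been saved in the current directory under its published file name.

```python
import hashlib
import os


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path; the digest is the SHA256 value listed for this file.
    archive = "sitecrawl-1.0.5.tar.gz"
    expected = "203417c73038f3beb7ad185985ffe7fe0aef2c6d0b88a6bfbcab886e4350eb07"
    if os.path.exists(archive):
        print("SHA256 matches:", matches_sha256(archive, expected))
    else:
        print("Archive not found:", archive)
```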