Simple Python3 module to crawl a website and extract URLs
Installation
Using pip:
pip3 install sitecrawl
sitecrawl --help
Or build from sources:
# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl
# Installation
pip3 install .
Usage
CLI
sitecrawl --url https://www.yahoo.com/ --depth 2 --max 4 --verbose
->
* Found 4 internal URLs
    https://www.yahoo.com
    https://www.yahoo.com/entertainment
    https://www.yahoo.com/lifestyle
    https://www.yahoo.com/plus
* Found 5 external URLs
    https://mail.yahoo.com/
    https://news.yahoo.com/
    https://finance.yahoo.com/
    https://sports.yahoo.com/
    https://shopping.yahoo.com/
* Skipped 0 URLs
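The output above separates internal URLs (same host as the starting URL) from external ones (other hosts, including subdomains such as mail.yahoo.com). The sketch below shows one way such a split could be made with the standard library. It is not sitecrawl's actual implementation: the function name classify_links and the rule of comparing hosts exactly are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url, hrefs):
    """Split hrefs into (internal, external) lists relative to base_url.

    Hypothetical helper: a link counts as internal only when its host
    matches the host of base_url exactly (an assumption, not sitecrawl's rule).
    """
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # ignore mailto:, javascript:, and similar schemes
        target = internal if parsed.netloc.lower() == base_host else external
        if absolute not in target:  # drop duplicates, keep first-seen order
            target.append(absolute)
    return internal, external


internal, external = classify_links(
    "https://www.yahoo.com/",
    [
        "/entertainment",
        "https://mail.yahoo.com/",
        "mailto:someone@example.com",
        "/entertainment",
    ],
)
print("Internal:", internal)
print("External:", external)
```

Under this rule a subdomain is treated as external, which matches the sample output above. A crawler could just as reasonably compare registered domains instead.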
As a module
Basic example:
from sitecrawl import crawl
crawl.base_url = 'https://www.yahoo.com'
crawl.deep_crawl(depth=2)
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())
A more detailed example is available in example.py.
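To give a feel for what a depth limit and a URL cap mean in practice, here is a minimal sketch of a bounded breadth-first walk over an in-memory link graph. It does not reproduce how sitecrawl works internally. The function bounded_crawl, its parameters, and the reading of depth as "link hops from the start page" and max_urls as "stop after this many URLs" are all assumptions, loosely named after the --depth and --max options shown earlier.

```python
from collections import deque


def bounded_crawl(links, start, depth, max_urls):
    """Breadth-first walk limited by hop count and by total URLs collected.

    links: dict mapping a URL to the URLs found on that page
           (a stand-in for real HTTP fetches; hypothetical data).
    """
    seen = [start]
    queue = deque([(start, 0)])
    while queue and len(seen) < max_urls:
        url, level = queue.popleft()
        if level >= depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append((nxt, level + 1))
                if len(seen) >= max_urls:
                    break  # URL cap reached
    return seen


# Hypothetical site structure used only for this illustration
site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/d"],
    "/c": ["/e"],
}
print(bounded_crawl(site, "/", depth=1, max_urls=10))
print(bounded_crawl(site, "/", depth=2, max_urls=4))
```

With depth=1 only the links on the start page are collected. With depth=2 and max_urls=4 the walk stops as soon as the fourth URL is recorded.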