A Python library for crawling a website and returning a sitemap.
📖 About
Sitemappy (or sitemap-py 😉) is a crawler that produces a sitemap for a given website.
Sitemappy can be used as a command-line application, and also provides Python interfaces for use as a library.
Features
- Print the URL for a given website when visited
- Print the links for a given webpage
- Visit the links for a given webpage
- Limit the links to follow on a webpage to the same single subdomain
- Concurrency (asyncio, multithreading, multiprocessing)
- Output crawling results to file by default (results too long for console)
- Modify number of async crawler workers
- Specify crawling depth
- Crawling politeness argument
- Follow HTTP redirect responses
- HTTP error response handling
- Adhere to a website's robots.txt
- "Spider Trap" resilience
- Introduce multiprocessing
- Distributed multiprocessing
- Publish to PyPi 🚀
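To illustrate the link-related features above (listing a page's links and keeping only those on the same subdomain), here is a minimal sketch using only the Python standard library. It is not Sitemappy's own code: `LinkCollector` and `same_host_links` are hypothetical names, and comparing the URL's host exactly is an assumption about what "same single subdomain" means.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_host_links(page_url: str, html: str) -> list[str]:
    """Return absolute, de-duplicated links that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).netloc
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment before comparing.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parts.netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Calling `same_host_links("https://example.com/", html)` keeps links such as `/about` and drops links to other hosts such as `https://blog.example.com/`.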
🚀 Usage
Generate a sitemap (./results.json):
sitemappy https://monzo.com/
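If you want to drive the CLI from a Python script, one option is to shell out to it. The sketch below assumes the `sitemappy` executable is on your `PATH`, that the options shown in the help output can follow `BASE_URL`, and that results land in `./results.json` as noted above. `build_command` and `crawl_to_json` are hypothetical helper names, and the structure of the JSON file is not assumed.

```python
import json
import subprocess
from pathlib import Path


def build_command(base_url, workers=None, crawl_depth=None, politeness_delay=None):
    """Build the argument list for the sitemappy CLI from its documented options."""
    cmd = ["sitemappy", base_url]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if crawl_depth is not None:
        cmd += ["--crawl-depth", str(crawl_depth)]
    if politeness_delay is not None:
        cmd += ["--politeness-delay", str(politeness_delay)]
    return cmd


def crawl_to_json(base_url, **options):
    """Run the CLI and load whatever JSON it wrote to ./results.json."""
    subprocess.run(build_command(base_url, **options), check=True)
    return json.loads(Path("results.json").read_text())
```

For example, `crawl_to_json("https://monzo.com/", workers=5, crawl_depth=2)` would run the crawl and return the parsed results.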
Help
$ sitemappy --help
usage: sitemappy [-h] BASE_URL
Sitemappy is a CLI tool to crawl a website and create a sitemap.
For more information about the tool go to https://github.com/dan-wilton/sitemappy/
Arguments:
BASE_URL a valid website URL to sitemap [required]
Options:
--workers INTEGER Number of workers to asynchronously
make web requests [default: 10]
--crawl-depth INTEGER Depth of links from base URL to follow
[default: 0 - unlimited]
--politeness-delay INTEGER Delay between each request to the website
[default: 0 - none]
--enable-cmd-out Print output to cmd
--help show this help message and exit
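To make the `--workers`, `--crawl-depth`, and `--politeness-delay` options more concrete, here is a rough sketch of how an asyncio worker pool can honour them. It is an illustration under assumptions, not Sitemappy's implementation: `SITE`, `fetch_links`, and `crawl` are hypothetical, the "website" is an in-memory dictionary rather than real HTTP, the delay is taken to be in seconds, and it is applied per worker rather than globally.

```python
import asyncio

# Hypothetical in-memory "website": each page maps to the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": [],
    "/c": ["/d"],
    "/d": [],
}


async def fetch_links(url: str) -> list[str]:
    """Stand-in for an HTTP request plus link extraction."""
    await asyncio.sleep(0)
    return SITE.get(url, [])


async def crawl(base_url, workers=10, crawl_depth=0, politeness_delay=0.0):
    """Breadth-first crawl; crawl_depth=0 means unlimited, as in the help text."""
    sitemap: dict[str, list[str]] = {}
    seen = {base_url}
    queue: asyncio.Queue = asyncio.Queue()
    queue.put_nowait((base_url, 0))

    async def worker():
        while True:
            url, depth = await queue.get()
            try:
                links = await fetch_links(url)
                sitemap[url] = links
                # Only follow links while we are above the depth limit.
                if crawl_depth == 0 or depth < crawl_depth:
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            queue.put_nowait((link, depth + 1))
                if politeness_delay:
                    await asyncio.sleep(politeness_delay)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return sitemap
```

Running `asyncio.run(crawl("/", workers=3, crawl_depth=1))` visits the base page and the pages it links to directly, while the default `crawl_depth=0` visits every reachable page.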
🎒 Requirements
Python 3.12+
Development
💻 Installation
To use the sitemappy CLI:
pip install --user -U sitemappy
Local Development / Contributing
pdm install
Python Library
Use sitemappy in your project with one of the following:
with pip:
pip install -U sitemappy
with PDM:
pdm add sitemappy
with Poetry >= 1.2.0:
poetry add sitemappy
macOS
NOTE: This is not yet enabled 😢
via homebrew:
brew install sitemappy
File details
Details for the file sitemappy_cli-0.2.0.tar.gz.
File metadata
- Download URL: sitemappy_cli-0.2.0.tar.gz
- Upload date:
- Size: 9.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: pdm/2.15.3 CPython/3.12.3 Darwin/23.5.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c02a6ad17f1e316a7f39e6d247c612d556b5eca81c66b5f6f21f7889b659331f |
| MD5 | 115e5a2f677e0f09feac53c19c297f02 |
| BLAKE2b-256 | 40994a6efdf505bcdec5ddf59e9f567c63953f48a5f63c5fef3141db95e7e6dc |
File details
Details for the file sitemappy_cli-0.2.0-py3-none-any.whl.
File metadata
- Download URL: sitemappy_cli-0.2.0-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: pdm/2.15.3 CPython/3.12.3 Darwin/23.5.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 68594ff7f467fa5c49bba5ac80b8d4a6bfb12a5eaff1eb062c06ab737593e132 |
| MD5 | 112a40b99110f13329cef8ff06e4eca5 |
| BLAKE2b-256 | 433410fa6b43602d3e70d73ebb44a41d952fd26312051d8ad59b66eccdef1bbd |