A Python library for crawling a website and returning a sitemap.
📖 About
Sitemappy (or sitemap-py 😉) is a crawler that produces a sitemap for a given website.
Sitemappy is a command-line application, and also provides Python interfaces for use as a library.
Features
- Print the URL for a given website when visited
- Print the links for a given webpage
- Visit the links for a given webpage
- Limit the links to follow on a webpage to the same single subdomain
- Concurrency (asyncio, multithreading, multiprocessing)
- Output crawling results to file by default (results too long for console)
- Modify number of async crawler workers
- Specify crawling depth
- Crawling politeness argument
- Follow HTTP redirect responses
- HTTP error response handling
- Add DEBUG, INFO and ERROR logging
- Adhere to a website's robots.txt
- "Spider Trap" resilience
- Introduce multiprocessing
- Distributed multiprocessing
- Publish to PyPI 🚀
- GitHub Workflows (deploy)
- GitHub Workflows (linting, unit testing, dev deployments)
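The "print the links for a given webpage" and "limit the links to the same single subdomain" features can be pictured with a small standard-library sketch. This is an illustration only, not sitemappy's own code: `LinkParser` and `same_host_links` are hypothetical names, and comparing hosts by exact `netloc` is an assumption about what "same single subdomain" means.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_host_links(page_url, html):
    """Return de-duplicated absolute links that share page_url's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc == host and absolute not in links:
            links.append(absolute)
    return links


page = (
    '<a href="/about">About</a> '
    '<a href="https://blog.example.com/">Blog</a> '
    '<a href="#top">Top</a>'
)
print(same_host_links("https://example.com/", page))
```

Here the link to `blog.example.com` is dropped because its host differs from the page's host, while `#top` resolves back to the page itself.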
🚀 Usage
Generate a sitemap (./results.json):
sitemappy-cli https://monzo.com/
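The same invocation could be scripted from Python. The sketch below only builds the argument list from the flags shown in the `--help` output; `build_command` is a hypothetical helper, not part of sitemappy, and placing the options before `BASE_URL` is an assumption based on the usage line.

```python
import shlex


def build_command(base_url, workers=None, crawl_depth=None,
                  politeness_delay=None, enable_cmd_out=False):
    """Build a sitemappy-cli argument list (hypothetical helper)."""
    command = ["sitemappy-cli"]
    if workers is not None:
        command += ["--workers", str(workers)]
    if crawl_depth is not None:
        command += ["--crawl-depth", str(crawl_depth)]
    if politeness_delay is not None:
        command += ["--politeness-delay", str(politeness_delay)]
    if enable_cmd_out:
        command.append("--enable-cmd-out")
    command.append(base_url)
    return command


command = build_command("https://monzo.com/", workers=5, crawl_depth=2)
print(shlex.join(command))
# To actually run it (requires sitemappy-cli to be installed):
#   import subprocess; subprocess.run(command, check=True)
```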
Help
$ sitemappy-cli --help
usage: sitemappy-cli [-h] BASE_URL
Sitemappy is a CLI tool to crawl a website and create a sitemap.
For more information about the tool go to https://github.com/dan-wilton/sitemappy/
Arguments:
  BASE_URL  a valid website URL to sitemap [required]
Options:
  --workers INTEGER           Number of workers to asynchronously make web requests [default: 10]
  --crawl-depth INTEGER       Depth of links from base URL to follow [default: 0 - unlimited]
  --politeness-delay INTEGER  Delay between each request to the website [default: 0 - none]
  --enable-cmd-out            Print output to cmd
  --help                      show this help message and exit
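To make the `--workers`, `--crawl-depth` and `--politeness-delay` options more concrete, here is a minimal asyncio sketch of a worker pool crawling a made-up, in-memory site. It is not sitemappy's implementation: `SITE`, `fetch_links` and `crawl` are hypothetical, the delay is assumed to be in seconds and applied per worker, and "depth" is assumed to count link hops from the base URL.

```python
import asyncio

# Hypothetical in-memory "website": page -> links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/c"],
    "/c": ["/d"],
    "/d": [],
}


async def fetch_links(url):
    """Stand-in for an HTTP request followed by link extraction."""
    await asyncio.sleep(0)
    return SITE.get(url, [])


async def crawl(base_url, workers=10, crawl_depth=0, politeness_delay=0):
    """Crawl from base_url; crawl_depth=0 means unlimited depth."""
    sitemap = {}
    seen = {base_url}
    queue = asyncio.Queue()
    queue.put_nowait((base_url, 0))

    async def worker():
        while True:
            url, depth = await queue.get()
            try:
                if politeness_delay:
                    await asyncio.sleep(politeness_delay)
                links = await fetch_links(url)
                sitemap[url] = links
                # Only queue new pages while still inside the depth limit.
                if crawl_depth == 0 or depth < crawl_depth:
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            queue.put_nowait((link, depth + 1))
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # wait until every queued page has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return sitemap


print(sorted(asyncio.run(crawl("/", workers=3, crawl_depth=1))))
```

The `seen` set keeps a page from being queued twice, which is also the simplest guard against loops such as `/a` linking back to `/`.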
🎒 Requirements
Python 3.12+
Development
💻 Installation
To use the sitemappy CLI:
pip install --user -U sitemappy-cli
Local Development / Contributing
pdm install
Python Library
Use sitemappy in your project with one of the following:
with pip:
pip install -U sitemappy-cli
with PDM:
pdm add sitemappy-cli
with Poetry >= 1.2.0:
poetry add sitemappy-cli
macOS
NOTE: This is not yet enabled 😢
via homebrew:
brew install sitemappy-cli
Download files
File details
Details for the file sitemappy_cli-1.0.3.tar.gz.
File metadata
- Download URL: sitemappy_cli-1.0.3.tar.gz
- Upload date:
- Size: 9.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: pdm/2.15.4 CPython/3.10.12 Linux/6.5.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a18e6948c7cf6ff95cd50398110929667b7f85b7382e943a9fde71581c85a4cd |
| MD5 | 3ceba0dff9058fe953b4bb154d72178b |
| BLAKE2b-256 | c8068a71c47b19725debaf1a3c4b9407fc6f4ea83965feab9e5bcd14ddf68d7e |
File details
Details for the file sitemappy_cli-1.0.3-py3-none-any.whl.
File metadata
- Download URL: sitemappy_cli-1.0.3-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: pdm/2.15.4 CPython/3.10.12 Linux/6.5.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e1c03fbf1debecb55cd39c60b75ce0424b1a96d754b0442313bc8ba409beab69 |
| MD5 | 94fc1196b2fde01e02dccbf0839f6b84 |
| BLAKE2b-256 | fa6878183614c10ac9f51fec4ebc6607de1b2c7d0d36c240ad49d9327a22e1b3 |
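A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. This is a minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the local path assumes the wheel was saved to the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Assumed local download location; adjust to wherever the file was saved.
wheel = Path("sitemappy_cli-1.0.3-py3-none-any.whl")
expected = "e1c03fbf1debecb55cd39c60b75ce0424b1a96d754b0442313bc8ba409beab69"
if wheel.exists():
    print("digest OK" if matches(wheel, expected) else "digest MISMATCH")
```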