sitewalker
Crawl a website and create a structured map of its pages.
Installation
pipx install sitewalker
Usage
# Map all pages on a site (single-level crawl)
sitewalker example.com
# Recursive crawl of all internal pages
sitewalker example.com -r
# Collect external links
sitewalker example.com -e
# Collect external links and check their HTTP status
sitewalker example.com -e --check-external
# Recursive crawl with external link collection
sitewalker example.com -r -e
# Only crawl web pages (skip images, PDFs, etc.)
sitewalker example.com -r -p
# Crawl an HTTP-only site (e.g., LAN staging server)
sitewalker http://staging.lan --allow-private
# Verbose output for debugging
sitewalker example.com -r -v
The target can be a bare domain (example.com) or a full URL (http://example.com). Bare domains default to HTTPS; if the HTTPS connection fails, sitewalker exits with a message asking you to provide the full URL.
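The target handling described above can be sketched in a few lines. This is only an illustration, not sitewalker's actual code: the function name normalize_target and the rejection of non-HTTP schemes are assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize_target(target: str) -> str:
    """Return a full URL, defaulting bare domains to HTTPS (hypothetical helper)."""
    target = target.strip()
    if "://" not in target:
        # Bare domain such as "example.com": assume HTTPS.
        target = "https://" + target
    parts = urlsplit(target)
    # Assumption: only http and https targets make sense for a web crawler.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported target: {target!r}")
    return target


print(normalize_target("example.com"))
print(normalize_target("http://staging.lan"))
```

A full URL passes through unchanged, which is why an HTTP-only host has to be given with its scheme.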
Options
| Flag | Description | Default |
|---|---|---|
| -r, --recursive | Recursively crawl internal links | Off |
| -e, --external-links | Collect external links | Off |
| --check-external | Check HTTP status of external links (requires -e) | Off |
| -p, --pages | Only crawl web pages (HTML, PHP, etc.) | Off |
| -v, --verbose | Enable verbose/debug output | Off |
| -t, --timeout | Request timeout in seconds | 30 |
| --max-pages | Maximum number of pages to crawl | 1000 |
| --max-depth | Maximum link distance from start URL (BFS) | 10 |
| --delay | Delay between requests in seconds (use 0 for local) | 1.0 |
| --allow-private | Allow crawling domains that resolve to private IPs | Off |
| --ignore-robots | Ignore robots.txt rules | Off |
Output
Results are saved to a CSV file named {domain}_{timestamp}.csv with columns:
- URL: the page URL
- Title: the page's <title> tag content
- Status Code: HTTP response status
When using -e, external links are additionally saved to {domain}_{timestamp}_external_links.csv. The internal pages CSV is always generated. With --check-external, the external links CSV includes a Status Code column.
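The internal pages CSV can be post-processed with Python's csv module. The sketch below assumes the header row uses exactly the column names listed above and that every Status Code cell holds an integer; the sample rows are made up for illustration and do not come from a real crawl.

```python
import csv
import io

# Made-up sample in the documented column layout; a real run writes a file.
SAMPLE = """URL,Title,Status Code
https://example.com/,Home,200
https://example.com/old,Not Found,404
https://example.com/about,About,200
"""


def broken_pages(csv_file):
    """Return (url, status) pairs for rows whose status is not 2xx."""
    rows = csv.DictReader(csv_file)
    return [
        (row["URL"], int(row["Status Code"]))
        for row in rows
        if not 200 <= int(row["Status Code"]) < 300
    ]


print(broken_pages(io.StringIO(SAMPLE)))
```

For a real output file, pass an open file object instead of the StringIO wrapper, for example one opened with open(path, newline="").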
Security
- SSRF protection: Domains that resolve to private/reserved IP addresses are blocked by default. Use --allow-private to override for legitimate internal use.
- robots.txt: Respected by default. Use --ignore-robots to override.
- CSV injection: Output values are sanitized to prevent spreadsheet formula injection.
- Crawl limits: Recursive crawls are bounded by --max-pages and --max-depth to prevent resource exhaustion.
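Two of the protections listed above, the private-address check and CSV cell sanitization, can be approximated with the standard library. This is a sketch under assumptions, not sitewalker's implementation: the helper names, the exact set of address categories treated as blocked, and the quote-prefix escaping strategy are all chosen for illustration.

```python
import ipaddress
import socket


def is_blocked_ip(ip_text: str) -> bool:
    """True if the address is private, loopback, link-local, or reserved."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved


def host_is_blocked(host: str) -> bool:
    """Resolve a host name and check every address it maps to (needs DNS)."""
    infos = socket.getaddrinfo(host, None)
    return any(is_blocked_ip(info[4][0]) for info in infos)


def sanitize_cell(value: str) -> str:
    """Prefix formula-like cells with a quote so spreadsheets treat them as text."""
    # Assumption: these leading characters are the ones treated as risky.
    if value.startswith(("=", "+", "-", "@", "\t", "\r")):
        return "'" + value
    return value


print(is_blocked_ip("192.168.1.10"), is_blocked_ip("8.8.8.8"))
print(sanitize_cell("=SUM(A1:A9)"), sanitize_cell("Home"))
```

Checking every resolved address, rather than only the first, matters because a host name can map to several addresses at once.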
Roadmap
- --format json: JSON output format
- --images --check-alt: image inventory with alt text auditing
License
MIT
File details
Details for the file sitewalker-0.3.1.tar.gz.
File metadata
- Download URL: sitewalker-0.3.1.tar.gz
- Upload date:
- Size: 7.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.12.3 Linux/6.8.0-107-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1762e1f7df67c56a41b0b00efbad0348f3bd8839dd8d24475016c3b8c12e050d |
| MD5 | 7b341292a7626d2bd422aa17776342fa |
| BLAKE2b-256 | 2c583c036e830cdc7c461a101e6817cf3c35806217df5f3e6f38d9b9811dacc4 |
File details
Details for the file sitewalker-0.3.1-py3-none-any.whl.
File metadata
- Download URL: sitewalker-0.3.1-py3-none-any.whl
- Upload date:
- Size: 9.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.12.3 Linux/6.8.0-107-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ec2d9566d69328f4178278b360cdd3d2dd5605cf16af64955d2259ed50e8791 |
| MD5 | bdd27539aa6c3708ae5287ed60806c1d |
| BLAKE2b-256 | ababce5eb8117b65bc697cb5aab00918c9d7f5d8aa24fe3f8b854fdcf7d5b80c |