sitewalker

Crawl a website and create a structured map of its pages.

Installation

pipx install sitewalker

Usage

# Map all pages on a site (single-level crawl)
sitewalker example.com

# Recursive crawl of all internal pages
sitewalker example.com -r

# Collect external links
sitewalker example.com -e

# Collect external links and check their HTTP status
sitewalker example.com -e --check-external

# Recursive crawl with external link collection
sitewalker example.com -r -e

# Only crawl web pages (skip images, PDFs, etc.)
sitewalker example.com -r -p

# Crawl an HTTP-only site (e.g., LAN staging server)
sitewalker http://staging.lan --allow-private

# Verbose output for debugging
sitewalker example.com -r -v

The target can be a bare domain (example.com) or a full URL (http://example.com). Bare domains default to HTTPS; if the HTTPS connection fails, sitewalker exits with a message asking for the full URL.
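The target handling described above can be sketched in a few lines of Python. This is an illustration only, not sitewalker's actual code: the function name `normalize_target` is hypothetical, and the sketch covers only the "bare domain defaults to HTTPS" rule, not the connection attempt or the exit message.

```python
from urllib.parse import urlsplit


def normalize_target(target: str) -> str:
    """Return a full URL, adding https:// when the target is a bare domain."""
    target = target.strip()
    if "://" not in target:
        # Bare domain: assume HTTPS, as the paragraph above describes.
        target = "https://" + target
    parts = urlsplit(target)
    # Assumption: only http and https targets make sense for a web crawler.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported target: {target!r}")
    return target


print(normalize_target("example.com"))         # https://example.com
print(normalize_target("http://staging.lan"))  # http://staging.lan
```

An explicit http:// URL is passed through unchanged, which is why an HTTP-only site needs the full URL on the command line.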

Options

Flag                  Description                                          Default
-r, --recursive       Recursively crawl internal links                     Off
-e, --external-links  Collect external links                               Off
--check-external      Check HTTP status of external links (requires -e)    Off
-p, --pages           Only crawl web pages (HTML, PHP, etc.)               Off
-v, --verbose         Enable verbose/debug output                          Off
-t, --timeout         Request timeout in seconds                           30
--max-pages           Maximum number of pages to crawl                     1000
--max-depth           Maximum link distance from start URL (BFS)           10
--delay               Delay between requests in seconds (use 0 for local)  1.0
--allow-private       Allow crawling domains that resolve to private IPs   Off
--ignore-robots       Ignore robots.txt rules                              Off
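To show how --max-pages, --max-depth and --delay might interact, here is a minimal breadth-first sketch. It is not sitewalker's implementation: `bounded_bfs` and `get_links` are hypothetical names, and the link graph is an in-memory dictionary standing in for real HTTP requests.

```python
import time
from collections import deque


def bounded_bfs(start, get_links, max_pages=1000, max_depth=10, delay=1.0):
    """Visit pages breadth-first, stopping at max_pages or max_depth."""
    crawled = []
    seen = {start}
    queue = deque([(start, 0)])  # (url, link distance from start)
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        if crawled and delay > 0:
            time.sleep(delay)  # pause between requests
        crawled.append(url)  # a real crawler would fetch the page here
        if depth == max_depth:
            continue  # at the depth limit: do not follow this page's links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled


# Hypothetical link graph used only for this illustration.
links = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": ["/d"]}
get_links = lambda url: links.get(url, [])

print(bounded_bfs("/", get_links, delay=0))               # ['/', '/a', '/b', '/c', '/d']
print(bounded_bfs("/", get_links, max_depth=1, delay=0))  # ['/', '/a', '/b']
print(bounded_bfs("/", get_links, max_pages=2, delay=0))  # ['/', '/a']
```

Because the queue is processed in breadth-first order, the depth stored with each URL is its shortest link distance from the start page, which matches the "(BFS)" note on --max-depth.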

Output

Results are saved to a CSV file named {domain}_{timestamp}.csv with columns:

  • URL: the page URL
  • Title: the page's <title> tag content
  • Status Code: HTTP response status

When using -e, external links are additionally saved to {domain}_{timestamp}_external_links.csv. The internal pages CSV is always generated. With --check-external, the external links CSV includes a Status Code column.
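The internal pages CSV can be post-processed with Python's standard csv module. The sketch below lists pages whose status is not 200. It assumes the file starts with a header row using exactly the column names listed above and is comma-delimited; the sample rows and the `non_ok_pages` helper are made up for illustration.

```python
import csv
import io


def non_ok_pages(csv_file):
    """Yield (url, status) for rows whose Status Code is not 200."""
    for row in csv.DictReader(csv_file):
        if row["Status Code"] != "200":
            yield row["URL"], row["Status Code"]


# Invented sample data standing in for a real {domain}_{timestamp}.csv file.
sample = io.StringIO(
    "URL,Title,Status Code\n"
    "https://example.com/,Home,200\n"
    "https://example.com/old,Not Found,404\n"
)
print(list(non_ok_pages(sample)))  # [('https://example.com/old', '404')]
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.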

Security

  • SSRF protection: Domains that resolve to private/reserved IP addresses are blocked by default. Use --allow-private to override for legitimate internal use.
  • robots.txt: Respected by default. Use --ignore-robots to override.
  • CSV injection: Output values are sanitized to prevent spreadsheet formula injection.
  • Crawl limits: Recursive crawls are bounded by --max-pages and --max-depth to prevent resource exhaustion.
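Two of the protections above, the private-address check and CSV sanitization, can be sketched as follows. This shows one common approach and is not taken from sitewalker's source: the function names are hypothetical, the set of blocked address classes is an assumption, and the single-quote prefix follows widely used CSV injection guidance rather than anything stated in this README.

```python
import ipaddress
import socket


def is_blocked_ip(address: str) -> bool:
    """True if the address is private, loopback, link-local, or otherwise reserved."""
    # Drop any IPv6 zone suffix such as "%eth0" before parsing.
    ip = ipaddress.ip_address(address.split("%")[0])
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolves_to_private(hostname: str) -> bool:
    """True if any address the hostname resolves to is blocked."""
    infos = socket.getaddrinfo(hostname, None)
    return any(is_blocked_ip(info[4][0]) for info in infos)


def sanitize_cell(value: str) -> str:
    """Prefix a single quote when a cell starts with a formula trigger character."""
    if value and value[0] in ("=", "+", "-", "@", "\t", "\r"):
        return "'" + value
    return value


print(is_blocked_ip("192.168.1.10"))     # True
print(is_blocked_ip("8.8.8.8"))          # False
print(sanitize_cell("=SUM(A1:A9)"))      # '=SUM(A1:A9)
print(sanitize_cell("Home"))             # Home
```

In this sketch, a crawler would call `resolves_to_private` on the target host before the first request and skip the check only when --allow-private is given.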

Roadmap

  • --format json: JSON output format
  • --images --check-alt: image inventory with alt text auditing

License

MIT
