sitewalker

Crawl a website and create a structured map of its pages.

Installation

pipx install sitewalker

Usage

# Map all pages on a site (single-level crawl)
sitewalker example.com

# Recursive crawl of all internal pages
sitewalker example.com -r

# Collect external links
sitewalker example.com -e

# Collect external links and check their HTTP status
sitewalker example.com -e --check-external

# Recursive crawl with external link collection
sitewalker example.com -r -e

# Only crawl web pages (skip images, PDFs, etc.)
sitewalker example.com -r -p

# Crawl an HTTP-only site on a private network (e.g., LAN staging server)
sitewalker http://staging.lan --allow-private

# Verbose output for debugging
sitewalker example.com -r -v

The target accepts a bare domain (example.com) or a full URL (http://example.com). Bare domains default to HTTPS; if the HTTPS connection fails, sitewalker exits with a message asking you to provide the full URL.
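The target handling described above can be sketched roughly as follows. This is a minimal illustration, not sitewalker's actual code: the function name `normalize_target` and the rejection of non-HTTP schemes are assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize_target(target: str) -> str:
    """Turn a bare domain into an HTTPS URL; pass a full URL through unchanged.

    Hypothetical helper: sitewalker's real implementation may differ.
    """
    target = target.strip()
    if "://" in target:
        scheme = urlsplit(target).scheme.lower()
        # Assumption: only http and https targets make sense for a crawler.
        if scheme not in ("http", "https"):
            raise ValueError(f"unsupported scheme: {scheme}")
        return target
    # Bare domains default to HTTPS, as the README states.
    return f"https://{target}"


print(normalize_target("example.com"))
print(normalize_target("http://staging.lan"))
```

The sketch never falls back from HTTPS to HTTP on its own, which matches the documented behavior of exiting and asking for a full URL.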

Options

| Flag | Description | Default |
|---|---|---|
| -r, --recursive | Recursively crawl internal links | Off |
| -e, --external-links | Collect external links | Off |
| --check-external | Check HTTP status of external links (requires -e) | Off |
| -p, --pages | Only crawl web pages (HTML, PHP, etc.) | Off |
| -v, --verbose | Enable verbose/debug output | Off |
| -t, --timeout | Request timeout in seconds | 30 |
| --max-pages | Maximum number of pages to crawl | 1000 |
| --max-depth | Maximum link distance from start URL (BFS) | 10 |
| --allow-private | Allow crawling domains that resolve to private IPs | Off |
| --ignore-robots | Ignore robots.txt rules | Off |
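To illustrate how --max-pages and --max-depth might interact in a breadth-first crawl, here is a small sketch over an in-memory link graph. The names `bounded_bfs` and `get_links` are hypothetical, and the exact boundary semantics (whether pages at exactly the maximum depth are visited) are an assumption, not something the README specifies.

```python
from collections import deque
from typing import Callable, Iterable


def bounded_bfs(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 1000,
    max_depth: int = 10,
) -> list[tuple[str, int]]:
    """Visit pages breadth-first, returning (url, depth) in visit order."""
    seen = {start}
    queue = deque([(start, 0)])
    visited: list[tuple[str, int]] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        # Assumption: pages at max_depth are visited but not expanded further.
        if depth >= max_depth:
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy site: each key links to the pages in its list.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": ["/d"], "/d": []}
print(bounded_bfs("/", lambda url: site.get(url, []), max_depth=1))
```

Because the queue is first-in, first-out, the depth recorded for each page is its shortest link distance from the start URL, which is what a BFS-based depth limit measures.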

Output

Results are saved to a CSV file named {domain}_{timestamp}.csv with columns:

  • URL — the page URL
  • Title — the page's <title> tag content
  • Status Code — HTTP response status

When using -e, external links are additionally saved to {domain}_{timestamp}_external_links.csv. The internal pages CSV is always generated. With --check-external, the external links CSV includes a Status Code column.
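As one way to consume the output, the following sketch reads an internal pages CSV and lists rows with an error status. It assumes the file has a header row whose names match the columns listed above exactly (URL, Title, Status Code); the helper name `broken_pages` and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Iterable


def broken_pages(csv_file: Iterable[str]) -> list[tuple[str, int]]:
    """Return (url, status) for rows whose Status Code is 400 or higher."""
    results = []
    for row in csv.DictReader(csv_file):
        status = (row.get("Status Code") or "").strip()
        # Skip rows without a numeric status (e.g., failed requests).
        if status.isdigit() and int(status) >= 400:
            results.append((row["URL"], int(status)))
    return results


# Made-up sample data standing in for a {domain}_{timestamp}.csv file.
sample = io.StringIO(
    "URL,Title,Status Code\n"
    "https://example.com/,Home,200\n"
    "https://example.com/old,Not Found,404\n"
)
print(broken_pages(sample))
```

With a real file, pass an open file object instead, for example `open(path, newline="", encoding="utf-8")`.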

Security

  • SSRF protection: Domains that resolve to private/reserved IP addresses are blocked by default. Use --allow-private to override for legitimate internal use.
  • robots.txt: Respected by default. Use --ignore-robots to override.
  • CSV injection: Output values are sanitized to prevent spreadsheet formula injection.
  • Crawl limits: Recursive crawls are bounded by --max-pages and --max-depth to prevent resource exhaustion.
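Two of the protections above can be sketched with the standard library. This is an illustration of the general techniques, not sitewalker's implementation: the function names are hypothetical, DNS resolution is left out so the example runs offline, and the quote-prefix approach is one common mitigation for formula injection that sitewalker may or may not use.

```python
import ipaddress

# Characters that spreadsheet applications may treat as the start of a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def is_blocked_address(ip: str) -> bool:
    """Return True if an already-resolved IP is private, reserved, or local."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_reserved
        or addr.is_loopback
        or addr.is_link_local
    )


def sanitize_cell(value: str) -> str:
    """Prefix risky values with a single quote so they are read as text."""
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


print(is_blocked_address("192.168.1.10"))
print(is_blocked_address("8.8.8.8"))
print(sanitize_cell("=HYPERLINK(\"http://evil.example\")"))
```

A real check would resolve the hostname first and test every returned address, since one domain can map to several IPs.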

Roadmap

  • --format json — JSON output format
  • --images --check-alt — image inventory with alt text auditing

License

MIT
