sitewalker

Crawl a website and create a structured map of its pages.

Installation

pipx install sitewalker

Usage

# Map all pages on a site (single-level crawl)
sitewalker example.com

# Recursive crawl of all internal pages
sitewalker example.com -r

# Collect external links
sitewalker example.com -e

# Recursive crawl with external link collection
sitewalker example.com -r -e

# Only crawl web pages (skip images, PDFs, etc.)
sitewalker example.com -r -p

# Crawl an HTTP-only site on a private network (e.g., LAN staging server)
sitewalker http://staging.lan --allow-private

# Verbose output for debugging
sitewalker example.com -r -v

The target can be a bare domain (example.com) or a full URL (http://example.com). Bare domains default to HTTPS. If the HTTPS connection fails, sitewalker exits with a message asking you to provide the full URL.
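
The target handling described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior, not sitewalker's actual code; the function name normalize_target is hypothetical, and rejecting schemes other than http and https is an assumption.

```python
from urllib.parse import urlsplit


def normalize_target(target: str) -> str:
    """Return a full URL, defaulting bare domains to HTTPS.

    Hypothetical helper: it mirrors the behavior the README describes,
    not sitewalker's real implementation.
    """
    target = target.strip()
    if "://" in target:
        # A target that already names a scheme is used as given.
        scheme = urlsplit(target).scheme.lower()
        if scheme not in ("http", "https"):
            # Assumption: only web schemes make sense for a crawler.
            raise ValueError(f"unsupported scheme: {scheme}")
        return target
    # Bare domain: assume HTTPS.
    return f"https://{target}"
```

With this sketch, normalize_target("example.com") yields an HTTPS URL, while normalize_target("http://staging.lan") is returned unchanged, which is why an HTTP-only site needs the full URL.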

Options

Flag                   Description                                          Default
-r, --recursive        Recursively crawl internal links                     Off
-e, --external-links   Collect external links                               Off
-p, --pages            Only crawl web pages (HTML, PHP, etc.)               Off
-v, --verbose          Enable verbose/debug output                          Off
-t, --timeout          Request timeout in seconds                           30
--max-pages            Maximum number of pages to crawl                     1000
--max-depth            Maximum crawl depth for recursive mode               10
--allow-private        Allow crawling domains that resolve to private IPs   Off
--ignore-robots        Ignore robots.txt rules                              Off
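
To show how limits like --max-pages and --max-depth can bound a recursive crawl, here is a minimal breadth-first sketch. It is not sitewalker's implementation: bounded_crawl and the fetch_links callback are hypothetical names, and treating the start page as depth 0 is an assumption.

```python
from collections import deque
from typing import Callable, Iterable


def bounded_crawl(
    start: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_pages: int = 1000,
    max_depth: int = 10,
) -> list[str]:
    """Breadth-first traversal that stops at max_pages or max_depth.

    fetch_links is a hypothetical callback that returns the internal
    links found on a page; the start page counts as depth 0 (assumption).
    """
    visited: list[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing a dictionary lookup as fetch_links (for example, lambda url: graph.get(url, [])) makes the sketch easy to try without any network access.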

Output

Results are saved to a CSV file named {domain}_{timestamp}.csv with columns:

  • URL — the page URL
  • Title — the page's <title> tag content
  • Status Code — HTTP response status

When using -e, external links are saved to a separate {domain}_{timestamp}_external_links.csv.
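
As a sketch of how the output might be consumed, the snippet below filters a results file for pages that did not return 200. It assumes the CSV has a header row whose column names match the list above exactly (URL, Title, Status Code); the README does not confirm the header format, and non_ok_rows is a hypothetical helper.

```python
import csv
from typing import Iterable


def non_ok_rows(lines: Iterable[str]) -> list[dict[str, str]]:
    """Return rows whose Status Code is not 200.

    Assumption: the first line is a header naming the columns
    "URL", "Title", and "Status Code".
    """
    reader = csv.DictReader(lines)
    return [row for row in reader if row["Status Code"] != "200"]
```

To use it on a real file, open the CSV with open(path, newline="", encoding="utf-8") and pass the file object to non_ok_rows.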

Security

  • SSRF protection: Domains that resolve to private/reserved IP addresses are blocked by default. Use --allow-private to override for legitimate internal use.
  • robots.txt: Respected by default. Use --ignore-robots to override.
  • CSV injection: Output values are sanitized to prevent spreadsheet formula injection.
  • Crawl limits: Recursive crawls are bounded by --max-pages and --max-depth to prevent resource exhaustion.
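
Two of the protections listed above can be illustrated with the standard library. This is a sketch under stated assumptions rather than sitewalker's actual code: the function names are hypothetical, the set of blocked address classes is a guess at what "private/reserved" covers, and prefixing risky cells with an apostrophe is one common mitigation for formula injection, not necessarily the one sitewalker uses.

```python
import ipaddress
import socket

# Leading characters that spreadsheets may interpret as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def is_blocked_address(ip_text: str) -> bool:
    """True if the IP is private, loopback, link-local, or reserved."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved


def resolves_to_blocked_address(hostname: str) -> bool:
    """Resolve hostname and check every address it maps to."""
    infos = socket.getaddrinfo(hostname, None)
    # info[4][0] is the address string; drop any IPv6 scope id ("%eth0").
    return any(is_blocked_address(info[4][0].split("%")[0]) for info in infos)


def sanitize_csv_cell(value: str) -> str:
    """Neutralize values that a spreadsheet could run as a formula."""
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value
```

Checking every resolved address, rather than only the first, matters because a hostname can map to several IPs and any one of them could point into a private network.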

Roadmap

  • --format json — JSON output format
  • --check-links — broken link detection
  • --images --check-alt — image inventory with alt text auditing

License

MIT
