sitewalker

Crawl a website and create a structured map of its pages.

Installation

pipx install sitewalker

Usage

# Map all pages on a site (single-level crawl)
sitewalker example.com

# Recursive crawl of all internal pages
sitewalker example.com -r

# Collect external links
sitewalker example.com -e

# Recursive crawl with external link collection
sitewalker example.com -r -e

# Only crawl web pages (skip images, PDFs, etc.)
sitewalker example.com -r -p

# Crawl an HTTP-only site on a private network (e.g., a LAN staging server)
sitewalker http://staging.lan --allow-private

# Verbose output for debugging
sitewalker example.com -r -v

The target can be a bare domain (example.com) or a full URL (http://example.com). Bare domains default to HTTPS; if the HTTPS connection fails, sitewalker exits with a message asking you to provide the full URL.
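
The target handling described above can be sketched in Python as follows. This is a minimal illustration, not sitewalker's actual code: the function name normalize_target is hypothetical, and rejecting non-HTTP(S) schemes is an assumption.

```python
from urllib.parse import urlsplit


def normalize_target(target: str) -> str:
    """Return a full URL, defaulting bare domains to HTTPS (hypothetical helper)."""
    target = target.strip()
    if "://" not in target:
        # Bare domain such as "example.com": assume HTTPS.
        target = "https://" + target
    parts = urlsplit(target)
    # Assumption: only http and https targets are crawlable.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a crawlable HTTP(S) target: {target!r}")
    return target


print(normalize_target("example.com"))         # https://example.com
print(normalize_target("http://staging.lan"))  # http://staging.lan
```

An explicit http:// prefix is kept as given, which matches the advice to pass the full URL for HTTP-only sites.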

Options

Flag                  Description                                          Default
-r, --recursive       Recursively crawl internal links                     Off
-e, --external-links  Collect external links                               Off
-p, --pages           Only crawl web pages (HTML, PHP, etc.)               Off
-v, --verbose         Enable verbose/debug output                          Off
-t, --timeout         Request timeout in seconds                           30
--max-pages           Maximum number of pages to crawl                     1000
--max-depth           Maximum crawl depth for recursive mode               10
--allow-private       Allow crawling domains that resolve to private IPs   Off
--ignore-robots       Ignore robots.txt rules                              Off
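
To illustrate how limits such as --max-pages and --max-depth can bound a recursive crawl, here is a small breadth-first sketch. It is not taken from sitewalker's source: the function bounded_crawl, the get_links callback, and the choice to count the start page as depth 0 are all assumptions, and a toy in-memory link graph stands in for real HTTP requests.

```python
from collections import deque
from typing import Callable, Iterable


def bounded_crawl(start: str,
                  get_links: Callable[[str], Iterable[str]],
                  max_pages: int = 1000,
                  max_depth: int = 10) -> list[str]:
    """Breadth-first crawl that stops at max_pages pages or max_depth levels."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link in seen:
                continue
            if len(visited) >= max_pages:
                return visited  # page budget exhausted
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real HTTP fetches.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": ["/d"], "/d": []}
print(bounded_crawl("/", lambda u: site.get(u, []), max_depth=2))  # ['/', '/a', '/b', '/c']
print(bounded_crawl("/", lambda u: site.get(u, []), max_pages=3))  # ['/', '/a', '/b']
```

The seen set keeps a page from being queued twice, so cycles in the link graph (such as /b linking back to /) do not cause repeated visits.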

Output

Results are saved to a CSV file named {domain}_{timestamp}.csv with columns:

  • URL — the page URL
  • Title — the page's <title> tag content
  • Status Code — HTTP response status

When using -e, external links are saved to a separate file named {domain}_{timestamp}_external_links.csv.
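
As an example of working with the output, the following sketch counts pages per status code using Python's csv module. It assumes the file starts with a header row containing exactly the column names listed above and uses commas as the delimiter; the helper name status_counts and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def status_counts(csv_file) -> Counter:
    """Count rows per HTTP status code in a sitewalker-style CSV."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Status Code"] for row in reader)


# Made-up sample rows using the documented column names.
sample = io.StringIO(
    "URL,Title,Status Code\n"
    "https://example.com/,Home,200\n"
    "https://example.com/about,About,200\n"
    "https://example.com/old,Not Found,404\n"
)
counts = status_counts(sample)
print(counts["200"], counts["404"])  # 2 1
```

For a real result file, pass an open file object instead of the StringIO sample, for example open(path, newline="").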

Security

  • SSRF protection: Domains that resolve to private/reserved IP addresses are blocked by default. Use --allow-private to override for legitimate internal use.
  • robots.txt: Respected by default. Use --ignore-robots to override.
  • CSV injection: Output values are sanitized to prevent spreadsheet formula injection.
  • Crawl limits: Recursive crawls are bounded by --max-pages and --max-depth to prevent resource exhaustion.
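
Two of the protections above can be sketched with the Python standard library. This is an illustration of the general techniques, not sitewalker's implementation: the function names are hypothetical, the exact set of blocked address ranges is an assumption, and prefixing risky cells with a single quote is one common mitigation for formula injection rather than a confirmed detail of the tool.

```python
import ipaddress
import socket


def is_blocked_ip(address: str) -> bool:
    """True if the IP is private, loopback, link-local, or otherwise reserved."""
    # Drop an IPv6 zone suffix such as "%eth0" before parsing.
    ip = ipaddress.ip_address(address.split("%")[0])
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def host_is_blocked(hostname: str) -> bool:
    """Resolve hostname and report whether any resolved address is blocked."""
    infos = socket.getaddrinfo(hostname, None)
    return any(is_blocked_ip(info[4][0]) for info in infos)


def sanitize_cell(value: str) -> str:
    """Prefix cells that a spreadsheet might evaluate as a formula."""
    if value and value[0] in ("=", "+", "-", "@", "\t", "\r"):
        return "'" + value
    return value


print(is_blocked_ip("10.0.0.5"))   # True
print(is_blocked_ip("8.8.8.8"))    # False
print(sanitize_cell("=1+1"))       # '=1+1
print(sanitize_cell("Home"))       # Home
```

Checking every resolved address, rather than only the first, avoids a hostname slipping through because one of its records happens to be public.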

Roadmap

  • --format json — JSON output format
  • --check-links — broken link detection
  • --images --check-alt — image inventory with alt text auditing

License

MIT

Download files

Source Distribution

sitewalker-0.2.0.tar.gz (6.8 kB)

Uploaded Source

Built Distribution

sitewalker-0.2.0-py3-none-any.whl (8.6 kB)

Uploaded Python 3

File details

Details for the file sitewalker-0.2.0.tar.gz.

File metadata

  • Download URL: sitewalker-0.2.0.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.12.3 Linux/6.8.0-107-generic

File hashes

Hashes for sitewalker-0.2.0.tar.gz
Algorithm Hash digest
SHA256 934d571825bf40741fc8b80bf19353ddad08d3503b0436320013f2f879bdc94d
MD5 f70c8b8623898532a80f39d3b642772b
BLAKE2b-256 891d361cf8cf55d9e0921e1eb768a808ddb3f5c26e59509f591f4b4d285a10a2

File details

Details for the file sitewalker-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: sitewalker-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 8.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.12.3 Linux/6.8.0-107-generic

File hashes

Hashes for sitewalker-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3b56b5ba4ffb11d53778deb96e5a169e005b94b91965152a5d7180504077b5a5
MD5 f0362c05ba92ad1cf5999ea39f4f8981
BLAKE2b-256 39d2d9ad048a976cadfab66a2bf0c97b97dcfed6baa02a3239c84910cdba99ca
