CLI tool to fetch URLs from sitemap.xml, check their existence, and generate performance reports

Siteprobe

Siteprobe is a Rust-based CLI tool that fetches all URLs from a given sitemap.xml URL, checks that each one exists, and generates a performance report. It supports authentication, concurrency control, cache bypassing, and more.

Screenshot of Siteprobe statistics

Features

  • Fetch and parse sitemap.xml to extract URLs, including nested Sitemap Index files recursively.
  • Check the existence and response times of each URL.
  • Generate a detailed performance report as CSV, JSON, or HTML.
  • Support for Basic Authentication.
  • Adjustable concurrency limits for request handling.
  • Configurable request timeout settings.
  • Support for configuring rate limits, such as 300 requests per 5-minute interval.
  • Optional redirect following; for security, Basic Authentication credentials are not forwarded across redirects.
  • Filtering and reporting slow URLs based on a threshold.
  • Custom User-Agent header support.
  • Option to append random timestamps to URLs to bypass caching mechanisms.
  • Save downloaded documents for further inspection or use as a static site mirror.
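
To make the first feature more concrete, the sketch below pulls the `<loc>` values out of a sitemap document with standard shell tools. It is only a rough illustration and not how Siteprobe itself parses sitemaps: the function name `extract_locs` is hypothetical, and it assumes each `<loc>` element sits on a single line without CDATA or surrounding whitespace.

```shell
# Hypothetical helper (not part of Siteprobe): print every <loc> value
# found in a sitemap or Sitemap Index document read from stdin.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's/<\/loc>//'
}
```

In a Sitemap Index the extracted values are themselves sitemap URLs, which is why the fetch has to recurse before it reaches the actual pages.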

Installation

Run without installing

uvx siteprobe https://example.com/sitemap.xml
# or
pipx run siteprobe https://example.com/sitemap.xml

Install via package manager

# Homebrew (macOS/Linux)
brew install bartTC/siteprobe/siteprobe

# pip / pipx
pip install siteprobe
pipx install siteprobe

# Cargo
cargo install siteprobe

Build from source

git clone https://github.com/bartTC/siteprobe.git
cd siteprobe
cargo build --release

Usage

siteprobe <sitemap_url> [OPTIONS]

Arguments

  • <sitemap_url> - The URL of the sitemap to be fetched and processed.

Options

Usage: siteprobe [OPTIONS] <SITEMAP_URL>

Arguments:
  <SITEMAP_URL>  The URL of the sitemap to be fetched and processed.

Options:
      --basic-auth <BASIC_AUTH>
          Basic authentication credentials in the format `username:password`
  -H, --header <HEADERS>
          Custom header to include in each request (format: 'Name: Value'). Can
          be specified multiple times.
  -c, --concurrency-limit <CONCURRENCY_LIMIT>
          Maximum number of concurrent requests allowed [default: 4]
  -l, --rate-limit <RATE_LIMIT>
          The rate limit for all requests in the format 'requests/time[unit]',
          where unit can be seconds (`s`), minutes (`m`), or hours (`h`). E.g.
          '-l 300/5m' for 300 requests per 5 minutes, or '-l 100/1h' for 100
          requests per hour.
  -o, --output-dir <OUTPUT_DIR>
          Directory where all downloaded documents will be saved
  -a, --append-timestamp
          Append a random timestamp to each URL to bypass caching mechanisms
  -r, --report-path <REPORT_PATH>
          File path for storing the generated `report.csv`
  -j, --report-path-json <REPORT_PATH_JSON>
          File path for storing the generated `report.json`
      --report-path-html <REPORT_PATH_HTML>
          File path for storing the generated `report.html`
  -t, --request-timeout <REQUEST_TIMEOUT>
          Default timeout (in seconds) for each request [default: 10]
      --user-agent <USER_AGENT>
          Custom User-Agent header to be used in requests [default: "Mozilla/5.0
          (compatible; Siteprobe/1.3.0)"]
      --slow-num <SLOW_NUM>
          Limit the number of slow documents displayed in the report. [default:
          100]
  -s, --slow-threshold <SLOW_THRESHOLD>
          Show slow responses. The value is the threshold (in seconds) for
          considering a document as 'slow'. E.g. '-s 3' for 3 seconds or '-s
          0.05' for 50ms.
  -f, --follow-redirects
          Controls automatic redirects. When enabled, the client will follow
          HTTP redirects (up to 10 by default). Note that for security, Basic
          Authentication credentials are intentionally not forwarded during
          redirects to prevent unintended credential exposure.
      --retries <RETRIES>
          Number of retries for failed requests (network errors or 5xx
          responses) [default: 0]
      --json
          Output the JSON report to stdout instead of the normal table output.
          Suppresses all other console output for clean piping.
      --config <CONFIG>
          Path to a TOML config file. Defaults to `.siteprobe.toml` in the
          current directory.
  -h, --help
          Print help
  -V, --version
          Print version

EXIT CODES:
0  All URLs returned 2xx (success)
1  One or more URLs returned 4xx/5xx or failed
2  One or more URLs exceeded the slow threshold (--slow-threshold)
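
The exit codes make Siteprobe easy to use in scripts and CI jobs. A minimal sketch of how a wrapper script might interpret them follows; the function name `describe_exit` and the message wording are assumptions made for illustration, only the three codes themselves come from the help output above.

```shell
# Hypothetical helper: translate a Siteprobe exit status into a short message.
# Codes 0, 1 and 2 are the documented ones; anything else is reported as
# unexpected.
describe_exit() {
  case "$1" in
    0) echo "ok: all URLs returned 2xx" ;;
    1) echo "error: one or more URLs returned 4xx/5xx or failed" ;;
    2) echo "slow: one or more URLs exceeded the slow threshold" ;;
    *) echo "unexpected exit status: $1" ;;
  esac
}

# Example use in a script:
#   siteprobe https://example.com/sitemap.xml --slow-threshold 3
#   describe_exit "$?"
```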

Authentication & Custom Headers

Siteprobe supports several ways to authenticate requests:

# Basic Authentication
siteprobe https://example.com/sitemap.xml --basic-auth user:password

# Bearer token (via custom header)
siteprobe https://example.com/sitemap.xml -H "Authorization: Bearer <token>"

# Send a session cookie
siteprobe https://example.com/sitemap.xml -H "Cookie: sessionid=abc123def456"

You can combine multiple -H flags to send several custom headers at once:

siteprobe https://example.com/sitemap.xml \
  -H "Authorization: Bearer <token>" \
  -H "Cookie: sessionid=abc123" \
  -H "X-Custom-Header: value"

If both --basic-auth and -H "Authorization: ..." are provided, the -H value takes precedence.
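
Both options compete for the same Authorization request header, so only one of them can be sent. As a reminder of what HTTP Basic Authentication places in that header (the Base64 encoding of `username:password`, per RFC 7617), here is a small sketch. The helper `basic_auth_header` is hypothetical, and it is an assumption that `--basic-auth` produces an equivalent header internally.

```shell
# Hypothetical helper: build a Basic Authentication header line.
# $1 = username, $2 = password
basic_auth_header() {
  # tr strips any line wrapping that some base64 implementations add.
  printf 'Authorization: Basic %s\n' \
    "$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')"
}

# Example: pass the result through the documented -H option.
#   siteprobe https://example.com/sitemap.xml -H "$(basic_auth_header user password)"
```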

Example Usage

# Fetch and analyze a sitemap with default settings
siteprobe https://example.com/sitemap.xml

# Save the report to a specific file and store downloaded documents in a directory
siteprobe https://example.com/sitemap.xml --report-path ./results/report.csv --output-dir ./example.com

# Set concurrency limit to 10 and timeout to 5 seconds
siteprobe https://example.com/sitemap.xml --concurrency-limit 10 --request-timeout 5
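
The `-l`/`--rate-limit` option takes a spec such as `300/5m`. To make the arithmetic behind that format explicit, the sketch below expands a spec into a request count and a window length in seconds. The helper `parse_rate_limit` is hypothetical and is not Siteprobe's parser; it assumes a unit of `s`, `m`, or `h` is always present and that both numbers are plain decimal integers without leading zeros.

```shell
# Hypothetical helper: expand a 'requests/time[unit]' spec into seconds.
# Prints "<requests> <window_seconds>", or returns 1 for a malformed spec.
parse_rate_limit() {
  spec=$1
  requests=${spec%%/*}            # text before the first slash
  window=${spec#*/}               # text after the first slash, e.g. "5m"
  amount=${window%?}              # everything except the last character
  unit=${window#"$amount"}        # the last character
  case "$unit" in
    s) factor=1 ;;
    m) factor=60 ;;
    h) factor=3600 ;;
    *) return 1 ;;
  esac
  case "$requests" in ''|*[!0-9]*) return 1 ;; esac
  case "$amount" in ''|*[!0-9]*) return 1 ;; esac
  echo "$requests $((amount * factor))"
}

# Example: the spec from the help text, 300 requests per 5 minutes.
#   parse_rate_limit 300/5m
#   siteprobe https://example.com/sitemap.xml --rate-limit 300/5m
```

Read this way, `300/5m` allows 300 requests in any 300-second window, which averages out to one request per second.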
