Siteprobe

Siteprobe is a Rust-based CLI tool that fetches all URLs from a given sitemap.xml URL, checks that each one exists, and generates a performance report. It supports authentication, concurrency control, cache bypassing, and more.

Screenshot of Siteprobe statistics

Features

  • Fetch and parse sitemap.xml to extract URLs, including nested Sitemap Index files recursively.
  • Check the existence and response times of each URL.
  • Generate a detailed performance report in CSV, JSON, or HTML format.
  • Support for Basic Authentication.
  • Adjustable concurrency limits for request handling.
  • Configurable request timeout settings.
  • Configurable rate limits, such as 300 requests per 5 minutes.
  • Redirect handling with security precautions.
  • Filtering and reporting slow URLs based on a threshold.
  • Custom User-Agent header support.
  • Option to append random timestamps to URLs to bypass caching mechanisms.
  • Save downloaded documents for further inspection or use as a static site mirror.
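
To illustrate the first feature, the following Python sketch walks a sitemap and any nested Sitemap Index files recursively. It is not Siteprobe's Rust implementation; the function name collect_urls, the fetch callback, and the in-memory DOCS table are hypothetical and stand in for real HTTP requests.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return every page URL reachable from sitemap_url, following index files."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # A Sitemap Index lists further sitemaps; descend into each of them.
        urls: List[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls((loc.text or "").strip(), fetch, seen))
        return urls
    # A plain sitemap lists the page URLs directly.
    return [(loc.text or "").strip() for loc in root.findall("sm:url/sm:loc", NS)]


# Hypothetical in-memory documents so the sketch runs without network access.
DOCS = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/</loc></url>"
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    ),
}

print(collect_urls("https://example.com/sitemap.xml", DOCS.__getitem__))
```

The seen set is a design choice of this sketch: it keeps a misconfigured index that points back at itself from recursing forever.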

Installation

Run without installing

uvx siteprobe https://example.com/sitemap.xml
# or
pipx run siteprobe https://example.com/sitemap.xml

Install via package manager

# Homebrew (macOS/Linux)
brew install bartTC/siteprobe/siteprobe

# pip / pipx
pip install siteprobe
pipx install siteprobe

# Cargo
cargo install siteprobe

Build from source

git clone https://github.com/bartTC/siteprobe.git
cd siteprobe
cargo build --release

Usage

siteprobe <sitemap_url> [OPTIONS]

Arguments

  • <sitemap_url> - The URL of the sitemap to be fetched and processed.

Options

Usage: siteprobe [OPTIONS] <SITEMAP_URL>

Arguments:
  <SITEMAP_URL>  The URL of the sitemap to be fetched and processed.

Options:
      --basic-auth <BASIC_AUTH>
          Basic authentication credentials in the format `username:password`
  -H, --header <HEADERS>
          Custom header to include in each request (format: 'Name: Value'). Can
          be specified multiple times.
  -c, --concurrency-limit <CONCURRENCY_LIMIT>
          Maximum number of concurrent requests allowed [default: 4]
  -l, --rate-limit <RATE_LIMIT>
          The rate limit for all requests in the format 'requests/time[unit]',
          where unit can be seconds (`s`), minutes (`m`), or hours (`h`). E.g.
          '-l 300/5m' for 300 requests per 5 minutes, or '-l 100/1h' for 100
          requests per hour.
  -o, --output-dir <OUTPUT_DIR>
          Directory where all downloaded documents will be saved
  -a, --append-timestamp
          Append a random timestamp to each URL to bypass caching mechanisms
  -r, --report-path <REPORT_PATH>
          File path for storing the generated `report.csv`
  -j, --report-path-json <REPORT_PATH_JSON>
          File path for storing the generated `report.json`
      --report-path-html <REPORT_PATH_HTML>
          File path for storing the generated `report.html`
  -t, --request-timeout <REQUEST_TIMEOUT>
          Default timeout (in seconds) for each request [default: 10]
      --user-agent <USER_AGENT>
          Custom User-Agent header to be used in requests [default: "Mozilla/5.0
          (compatible; Siteprobe/1.3.0)"]
      --slow-num <SLOW_NUM>
          Limit the number of slow documents displayed in the report. [default:
          100]
  -s, --slow-threshold <SLOW_THRESHOLD>
          Show slow responses. The value is the threshold (in seconds) for
          considering a document as 'slow'. E.g. '-s 3' for 3 seconds or '-s
          0.05' for 50ms.
  -f, --follow-redirects
          Controls automatic redirects. When enabled, the client will follow
          HTTP redirects (up to 10 by default). Note that for security, Basic
          Authentication credentials are intentionally not forwarded during
          redirects to prevent unintended credential exposure.
      --retries <RETRIES>
          Number of retries for failed requests (network errors or 5xx
          responses) [default: 0]
      --json
          Output the JSON report to stdout instead of the normal table output.
          Suppresses all other console output for clean piping.
      --config <CONFIG>
          Path to a TOML config file. Defaults to `.siteprobe.toml` in the
          current directory.
  -h, --help
          Print help
  -V, --version
          Print version

EXIT CODES:
0  All URLs returned 2xx (success)
1  One or more URLs returned 4xx/5xx or failed
2  One or more URLs exceeded the slow threshold (--slow-threshold)
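
The exit codes above make it possible to act on a run from a script or CI job. The following Python sketch shows one way to do that. The helper names describe_exit and run_siteprobe are hypothetical, and run_siteprobe assumes the siteprobe binary is on PATH.

```python
import subprocess
from typing import List

# Meanings taken from the EXIT CODES section of the help output.
EXIT_MEANINGS = {
    0: "all URLs returned 2xx",
    1: "one or more URLs returned 4xx/5xx or failed",
    2: "one or more URLs exceeded the slow threshold",
}


def describe_exit(code: int) -> str:
    """Translate a siteprobe exit code into a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_siteprobe(args: List[str]) -> int:
    """Run siteprobe with the given arguments and return its exit code."""
    return subprocess.run(["siteprobe", *args]).returncode


# Example call (not executed here because it needs siteprobe installed):
#   code = run_siteprobe(["https://example.com/sitemap.xml", "--slow-threshold", "3"])
#   print(describe_exit(code))
```

A CI job could, for example, fail the build on exit code 1 and only emit a warning on exit code 2.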

Authentication & Custom Headers

Siteprobe supports several ways to authenticate requests:

# Basic Authentication
siteprobe https://example.com/sitemap.xml --basic-auth user:password

# Bearer token (via custom header)
siteprobe https://example.com/sitemap.xml -H "Authorization: Bearer <token>"

# Send a session cookie
siteprobe https://example.com/sitemap.xml -H "Cookie: sessionid=abc123def456"

You can combine multiple -H flags to send several custom headers at once:

siteprobe https://example.com/sitemap.xml \
  -H "Authorization: Bearer <token>" \
  -H "Cookie: sessionid=abc123" \
  -H "X-Custom-Header: value"

If both --basic-auth and -H "Authorization: ..." are provided, the -H value takes precedence.
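
The precedence rule can be pictured as a simple merge in which headers from -H are applied after the one derived from --basic-auth. The Python sketch below illustrates the idea only. The function build_headers is hypothetical, and the case-insensitive matching of header names is an assumption rather than documented Siteprobe behavior.

```python
import base64
from typing import Dict, List, Optional


def build_headers(basic_auth: Optional[str], custom: List[str]) -> Dict[str, str]:
    """Merge --basic-auth and -H values; custom headers are applied last."""
    headers: Dict[str, str] = {}
    if basic_auth:
        # HTTP Basic auth sends base64("username:password").
        token = base64.b64encode(basic_auth.encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {token}"
    for raw in custom:
        # Split 'Name: Value' at the first colon only.
        name, _, value = raw.partition(":")
        # Assumption: names are normalized so 'authorization' overrides too.
        headers[name.strip().title()] = value.strip()
    return headers


print(build_headers("user:password", ["Authorization: Bearer <token>"]))
```

With only --basic-auth given, the Authorization header carries the Basic credentials. Once an -H "Authorization: ..." value is present, it replaces them.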

Example Usage

# Fetch and analyze a sitemap with default settings
siteprobe https://example.com/sitemap.xml

# Save the report to a specific file and the downloaded documents to a directory
siteprobe https://example.com/sitemap.xml --report-path ./results/report.csv --output-dir ./example.com

# Set concurrency limit to 10 and timeout to 5 seconds
siteprobe https://example.com/sitemap.xml --concurrency-limit 10 --request-timeout 5

Changelog

v1.3.0 (2026-02-16)

  • Added gzip sitemap support. Siteprobe now handles .xml.gz sitemaps, detecting gzip compression via URL suffix or magic bytes and decompressing automatically. Sitemap index files referencing .xml.gz entries are also supported.
  • Added meaningful exit codes for CI/CD integration: 0 for success, 1 if any URL returned 4xx/5xx or failed, 2 if any URL exceeded the slow threshold (--slow-threshold).
  • Added --retries N option (default: 0) to retry failed requests. Retries on network errors or 5xx responses with a 1-second delay between attempts.
  • Added --json flag to output the JSON report to stdout, suppressing all other console output for clean piping into other tools.
  • Added --report-path-html option to generate a self-contained HTML report with summary statistics, response time distribution histogram, status code breakdown chart, and a sortable table of all responses.
  • Added -H / --header option to send custom headers with every request. Supports any Name: Value format and can be repeated for multiple headers. Useful for token-based auth, session cookies, API keys, etc. Also supported in the .siteprobe.toml config file via the headers array field.
  • Added .siteprobe.toml config file support. Options can be set in a TOML file (loaded from the current directory by default, or via --config). CLI arguments take priority over config file values.
  • Updated README with all installation methods (uvx, pipx, Homebrew, pip, Cargo).
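
As an illustration of the gzip detection described in this release, the Python sketch below checks both the URL suffix and the gzip magic bytes (0x1f 0x8b). It is not Siteprobe's Rust code. The function names are hypothetical, and falling back to the raw body when decompression fails is an assumption of this sketch.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(url: str, body: bytes) -> bool:
    """Return True if the URL suffix or the leading bytes suggest gzip."""
    return url.lower().endswith(".gz") or body[:2] == GZIP_MAGIC


def decode_sitemap(url: str, body: bytes) -> bytes:
    """Return the sitemap body, decompressed when it appears to be gzipped."""
    if looks_gzipped(url, body):
        try:
            return gzip.decompress(body)
        except OSError:
            # Assumption: a '.gz' URL whose body is not gzip data (for example
            # because it was already decompressed in transit) is used as-is.
            return body
    return body


raw = b"<urlset></urlset>"
print(decode_sitemap("https://example.com/sitemap.xml.gz", gzip.compress(raw)))
```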

v1.2.2 (2026-02-16)

  • Downgraded Rust edition from 2024 to 2021 for compatibility with older Rust toolchains (e.g., Cargo 1.75 shipped with Ubuntu). Replaced let chains and adjusted never-type fallback usage to compile under edition 2021.
  • Switched TLS backend from OpenSSL to rustls. This eliminates the runtime dependency on system OpenSSL libraries, fixing "libssl not found" errors when installing via uvx/pip on Linux.

v1.2.1 (2026-01-20)

  • Added Homebrew installation support (brew install bartTC/siteprobe/siteprobe).
  • Added PyPI installation support (pip install siteprobe or pipx install siteprobe).
  • Shortened package description for Homebrew compatibility.

v1.2.0 (2026-01-01)

  • Added tilde (~) expansion support for path arguments (--report-path, --report-path-json, --output-dir). Previously, using the = syntax (e.g., --report-path-json=~/report.json) would fail because the shell doesn't expand ~ in that context.
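
The problem is that with --report-path-json=~/report.json the program receives the literal ~ and has to expand it itself. The Python sketch below shows the general idea using os.path.expanduser. The helper expand_option is hypothetical and is not how Siteprobe implements it in Rust.

```python
import os


def expand_option(arg: str) -> str:
    """Expand a leading '~' in the value part of a '--flag=value' argument."""
    flag, sep, value = arg.partition("=")
    if not sep:
        return arg  # no '=value' part, nothing to expand
    return flag + sep + os.path.expanduser(value)


print(expand_option("--report-path-json=~/report.json"))
```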

v1.1.0 (2025-11-23)

  • Fixed a division by zero error when the sitemap contains no URLs or no URLs are processed.
  • Fixed table border misalignment in the report by replacing emojis, whose display width is handled inconsistently.
  • Fixed potential integer overflow in random number generation.
  • Fixed type mismatches for SLOW_NUM and request_timeout options.

v1.0.0 (2025-09-05)

  • Siteprobe has demonstrated stability and maturity, making it suitable for a v1.0 release.

v0.5.2 (2025-05-11)

  • Fixed an issue where the calculated rate could fall below the rate limiter's threshold of 1 request per minute.

v0.5.0 (2025-06-07)

  • Improved the clarity of error messages.
  • Introduced a new rate-limiting feature, allowing users to define the rate at which sitemap URLs are fetched. E.g.: 60 requests per minute (-l 60/1m) or 300 requests every 5 minutes (-l 300/5m).
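
The rate-limit format can be read as a request count and a time window. The Python sketch below parses values such as 300/5m into a (requests, seconds) pair. It is illustrative only: the function names are hypothetical, the sketch requires an explicit unit, and spacing requests evenly across the window is an assumption about how a limiter might use the value.

```python
import re
from typing import Tuple

# Seconds per supported unit: seconds, minutes, hours.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_RATE_RE = re.compile(r"^(\d+)/(\d+)([smh])$")


def parse_rate_limit(spec: str) -> Tuple[int, int]:
    """Parse 'requests/time[unit]' into (requests, window_in_seconds)."""
    match = _RATE_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"invalid rate limit: {spec!r}")
    requests, amount, unit = match.groups()
    window = int(amount) * _UNIT_SECONDS[unit]
    if int(requests) == 0 or window == 0:
        raise ValueError(f"rate limit must be positive: {spec!r}")
    return int(requests), window


def seconds_between_requests(spec: str) -> float:
    """Average delay between requests if they are spread evenly (assumption)."""
    requests, window = parse_rate_limit(spec)
    return window / requests


print(parse_rate_limit("300/5m"), seconds_between_requests("300/5m"))
```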

v0.4.0 (2025-05-11)

  • An appropriate error message is now displayed for an invalid sitemap URL.

v0.3.0 (2025-04-27)

  • Introduced the --report-path-json option to generate a detailed request and performance report in JSON format.

v0.2.0 (2025-03-12)

  • The 'slow responses' list is now optional and will only be displayed if the --slow-threshold option is specified.
  • The progress bar now shows the estimated remaining time.
  • Fixed an issue where the follow redirect option was not functioning as expected.

v0.1.0 (2025-03-11)

  • Initial release with all core features.