Crawl sitemap URLs with Crawl4AI and export as Markdown

Crawlboy

Sequentially crawls every URL from a sitemap (including nested sitemap indexes) with Crawl4AI and writes one Markdown file per page.
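As a rough illustration of how nested sitemap indexes can be walked, here is a minimal sketch using only the standard library. It is not Crawlboy's actual code: the function name iter_page_urls and the injected fetch callable are hypothetical, chosen so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def iter_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    _seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs, following <sitemapindex> entries recursively.

    `fetch` is a hypothetical callable returning the XML text for a URL.
    """
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen:  # guard against indexes that reference each other
        return
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # An index lists child sitemaps; recurse into each one in order.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                yield from iter_page_urls(loc.text.strip(), fetch, seen)
    else:
        # A plain <urlset> lists the pages themselves.
        for loc in root.findall("sm:url/sm:loc", NS):
            if loc.text:
                yield loc.text.strip()
```

Passing fetch in as a parameter keeps the traversal logic separate from HTTP handling, so it can be exercised with an in-memory dictionary of XML strings.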

Output mirrors the URL path under {out-dir}/md/, where {out-dir} is the value of --out-dir: each path segment becomes a directory and the last segment becomes the filename.

/blog/articles/basic-git-commands/
→ {out-dir}/md/blog/articles/basic-git-commands.md

Site root (/) maps to {out-dir}/md/index.md. Failures are logged to {out-dir}/errors.jsonl.
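The mapping above can be sketched in a few lines of Python. This is an illustration under assumptions, not the package's own implementation: url_to_md_path is a hypothetical name, and details the README does not specify (query strings, segments that already carry a file extension) are handled in the simplest way.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def url_to_md_path(url: str, out_dir: str) -> Path:
    """Map a page URL to its Markdown path under {out-dir}/md/."""
    path = unquote(urlsplit(url).path)
    # Drop empty segments (leading/trailing slashes) and dot segments,
    # so the result can never escape the output directory.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    base = Path(out_dir) / "md"
    if not segments:
        return base / "index.md"  # site root
    *dirs, last = segments
    return base.joinpath(*dirs) / f"{last}.md"
```

For example, url_to_md_path("https://example.com/blog/articles/basic-git-commands/", "out") yields out/md/blog/articles/basic-git-commands.md, and a URL with path / yields out/md/index.md.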

Features

  • Sitemap discovery — auto-detect from robots.txt or common paths, or provide a direct URL
  • Nested sitemap indexes — recursively follows <sitemapindex> entries
  • Markdown output — one .md file per page, mirroring URL structure
  • HTML output — optional raw HTML under {out-dir}/html/ with --save-html
  • Image download — saves images to {out-dir}/media/ (content-addressed, deduped) and rewrites paths in Markdown/HTML with --download-images
  • Interactive CLI — guided wizard with questionary and Rich
  • Docker support — runs headless out of the box
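"Content-addressed, deduped" image storage generally means the filename is derived from a hash of the file's bytes, so identical images fetched from different pages are written only once. The sketch below shows the idea; the function name store_image and the choice of SHA-256 are assumptions, since the README does not say which hash or naming scheme Crawlboy uses.

```python
import hashlib
from pathlib import Path


def store_image(data: bytes, suffix: str, media_dir: Path) -> Path:
    """Write image bytes under media_dir, named by content hash.

    SHA-256 is an assumption made for this sketch.
    """
    digest = hashlib.sha256(data).hexdigest()
    target = media_dir / f"{digest}{suffix}"
    if not target.exists():  # identical bytes map to the same file: deduped
        media_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target
```

The returned path is what a rewriter would substitute for the original image URL in the Markdown or HTML output.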

Installation

pip install crawlboy
crawl4ai-setup

crawl4ai-setup installs Playwright/Chromium and can use several hundred MB of disk space. If something fails, run crawl4ai-doctor.

From source

git clone https://github.com/aksharahegde/crawlboy.git
cd crawlboy
pip install -e .
crawl4ai-setup

Usage

Direct sitemap URL

crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out

Auto-discover from site root

crawlboy --site-url 'https://www.example.com' --out-dir ./out

Discovery order: robots.txt, then sitemap.xml, sitemap_index.xml, sitemap-index.xml, and wp-sitemap.xml.
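One way to picture the discovery order is as an ordered list of candidate URLs: sitemaps declared in robots.txt first, then the common paths. The following is a sketch only; the function names are hypothetical, and how Crawlboy actually probes each candidate is not described here.

```python
from typing import List
from urllib.parse import urljoin

# Common fallback locations, in the order listed above.
FALLBACK_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap-index.xml",
    "/wp-sitemap.xml",
]


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Extract URLs from 'Sitemap:' lines in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def discovery_candidates(site_url: str, robots_txt: str = "") -> List[str]:
    """Return candidate sitemap URLs: robots.txt entries, then fallbacks."""
    listed = sitemaps_from_robots(robots_txt)
    fallbacks = [urljoin(site_url, p) for p in FALLBACK_PATHS]
    return listed + [u for u in fallbacks if u not in listed]
```

A caller would request each candidate in turn and use the first one that returns a valid sitemap.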

By default with --site-url, only URLs matching the site origin host are crawled. Use --include-offsite-urls to crawl all listed URLs.
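The host filter can be thought of as a comparison between each URL's hostname and the hostname of --site-url. This sketch assumes an exact, case-insensitive match, so www.example.com and example.com count as different hosts; whether Crawlboy treats them the same way is not stated. The function name filter_to_site_host is hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def filter_to_site_host(
    urls: Iterable[str], site_url: str, include_offsite: bool = False
) -> List[str]:
    """Keep only URLs whose host matches the site URL's host.

    include_offsite=True mirrors the effect of --include-offsite-urls.
    """
    if include_offsite:
        return list(urls)
    host = (urlsplit(site_url).hostname or "").lower()
    return [u for u in urls if (urlsplit(u).hostname or "").lower() == host]
```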

Interactive mode

crawlboy --interactive
# or
crawlboy -i

The wizard walks through URL mode, output directory, crawl options, and advanced settings, then asks for confirmation before starting. Interactive mode requires a TTY; in Docker, use docker run -it ....

Options

Flag                     Description                                          Default
--sitemap-url            Direct sitemap URL
--site-url               Site root for auto-discovery
--out-dir                Output directory
--delay                  Seconds to wait after each page                      0
--page-timeout-ms        Navigation timeout in ms                             60000
--max-urls               Cap number of URLs (for testing)                     unlimited
--save-html              Write raw HTML under html/                           off
--download-images        Save images under media/ and rewrite paths           off
--no-headless            Show the browser window                              off
--fail-fast              Stop on first crawl error                            off
--include-offsite-urls   Crawl hosts outside site origin (--site-url only)    off
-i, --interactive        Launch guided wizard                                 off
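To show how --delay, --max-urls, and --fail-fast interact in a sequential crawl, here is a simplified loop. It is a sketch under stated assumptions, not Crawlboy's source: crawl_one stands in for the real Crawl4AI call, and the {"url", "error"} record shape written to errors.jsonl is hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterable, Optional


def crawl_all(
    urls: Iterable[str],
    crawl_one: Callable[[str], None],  # hypothetical stand-in for the crawler
    out_dir: str,
    delay: float = 0.0,
    max_urls: Optional[int] = None,
    fail_fast: bool = False,
) -> int:
    """Crawl URLs one at a time; return the number that succeeded."""
    errors_path = Path(out_dir) / "errors.jsonl"
    errors_path.parent.mkdir(parents=True, exist_ok=True)
    succeeded = 0
    for index, url in enumerate(urls):
        if max_urls is not None and index >= max_urls:
            break  # --max-urls cap reached
        try:
            crawl_one(url)
            succeeded += 1
        except Exception as exc:
            # One JSON object per line; the field names are an assumption.
            with errors_path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps({"url": url, "error": str(exc)}) + "\n")
            if fail_fast:
                break  # --fail-fast: stop on the first error
        time.sleep(delay)  # --delay: pause after each page
    return succeeded
```

Without fail_fast, a failing page is recorded and the loop moves on, which matches the documented behavior of logging failures to errors.jsonl.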

Docker

docker build -t crawlboy .
docker run --rm -v "$(pwd)/out:/out" crawlboy \
  --site-url 'https://www.example.com' --out-dir /out

Contributing

Contributions are welcome! Please follow these guidelines:

Code Style

  • Use Python 3.10+ compatible syntax
  • Format code with Black — run black . before committing
  • Lint with ruff — run ruff check .
  • Follow PEP 8 conventions

Commit Messages

  • Use present tense, imperative mood ("add feature", not "added feature")
  • Be concise and descriptive (under 70 characters for the subject line)
  • Reference issues if applicable (e.g., "Fix #123")
  • Example: Fix image path rewriting for nested URLs

Pull Requests

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit your changes with clear messages
  4. Test thoroughly before pushing
  5. Open a PR with a clear description of changes

Testing

  • Test the interactive CLI locally: crawlboy -i
  • Test with both --site-url and --sitemap-url modes
  • Verify output structure matches documentation
  • Test with --download-images and --save-html flags
  • Run against a small sitemap first (use --max-urls)

Reporting Issues

Include:

  • Steps to reproduce
  • Python version and OS
  • Full error output or stack trace
  • Sample sitemap URL (if possible)
