Crawl sitemap URLs with Crawl4AI and export as Markdown
Crawlboy
Sequentially crawls every URL from a sitemap (including nested sitemap indexes) with Crawl4AI and writes one Markdown file per page.
Output mirrors the URL path under `{out-dir}/md/`: each path segment becomes a directory, and the last segment becomes the filename with a `.md` extension.
/blog/articles/basic-git-commands/
→ {out-dir}/md/blog/articles/basic-git-commands.md
Site root (/) maps to {out-dir}/md/index.md. Failures are logged to {out-dir}/errors.jsonl.
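The mapping above can be sketched in a few lines of Python. This is only an illustration of the documented rule, not Crawlboy's actual code: the function name `url_to_md_path` and the default `out_dir` are hypothetical, and the sketch ignores cases the description does not cover (query strings, file extensions already present in the URL).

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_md_path(url: str, out_dir: str = "out") -> str:
    """Map a page URL to a Markdown path under {out_dir}/md/ (illustrative only)."""
    # Split the URL path into its non-empty segments,
    # e.g. ["blog", "articles", "basic-git-commands"].
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        # The site root has no segments, so it maps to index.md.
        return str(PurePosixPath(out_dir, "md", "index.md"))
    # Every segment but the last becomes a directory; the last becomes the file.
    *dirs, last = segments
    return str(PurePosixPath(out_dir, "md", *dirs, f"{last}.md"))
```

Calling `url_to_md_path("https://example.com/blog/articles/basic-git-commands/")` under these assumptions yields `out/md/blog/articles/basic-git-commands.md`.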
Features
- Sitemap discovery: auto-detect from `robots.txt` or common paths, or provide a direct URL
- Nested sitemap indexes: recursively follows `<sitemapindex>` entries
- Markdown output: one `.md` file per page, mirroring URL structure
- HTML output: optional raw HTML under `{out-dir}/html/` with `--save-html`
- Image download: saves images to `{out-dir}/media/` (content-addressed, deduped) and rewrites paths in Markdown/HTML with `--download-images`
- Interactive CLI: guided wizard with questionary and Rich
- Docker support: runs headless out of the box
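To make the nested sitemap index feature concrete, here is a minimal sketch of how a recursive walk over `<sitemapindex>` entries might look using only the standard library. It is not taken from Crawlboy's source: `iter_page_urls` and the injected `fetch` callable (which returns the XML text for a URL) are hypothetical names chosen so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    _seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs from a sitemap, following nested sitemap indexes."""
    seen = set() if _seen is None else _seen
    if sitemap_url in seen:
        return  # guard against sitemaps that reference each other
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    is_index = root.tag == f"{SITEMAP_NS}sitemapindex"

    # Children are <sitemap> elements in an index and <url> elements in a urlset;
    # both carry their target in a <loc> child.
    for child in root:
        loc = child.find(f"{SITEMAP_NS}loc")
        if loc is None or not loc.text:
            continue
        target = loc.text.strip()
        if is_index:
            yield from iter_page_urls(target, fetch, seen)
        else:
            yield target
```

In real use, `fetch` would wrap an HTTP client; passing it in keeps the traversal logic easy to test against fixture XML.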
Installation
pip install crawlboy
crawl4ai-setup
`crawl4ai-setup` installs Playwright/Chromium and can use several hundred MB of disk space. If something fails, run `crawl4ai-doctor`.
From source
git clone https://github.com/aksharahegde/crawlboy.git
cd crawlboy
pip install -e .
crawl4ai-setup
Usage
Direct sitemap URL
crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out
Auto-discover from site root
crawlboy --site-url 'https://www.example.com' --out-dir ./out
Discovery order: robots.txt → /sitemap.xml → /sitemap_index.xml → /sitemap-index.xml → /wp-sitemap.xml.
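The discovery order can be pictured as building an ordered candidate list. The sketch below is an assumption about how such a list might be assembled, not Crawlboy's implementation: `sitemaps_from_robots` and `candidate_sitemaps` are hypothetical helpers, and whether the real tool stops at the first candidate that resolves is not stated here.

```python
from typing import List
from urllib.parse import urljoin

# Fallback paths, in the order listed above.
FALLBACK_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap-index.xml",
    "/wp-sitemap.xml",
]


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Collect the values of 'Sitemap:' lines from a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(site_url: str, robots_txt: str = "") -> List[str]:
    """Return sitemap candidates: robots.txt entries first, then common paths."""
    fallbacks = [urljoin(site_url, path) for path in FALLBACK_PATHS]
    return sitemaps_from_robots(robots_txt) + fallbacks
```

A caller would then try each candidate in turn until one returns a valid sitemap.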
By default with --site-url, only URLs matching the site origin host are crawled. Use --include-offsite-urls to crawl all listed URLs.
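As a rough illustration of the host filter, the following sketch keeps only URLs whose hostname equals the site's hostname. The function name `filter_same_host` is hypothetical, and the exact comparison Crawlboy performs (for example, whether `www.example.com` and `example.com` count as the same host) is an assumption here; this version compares hostnames exactly.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def filter_same_host(
    urls: Iterable[str], site_url: str, include_offsite: bool = False
) -> List[str]:
    """Keep URLs on the site's host unless off-site URLs are explicitly allowed."""
    if include_offsite:
        # Mirrors the effect described for --include-offsite-urls: keep everything.
        return list(urls)
    site_host = urlparse(site_url).hostname
    return [u for u in urls if urlparse(u).hostname == site_host]
```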
Interactive mode
crawlboy --interactive
# or
crawlboy -i
The wizard walks through URL mode, output directory, crawl options, and advanced settings before confirming. It requires a TTY; in Docker, use `docker run -it ...`.
Options
| Flag | Description | Default |
|---|---|---|
| `--sitemap-url` | Direct sitemap URL | none |
| `--site-url` | Site root for auto-discovery | none |
| `--out-dir` | Output directory | none |
| `--delay` | Seconds to wait after each page | 0 |
| `--page-timeout-ms` | Navigation timeout in ms | 60000 |
| `--max-urls` | Cap number of URLs (for testing) | unlimited |
| `--save-html` | Write raw HTML under `html/` | off |
| `--download-images` | Save images under `media/` and rewrite paths | off |
| `--no-headless` | Show the browser window | off |
| `--fail-fast` | Stop on first crawl error | off |
| `--include-offsite-urls` | Crawl hosts outside site origin (`--site-url` only) | off |
| `-i`, `--interactive` | Launch guided wizard | off |
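How `--delay`, `--max-urls`, and `--fail-fast` might interact in a sequential crawl loop is sketched below. This is not Crawlboy's code: `crawl_all`, the injected `crawl_one` callable, and the `{"url", "error"}` record shape written to the errors file are all hypothetical, chosen only to illustrate the documented behaviour of capping URLs, pausing after each page, and logging failures as JSON lines.

```python
import json
import time
from typing import Callable, Iterable, Optional


def crawl_all(
    urls: Iterable[str],
    crawl_one: Callable[[str], None],
    errors_path: str,
    delay: float = 0.0,
    max_urls: Optional[int] = None,
    fail_fast: bool = False,
) -> int:
    """Crawl URLs one at a time and return the number of successful pages."""
    succeeded = 0
    with open(errors_path, "a", encoding="utf-8") as errors:
        for index, url in enumerate(urls):
            if max_urls is not None and index >= max_urls:
                break  # cap reached, like --max-urls
            try:
                crawl_one(url)
                succeeded += 1
            except Exception as exc:
                # One JSON object per line; the field names are an assumption.
                errors.write(json.dumps({"url": url, "error": str(exc)}) + "\n")
                if fail_fast:
                    raise  # stop on the first error, like --fail-fast
            if delay > 0:
                time.sleep(delay)  # wait after each page, like --delay
    return succeeded
```

Injecting `crawl_one` keeps the loop independent of the browser, so the option handling can be exercised with a stub before pointing it at a real site.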
Docker
docker build -t crawlboy .
docker run --rm -v "$(pwd)/out:/out" crawlboy \
--site-url 'https://www.example.com' --out-dir /out
Contributing
Contributions are welcome! Please follow these guidelines:
Code Style
- Use Python 3.10+ compatible syntax
- Format code with Black: run `black .` before committing
- Lint with ruff: run `ruff check .`
- Follow PEP 8 conventions
Commit Messages
- Use present tense, imperative mood ("add feature", not "added feature")
- Be concise and descriptive (under 70 characters for the subject line)
- Reference issues if applicable (e.g., "Fix #123")
- Example:
Fix image path rewriting for nested URLs
Pull Requests
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit your changes with clear messages
- Test thoroughly before pushing
- Open a PR with a clear description of changes
Testing
- Test the interactive CLI locally: `crawlboy -i`
- Test with both `--site-url` and `--sitemap-url` modes
- Verify output structure matches documentation
- Test with `--download-images` and `--save-html` flags
- Run against a small sitemap first (use `--max-urls`)
Reporting Issues
Include:
- Steps to reproduce
- Python version and OS
- Full error output or stack trace
- Sample sitemap URL (if possible)