Crawl sitemap URLs with Crawl4AI and export as Markdown
Crawlboy
Sequentially crawls every URL from a sitemap (including nested sitemap indexes) with Crawl4AI and writes one Markdown file per page.
Output mirrors the URL path under --out-dir/md/: each path segment becomes a directory and the last segment becomes the filename.
/blog/articles/basic-git-commands/
→ {out-dir}/md/blog/articles/basic-git-commands.md
Site root (/) maps to {out-dir}/md/index.md. Failures are logged to {out-dir}/errors.jsonl.
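The mapping described above can be sketched in a few lines of Python. This is an illustration of the documented rule only, not Crawlboy's actual code: the function name `url_to_md_path` is hypothetical, and query strings and unusual path segments are ignored here.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_md_path(url: str, out_dir: str) -> str:
    """Map a page URL to a Markdown path under {out_dir}/md/ (illustrative only)."""
    # Keep only non-empty path segments, so leading and trailing slashes are ignored.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    base = PurePosixPath(out_dir) / "md"
    if not segments:
        # The site root has no segments and maps to index.md.
        return str(base / "index.md")
    # Every segment but the last becomes a directory; the last becomes the filename.
    *dirs, last = segments
    return str(base.joinpath(*dirs, last + ".md"))
```

For example, `url_to_md_path("https://example.com/blog/articles/basic-git-commands/", "out")` yields `out/md/blog/articles/basic-git-commands.md`.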
Features
- Sitemap discovery: auto-detect from `robots.txt` or common paths, or provide a direct URL
- Nested sitemap indexes: recursively follows `<sitemapindex>` entries
- Markdown output: one `.md` file per page, mirroring URL structure
- HTML output: optional raw HTML under `{out-dir}/html/` with `--save-html`
- Image download: saves images to `{out-dir}/media/` (content-addressed, deduped) and rewrites paths in Markdown/HTML with `--download-images`
- Interactive CLI: guided wizard with questionary and Rich
- Docker support: runs headless out of the box
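To illustrate what "content-addressed, deduped" means for the image download feature, here is a minimal sketch. It assumes SHA-256 as the hash and a `media/<digest><ext>` naming scheme; both are assumptions for illustration, and `plan_media_files` is a hypothetical helper, not part of Crawlboy.

```python
import hashlib


def plan_media_files(downloads: dict[str, tuple[bytes, str]]):
    """Plan media storage for downloaded images.

    downloads maps a source URL to (image bytes, file extension).
    Returns (rewrites, files): rewrites maps each source URL to the relative
    path to use in Markdown/HTML; files maps each relative path to its bytes.
    """
    rewrites: dict[str, str] = {}
    files: dict[str, bytes] = {}
    for url, (data, ext) in downloads.items():
        # The filename is derived from the content, so identical bytes
        # fetched from different URLs collapse to a single stored file.
        rel = f"media/{hashlib.sha256(data).hexdigest()}{ext}"
        files.setdefault(rel, data)
        rewrites[url] = rel
    return rewrites, files
```

Because the name depends only on the bytes, an image that appears on many pages is written once and every reference is rewritten to the same path.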
Installation
pip install crawlboy
crawl4ai-setup
`crawl4ai-setup` installs Playwright/Chromium and can use several hundred MB of disk space. If something fails, run `crawl4ai-doctor`.
From source
git clone https://github.com/aksharahegde/crawlboy.git
cd crawlboy
pip install -e .
crawl4ai-setup
Usage
Direct sitemap URL
crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out
Auto-discover from site root
crawlboy --site-url 'https://www.example.com' --out-dir ./out
Discovery order: robots.txt → /sitemap.xml → /sitemap_index.xml → /sitemap-index.xml → /wp-sitemap.xml.
By default with --site-url, only URLs matching the site origin host are crawled. Use --include-offsite-urls to crawl all listed URLs.
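As a rough sketch of the default host filter, the function below keeps only URLs whose hostname equals the site's hostname. Exact, case-insensitive hostname matching is an assumption made for illustration; Crawlboy's real rule (for example around redirects between a bare domain and its `www.` variant) may differ, and `filter_same_host` is a hypothetical name.

```python
from urllib.parse import urlsplit


def filter_same_host(urls: list[str], site_url: str) -> list[str]:
    """Keep URLs whose hostname matches the site's hostname (illustrative only)."""
    # urlsplit().hostname is already lower-cased; fall back to "" if missing.
    site_host = urlsplit(site_url).hostname or ""
    return [u for u in urls if (urlsplit(u).hostname or "") == site_host]
```

With `--include-offsite-urls`, the equivalent of this filtering step would simply be skipped.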
Interactive mode
crawlboy --interactive
# or
crawlboy -i
The wizard walks through URL mode, output directory, crawl options, and advanced settings before confirming. Requires a TTY — for Docker use docker run -it ....
Options
| Flag | Description | Default |
|---|---|---|
| `--sitemap-url` | Direct sitemap URL | (none) |
| `--site-url` | Site root for auto-discovery | (none) |
| `--out-dir` | Output directory | (none) |
| `--delay` | Seconds to wait after each page | 0 |
| `--page-timeout-ms` | Navigation timeout in ms | 60000 |
| `--max-urls` | Cap number of URLs (for testing) | unlimited |
| `--save-html` | Write raw HTML under `html/` | off |
| `--download-images` | Save images under `media/` and rewrite paths | off |
| `--no-headless` | Show the browser window | off |
| `--fail-fast` | Stop on first crawl error | off |
| `--include-offsite-urls` | Crawl hosts outside site origin (`--site-url` only) | off |
| `--allow-unsafe-network-targets` | Allow private/loopback/link-local/reserved targets (unsafe) | off |
| `--max-sitemap-depth` | Maximum nested sitemap index depth | 32 |
| `--max-sitemap-urls` | Maximum URLs accepted from sitemap expansion | 50000 |
| `--max-sitemap-bytes` | Maximum decoded bytes per sitemap payload | 5000000 |
| `--max-media-file-bytes` | Maximum bytes for one downloaded media file | 10000000 |
| `--max-media-total-bytes` | Maximum total bytes for downloaded media files | 250000000 |
| `-i`, `--interactive` | Launch guided wizard | off |
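To show how limits such as `--max-sitemap-depth` and `--max-sitemap-urls` might bound the expansion of nested sitemap indexes, here is a minimal sketch. It is not Crawlboy's implementation: `expand_sitemap` and the injected `fetch` callable are hypothetical, and raising an error when a limit is exceeded (rather than truncating) is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def _local(tag: str) -> str:
    """Strip the XML namespace from a tag such as '{ns}loc'."""
    return tag.rsplit("}", 1)[-1]


def _locs(root: ET.Element) -> list[str]:
    """Collect <loc> values that sit directly inside each <sitemap>/<url> entry."""
    return [
        child.text.strip()
        for entry in root
        for child in entry
        if _local(child.tag) == "loc" and child.text
    ]


def expand_sitemap(
    url: str,
    fetch: Callable[[str], str],
    max_depth: int = 32,
    max_urls: int = 50000,
    depth: int = 0,
) -> list[str]:
    """Return page URLs from a sitemap, following nested <sitemapindex> files."""
    if depth > max_depth:
        raise ValueError(f"sitemap nesting deeper than {max_depth}: {url}")
    root = ET.fromstring(fetch(url))
    if _local(root.tag) == "sitemapindex":
        pages: list[str] = []
        for child_url in _locs(root):
            # Each child may only contribute what is left of the URL budget.
            remaining = max_urls - len(pages)
            pages.extend(expand_sitemap(child_url, fetch, max_depth, remaining, depth + 1))
    else:
        pages = _locs(root)
    if len(pages) > max_urls:
        raise ValueError(f"sitemap expansion exceeded {max_urls} URLs")
    return pages
```

Passing `fetch` in as a parameter keeps the sketch free of network code and easy to test with in-memory documents. Code that parses untrusted XML in practice may also want a hardened parser and a payload-size check before parsing.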
Docker
docker build -t crawlboy .
docker run --rm -v "$(pwd)/out:/out" crawlboy \
--site-url 'https://www.example.com' --out-dir /out
For LLM Agents
Instructions for AI coding agents (Claude Code, Cursor, Copilot, etc.) to install and use Crawlboy.
Installation
You are an autonomous coding agent. Execute the following installation steps exactly and sequentially in a Unix-like shell environment.
Constraints:
- Stop immediately if any command fails and report the error
- Do not skip steps
- Assume Python 3 and pip are available
- Use the project directory as the working directory after cloning
Steps to execute:
git clone https://github.com/aksharahegde/crawlboy.git
cd crawlboy
python -m venv .venv
source .venv/bin/activate
pip install -e .
crawl4ai-setup
After completion:
- Confirm success by checking that dependencies are installed
- Report final status (success/failure)
- If failure occurs, output the exact failing command and error message
Running a Scrape
You are an autonomous coding agent. Execute the following crawl task using the installed `crawlboy` tool in a Unix-like shell.
Constraints:
- Ensure the virtual environment is activated before running any command
- Stop immediately if any command fails and report the error
- Do not modify flags unless explicitly instructed
- Assume the tool is already installed in the current project
Task Options (choose ONE based on input):
Option A — Crawl using site URL (auto-discover sitemap):
source .venv/bin/activate
crawlboy --site-url 'https://www.example.com' --out-dir ./out
Option B — Crawl using direct sitemap URL:
source .venv/bin/activate
crawlboy --sitemap-url 'https://example.com/sitemap.xml' --out-dir ./out
Optional Flags (use only if specified in input):
--max-urls N
--save-html
--download-images
--delay N
--fail-fast
Execution Rules:
- Always activate the virtual environment first
- Append any specified flags to the crawlboy command
- Ensure output directory is created if it does not exist
After completion:
- Verify that the ./out directory exists
- Confirm presence of expected outputs:
- md/ (Markdown files)
- html/ (if --save-html used)
- media/ (if --download-images used)
- errors.jsonl (if any failures occurred)
- Report final status (success/failure)
- If failure occurs, output the exact failing command and error message
Verified Working
Crawlboy has been tested and verified to work with real websites. Here's a recent test run:
Command:
crawlboy --site-url 'https://aksharahegde.xyz' --out-dir ./test-output --max-urls 5
Test Results:
- Auto-discovery: Found sitemap from `robots.txt` ✓
- Pages crawled: 5/5 successful, 0 failures
- Total time: ~13 seconds
- Redirect handling: Automatically handled aksharahegde.xyz → www.aksharahegde.xyz
Output Generated:
test-output/md/
├── index.md (3.1 KB) — homepage content
├── blog.md (3.2 KB) — blog page
├── projects.md (5.3 KB) — projects page
├── resources.md (628 B) — resources page
└── shop.md (564 B) — shop page
All Markdown files contain properly formatted content with preserved links and page structure.
Contributing
Contributions are welcome! Please follow these guidelines:
Code Style
- Use Python 3.10+ compatible syntax
- Format code with Black: run `black .` before committing
- Lint with ruff: run `ruff check .`
- Follow PEP 8 conventions
Commit Messages
- Use present tense, imperative mood ("add feature", not "added feature")
- Be concise and descriptive (under 70 characters for the subject line)
- Reference issues if applicable (e.g., "Fix #123")
- Example:
Fix image path rewriting for nested URLs
Pull Requests
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit your changes with clear messages
- Test thoroughly before pushing
- Open a PR with a clear description of changes
Testing
- Test the interactive CLI locally: `crawlboy -i`
- Test with both `--site-url` and `--sitemap-url` modes
- Verify output structure matches documentation
- Test with `--download-images` and `--save-html` flags
- Run against a small sitemap first (use `--max-urls`)
Reporting Issues
Include:
- Steps to reproduce
- Python version and OS
- Full error output or stack trace
- Sample sitemap URL (if possible)
Security
Crawlboy applies secure defaults to reduce abuse risk from untrusted sitemap and page content:
- Deny-by-default network target checks for private, loopback, link-local, multicast, reserved, and unspecified IP ranges
- Sitemap depth, URL count, and payload-size limits
- Media download file-size and total-size limits
- URL redaction in logs and `errors.jsonl` output
- Output path containment under `--out-dir`
Use --allow-unsafe-network-targets only in trusted internal environments where private network crawling is intentional.
See SECURITY.md and docs/security/ for reporting and control details.
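Two of the controls listed above, the network target check and output path containment, can be sketched with the standard library. This is a simplified illustration, not Crawlboy's code: `is_unsafe_ip` and `is_contained` are hypothetical names, and the IP check covers literal addresses only. A real check would also resolve hostnames and test every resolved address.

```python
import ipaddress
from pathlib import Path


def is_unsafe_ip(address: str) -> bool:
    """Return True for addresses in the ranges the deny-by-default check names."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_contained(out_dir: str, candidate: str) -> bool:
    """Return True if candidate resolves to a location inside out_dir."""
    # resolve() normalises ".." segments and follows existing symlinks,
    # so a path cannot escape out_dir by traversal.
    root = Path(out_dir).resolve()
    return Path(candidate).resolve().is_relative_to(root)
```

For instance, `is_unsafe_ip("169.254.169.254")` is true because the address is link-local, and `is_contained("out", "out/../etc/passwd")` is false because the resolved path falls outside `out`.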