crawl2md
Crawl documentation websites and convert them to Markdown files.
crawl2md is a command-line tool that:
- Crawls documentation websites using breadth-first search
- Extracts the main content from each page
- Converts HTML to clean Markdown
- Adds optional Obsidian-compatible YAML frontmatter
- Mirrors the URL structure as a local directory tree
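The breadth-first order mentioned above can be illustrated with a small sketch. It walks an in-memory link graph instead of fetching pages, and `LINKS` and `bfs_order` are hypothetical names chosen for illustration, not crawl2md internals.

```python
from collections import deque

# Hypothetical link graph: page URL -> links found on that page.
LINKS = {
    "https://docs.example.com/": [
        "https://docs.example.com/intro/",
        "https://docs.example.com/guide/",
    ],
    "https://docs.example.com/intro/": ["https://docs.example.com/guide/"],
    "https://docs.example.com/guide/": [
        "https://docs.example.com/guide/api/",
        "https://docs.example.com/",
    ],
    "https://docs.example.com/guide/api/": [],
}


def bfs_order(start, links, max_pages=None):
    """Return URLs in breadth-first visiting order, each URL at most once."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        # Stop early once the page budget is used up.
        if max_pages is not None and len(order) >= max_pages:
            break
        url = queue.popleft()
        order.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Pages closer to the start URL are visited first, so a page budget such as `max_pages` tends to keep the top-level pages and drop the deepest ones.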
Installation
Using pip
pip install .
Using pipx (recommended for CLI tools)
pipx install .
Development install
make dev-install
# or
pip install -e .
Installing the man page
sudo make man-install
Quick Start
Crawl an entire documentation site:
crawl2md -s https://docs.example.com/
Crawl only a specific section:
crawl2md -s https://docs.example.com/tutorial/ -p /tutorial
Crawl with custom output directory and tags:
crawl2md -s https://docs.example.com/ -o my-docs -t docs reference
Crawl with plain output (disables default TUI):
crawl2md -s https://docs.example.com/ --no-tui
Crawl with deduplication to skip duplicate content:
crawl2md -s https://docs.example.com/ --dedupe
Usage
crawl2md [options]
Options:
-s, --start URL Starting URL to crawl (required)
-b, --base BASE_URL Base URL to constrain crawling
-o, --output DIR Output directory (default: docs-md)
-t, --tags TAG [TAG ...] Tags for YAML frontmatter
-p, --restrict-prefix PREFIX Only crawl paths starting with PREFIX
-e, --exclude-patterns PATTERN [...] Exclude URLs matching patterns
-d, --delay SECONDS Delay between requests (default: 0.3)
-m, --max-pages N Maximum pages to process
--no-frontmatter Disable YAML frontmatter
--user-agent STRING Custom User-Agent header
-v, --verbose Enable verbose logging
--no-tui Disable TUI, use plain output (TUI is default)
--dedupe Enable content deduplication
--scroll-lines N Lines to scroll per keypress in TUI (default: auto)
--max-log-lines N Maximum log buffer size in TUI (default: unlimited)
--version Show version and exit
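To show how the base URL, prefix, and exclusion options could combine, here is a minimal sketch. The function name `should_crawl` is hypothetical, and treating exclusion patterns as regular expressions is an assumption; the options list does not say which pattern syntax crawl2md uses.

```python
import re
from urllib.parse import urlparse


def should_crawl(url, base_url, restrict_prefix=None, exclude_patterns=()):
    """Decide whether a URL passes the base, prefix, and exclusion filters."""
    # Stay inside the base URL.
    if not url.startswith(base_url):
        return False
    # Only follow paths that start with the requested prefix, if any.
    path = urlparse(url).path
    if restrict_prefix and not path.startswith(restrict_prefix):
        return False
    # Assumption: each pattern is a regular expression searched in the full URL.
    return not any(re.search(pattern, url) for pattern in exclude_patterns)
```

For example, `should_crawl("https://docs.example.com/tutorial/a/", "https://docs.example.com/", "/tutorial")` accepts the URL, while the same call with a URL under `/api/` rejects it.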
Output Format
Directory Structure
URLs are mapped to local files:
| URL | File |
|---|---|
| https://example.com/ | docs-md/index.md |
| https://example.com/intro/ | docs-md/intro.md |
| https://example.com/guide/ | docs-md/guide/index.md |
| https://example.com/guide/api/ | docs-md/guide/api.md |
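The table does not spell out the mapping rule. One rule that reproduces its four rows is sketched below: a page with known child pages becomes `<dir>/index.md`, and any other page becomes `<name>.md`. This rule and the helper `url_to_file` are assumptions for illustration; crawl2md may decide differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_file(url, known_urls, output_dir="docs-md"):
    """Map a URL to a Markdown path under output_dir (hypothetical rule)."""
    path = urlparse(url).path.strip("/")
    if not path:
        # The site root maps to the top-level index file.
        return str(PurePosixPath(output_dir) / "index.md")
    prefix = "/" + path + "/"
    # Assumption: a page that has known child pages is stored as <dir>/index.md.
    has_children = any(
        urlparse(other).path.startswith(prefix) and urlparse(other).path != prefix
        for other in known_urls
    )
    if has_children:
        return str(PurePosixPath(output_dir) / path / "index.md")
    return str(PurePosixPath(output_dir) / (path + ".md"))
```

Passing the four URLs from the table as `known_urls` yields the four file paths in its right-hand column.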
YAML Frontmatter
By default, each file includes Obsidian-compatible frontmatter:
---
title: "Page Title"
source: https://example.com/page/
created: 2025-12-01
tags:
- docs
- tutorial
---
Use --no-frontmatter to disable this.
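A frontmatter block of the shape shown above could be produced roughly as follows. `build_frontmatter` is a hypothetical helper, and the quoting of the title is an assumption about how special characters might be handled.

```python
from datetime import date


def build_frontmatter(title, source, tags, created=None):
    """Render a YAML frontmatter block with title, source, created, and tags."""
    created = created or date.today()
    # Assumption: escape backslashes and double quotes inside the quoted title.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "---",
        f'title: "{safe_title}"',
        f"source: {source}",
        f"created: {created.isoformat()}",
    ]
    if tags:
        lines.append("tags:")
        lines.extend(f"- {tag}" for tag in tags)
    lines.append("---")
    return "\n".join(lines) + "\n"
```

The returned string is meant to be prepended to the converted Markdown body before the file is written.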
Interactive TUI Mode
The curses-based TUI provides real-time monitoring and control for long-running crawls.
TUI mode is enabled by default; use --no-tui for plain output.
TUI Features
Real-time Statistics:
- Pages processed, files saved, duplicates skipped, errors
- Current URL being crawled
- Queue size and elapsed time
- Current crawl speed (delay between requests)
Interactive Controls:
| Key | Action | Description |
|---|---|---|
| q | Quit | Stop crawl and exit |
| p | Pause/Resume | Pause or resume the crawl |
| h | Help | Toggle help overlay with all controls |
| m | Menu | Open mid-crawl configuration menu |
| c | Center | Re-center queue view on current item |
| u | URL Toggle | Switch between path-only and full URL display |
| ↑/↓ | Scroll | Scroll log window up/down by one line |
| PgUp/PgDn | Page Scroll | Scroll log window by half page |
| Home/End | Jump | Jump to oldest/newest logs |
| Esc | Close | Close help overlay or config menu |
Mouse Support:
- Scroll wheel to scroll log window
Adaptive Layout:
- Works on terminals as small as 1 line (graceful degradation)
- Automatically adjusts panel visibility based on terminal size
- Shows warning when terminal is too small for full view
- Handles terminal resize without crashes
Error Handling:
- Terminal always restored on exit (even on crashes)
- Crawler errors displayed in overlay panel
- Clean exit with 'q' even in error state
TUI Screenshot
[⠋] Pages: 1234 | Saved: 1180 | Dups: 54 | Errors: 0 | Queue: 23 | Elapsed: 05:32 | Speed: 0.5s
Current: ...example.com/docs/advanced/configuration#authentication
Fetching: https://example.com/docs/setup
Saved: docs-md/setup.md
Queued: https://example.com/docs/install
Fetching: https://example.com/docs/install
DUPLICATE body: https://example.com/docs/install → same as docs-md/setup.md
q:quit p:pause c:center u:url m:menu ↑/↓:scroll h:help
Troubleshooting TUI
Terminal Issues:
- If terminal appears broken after crash, run: reset
- Ensure terminal supports Unicode (for spinner animation)
- Minimum terminal size: 1 line (but 10+ lines recommended for full view)
Performance:
- TUI updates at ~10 Hz (every 100ms)
- No significant overhead on crawler performance
- Safe to use on long-running crawls (hours+)
Content Deduplication
Skip saving duplicate content to avoid redundant files:
crawl2md -s https://docs.example.com/ --dedupe
How it works:
- Computes SHA256 hash of markdown body (excluding frontmatter)
- First occurrence is saved normally
- Subsequent pages with identical content are skipped
- The duplicates counter is incremented in stats and logs
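The steps above can be sketched in a few lines. `strip_frontmatter` and `Deduplicator` are hypothetical names, and the frontmatter detection is a simplification that assumes the block starts on the first line and is closed by a `---` line.

```python
import hashlib


def strip_frontmatter(markdown):
    """Return the Markdown body without a leading YAML frontmatter block."""
    if markdown.startswith("---\n"):
        end = markdown.find("\n---\n", 3)
        if end != -1:
            return markdown[end + 5:]
    return markdown


class Deduplicator:
    """Track SHA256 digests of page bodies and count repeats."""

    def __init__(self):
        self.seen = {}  # digest -> path of the first file with that body
        self.duplicates = 0

    def check(self, markdown, path):
        """Return the path of an earlier identical body, or None if new."""
        body = strip_frontmatter(markdown)
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        first = self.seen.get(digest)
        if first is not None:
            self.duplicates += 1
            return first
        self.seen[digest] = path
        return None
```

Hashing only the body means two pages with different titles or source URLs but identical content still count as duplicates.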
Use cases:
- Documentation sites with mirrors/aliases
- Sites with "print" versions of pages
- Multi-language sites with untranslated pages
Configuration
Configuration can be provided via:
- CLI arguments (highest priority)
- Environment variables
- Config file (crawl2md.toml)
- Built-in defaults (lowest priority)
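The priority order above amounts to a layered lookup. The sketch below uses `collections.ChainMap` for that; `resolve_config` and the dictionaries passed to it are hypothetical, the two default values are taken from the options list, and treating `None` as "not set" is an assumption.

```python
from collections import ChainMap

# Defaults taken from the options list; other settings are omitted here.
DEFAULTS = {"output": "docs-md", "delay": 0.3}


def resolve_config(cli, env, file_cfg, defaults=DEFAULTS):
    """Merge settings so that CLI beats environment beats file beats defaults."""
    # Assumption: a value of None means the source did not set that key.
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (cli, env, file_cfg)
    ]
    # ChainMap looks keys up left to right, so earlier layers win.
    return dict(ChainMap(*layers, defaults))
```

With `cli={"output": "cli-out", "delay": None}`, `env={"delay": 1.0}`, and `file_cfg={"delay": 0.5}`, the output directory comes from the CLI and the delay from the environment.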
Config File
Create crawl2md.toml in your working directory or ~/.config/crawl2md/:
[crawl2md]
start_url = "https://docs.example.com/"
output = "my-docs"
tags = ["docs", "reference"]
delay = 0.5
verbose = true
no_tui = false
dedupe = true
Environment Variables
| Variable | Description |
|---|---|
| CRAWL2MD_START_URL | Starting URL |
| CRAWL2MD_BASE_URL | Base URL constraint |
| CRAWL2MD_OUTPUT | Output directory |
| CRAWL2MD_TAGS | Comma-separated tags |
| CRAWL2MD_RESTRICT_PREFIX | Path prefix filter |
| CRAWL2MD_EXCLUDE_PATTERNS | Comma-separated URL exclusion patterns |
| CRAWL2MD_DELAY | Request delay (seconds) |
| CRAWL2MD_MAX_PAGES | Max pages to process |
| CRAWL2MD_NO_FRONTMATTER | Disable frontmatter ("1" or "true") |
| CRAWL2MD_USER_AGENT | Custom User-Agent |
| CRAWL2MD_VERBOSE | Enable verbose mode ("1" or "true") |
| CRAWL2MD_NO_TUI | Disable TUI ("1" or "true") |
| CRAWL2MD_DEDUPE | Enable deduplication ("1" or "true") |
| CRAWL2MD_SCROLL_LINES | Lines to scroll per keypress in TUI |
| CRAWL2MD_MAX_LOG_LINES | Maximum log buffer size in TUI |
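Since environment variables are always strings, the flag and list values in the table need converting. A minimal sketch of that conversion for a few of the variables follows; `parse_env` is a hypothetical helper, the `max_pages` key name is made up for illustration, and accepting "TRUE" in any letter case is an assumption beyond the documented "1" or "true".

```python
def parse_env(environ):
    """Convert selected CRAWL2MD_* strings into typed settings."""

    def flag(name):
        # Documented truthy values are "1" and "true"; case folding is assumed.
        return environ.get(name, "").strip().lower() in ("1", "true")

    def csv(name):
        # Split on commas and drop empty items and surrounding whitespace.
        return [item.strip() for item in environ.get(name, "").split(",") if item.strip()]

    settings = {
        "tags": csv("CRAWL2MD_TAGS"),
        "verbose": flag("CRAWL2MD_VERBOSE"),
        "no_tui": flag("CRAWL2MD_NO_TUI"),
        "dedupe": flag("CRAWL2MD_DEDUPE"),
    }
    if "CRAWL2MD_DELAY" in environ:
        settings["delay"] = float(environ["CRAWL2MD_DELAY"])
    if "CRAWL2MD_MAX_PAGES" in environ:
        settings["max_pages"] = int(environ["CRAWL2MD_MAX_PAGES"])
    return settings
```

In practice the function would be called with `os.environ`; passing a plain dictionary keeps the sketch easy to try out.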
Development
Install dev dependencies:
pip install -e ".[dev]"
Run checks:
- make format - Format code with Black
- make lint - Lint with Ruff
- make typecheck - Type check with mypy
- make test - Run tests with pytest (tests are in tests/)
- make check - Run all checks (format check, lint, typecheck)
Requirements
- Python 3.9+
- requests
- beautifulsoup4
- markdownify
- tomli (Python < 3.11 only)
Limitations
- Designed for static HTML documentation sites
- Does not execute JavaScript (no headless browser)
- Does not download images or rewrite internal links
License
MIT