# web2md
A powerful, intelligent CLI tool to crawl dynamic and static websites with full JavaScript rendering support and convert them to clean, well-formatted Markdown files. Perfect for archiving documentation, creating offline knowledge bases, and preserving web content.
## ✨ Key Features
- 🚀 Dynamic Site Support: Full JavaScript rendering via Playwright (Vue/React/Angular/Next.js)
- 🎯 Smart Content Extraction: Automatically identifies and extracts core content, removing navigation, ads, and sidebars
- 🔗 Recursive Crawling: Intelligently crawls subpages with configurable depth and count limits
- 🖼️ Media Downloads: Optional image and video downloading with lazy-loading support
- 📐 Base URL Intelligence: Uses the browser's `document.baseURI` for accurate relative path resolution
- 🔄 Local Link Conversion: Automatically converts HTML links to local Markdown relative paths
- 🧹 Clean Output: Preserves tables, code blocks, images, links, and heading hierarchies
- 🔒 SSL Flexibility: Handles sites with certificate issues gracefully
- 🌍 Cross-Platform: Works on Windows, macOS, and Linux (Python 3.8+)
- 📋 Universal Compatibility: Generated Markdown works with Typora, Obsidian, VS Code, and more
## 📦 Installation

### Option 1: Install from PyPI (Recommended)

```bash
pip3 install web2md
```

### Option 2: Install from Source (For Development)

```bash
git clone https://github.com/floatinghotpot/web2md.git
cd web2md
python3 -m pip install -e .
```

### Required: Install Playwright Browser

```bash
# Install Chromium driver (required for JavaScript rendering)
python3 -m playwright install chromium

# Linux only: Install system dependencies
python3 -m playwright install-deps chromium
```
## 🚀 Quick Start

### Basic Usage

```bash
# Crawl a single page (auto-generated save directory)
web2md https://docs.python.org/3/tutorial/

# Specify a custom save directory
web2md https://docs.python.org/3/tutorial/ ./python-docs

# Crawl with images
web2md https://example.com/docs --picture

# Limit crawl depth and page count
web2md https://example.com/docs --depth 2 --count 10

# Crawl with images and videos
web2md https://example.com/docs --picture --video --depth 3
```

### Show Help

```bash
web2md -h
```
## 📖 Usage

### Command Syntax

```bash
web2md [URL] [SAVE_DIR] [OPTIONS]
```

### Arguments

| Argument | Required | Description |
|---|---|---|
| `web_url` | ✅ Yes | Target webpage URL (must start with `http`/`https`) |
| `save_folder` | ❌ No | Local save directory (auto-generated from the URL if omitted) |

### Options

| Option | Default | Description |
|---|---|---|
| `--depth N` | `5` | Maximum crawl depth relative to the base URL |
| `--count N` | `999` | Maximum number of pages to crawl (`0` = unlimited) |
| `--picture` | `False` | Download and save images to a local `images/` directory |
| `--video` | `False` | Download and save videos to a local `videos/` directory |
| `-h, --help` | - | Show help message and exit |
### Examples

#### 1. Unlimited Crawl with Depth Limit

```bash
web2md https://company.com/docs/home company-docs --depth 2
```

- Crawls all pages within 2 levels of `/docs/`
- Saves to `./company-docs/`

#### 2. Limited Page Count

```bash
web2md https://company.com/docs/home company-docs --depth 2 --count 5
```

- Stops after crawling 5 pages
- Useful for testing or sampling large sites

#### 3. Crawl with Images

```bash
web2md https://company.com/docs/home --picture --count 3
```

- Downloads images to an `images/` subdirectory
- Converts image URLs to local relative paths in the Markdown

#### 4. Auto-Generated Save Directory

```bash
web2md https://company.com/docs/home --depth 1 --count 10
```

- Auto-creates the directory `company_com_docs/`
## 🎯 How It Works

### 1. Base URL Calculation

The tool automatically determines a base URL from your target URL:

- Target: `https://company.com/docs/home` → Base: `https://company.com/docs/`
- All crawling is scoped to pages under this base URL
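The derivation above can be sketched in a few lines of stdlib Python. This is a minimal illustration of the rule, not the tool's actual implementation; `derive_base_url` is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath

def derive_base_url(target_url: str) -> str:
    """Drop the last path segment so crawling is scoped to its parent 'directory'."""
    parts = urlsplit(target_url)
    # posixpath.dirname("/docs/home") -> "/docs"
    parent = posixpath.dirname(parts.path)
    if not parent.endswith("/"):
        parent += "/"
    # Keep scheme and host, discard query and fragment
    return urlunsplit((parts.scheme, parts.netloc, parent, "", ""))

print(derive_base_url("https://company.com/docs/home"))
# https://company.com/docs/
```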
### 2. Intelligent Path Resolution

Uses the browser's `document.baseURI` to correctly resolve relative URLs:

- Handles `<base>` tags in HTML
- Respects redirects and trailing slashes
- Resolves lazy-loaded images with `data-src`, `srcset`, etc.
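The resolution the browser performs can be reproduced with `urllib.parse.urljoin`. A sketch of the general technique, assuming the effective base URI has already been read from the page (e.g. via `page.evaluate("document.baseURI")`):

```python
from urllib.parse import urljoin

# Assumed to come from the rendered page, not hard-coded in practice
base_uri = "https://company.com/docs/"

print(urljoin(base_uri, "images/logo.png"))  # https://company.com/docs/images/logo.png
print(urljoin(base_uri, "../pricing"))       # https://company.com/pricing
print(urljoin(base_uri, "/about"))           # https://company.com/about
```

Because resolution starts from `document.baseURI` rather than the URL you typed, `<base>` tags and server-side redirects are handled for free.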
### 3. Smart Content Extraction

Automatically identifies core content using priority selectors:

1. `<main>` tag
2. `.article-content` or `.article_content`
3. `#main-content`
4. `.content`
5. `<article>` tag
6. Fallback to `<body>` (with cleanup)
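A first-match-wins loop over that priority list can be sketched with BeautifulSoup (which the tool depends on). The selector list and helper name here are illustrative, not the tool's exact code:

```python
from bs4 import BeautifulSoup

# Hypothetical flat CSS-selector form of the priority order above
CONTENT_SELECTORS = ["main", ".article-content", ".article_content",
                     "#main-content", ".content", "article"]

def extract_core_content(html: str):
    """Return the first matching content node, falling back to <body>."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in CONTENT_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            return node
    return soup.body  # fallback; cleanup of nav/ads happens elsewhere

html = "<html><body><nav>menu</nav><main><h1>Title</h1></main></body></html>"
print(extract_core_content(html).name)  # main
```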
### 4. Media Handling

When `--picture` or `--video` is enabled, the tool:

- Downloads media files into `images/` or `videos/` subdirectories
- Generates unique filenames with an MD5 hash to prevent duplicates
- Converts media URLs to local relative paths in the Markdown
- Supports lazy-loading attributes: `data-src`, `data-original`, `srcset`
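The MD5-based deduplication can be sketched as follows. The exact naming scheme is an assumption; the point is that hashing the full URL keeps two different `logo.png` files from different hosts from colliding:

```python
import hashlib
import posixpath
from urllib.parse import urlsplit

def media_filename(url: str) -> str:
    """Derive a stable, collision-resistant local filename for a media URL (illustrative)."""
    path = urlsplit(url).path
    stem, ext = posixpath.splitext(posixpath.basename(path))
    # Hash the full URL so identical basenames from different hosts stay distinct
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()[:12]
    return f"{stem or 'media'}_{digest}{ext.lower() or '.bin'}"

print(media_filename("https://cdn.example.com/img/logo.png"))
```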
### 5. Filename Generation

Markdown filenames are generated from URLs:

- Remove the base URL prefix
- Replace `/` with `_`
- Filter out illegal characters
- Example: `https://company.com/docs/api/auth` → `api_auth.md`
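The three rules above can be sketched directly; `page_filename` is a hypothetical helper, and the illegal-character set is an assumption based on common filesystem restrictions:

```python
import re

def page_filename(url: str, base_url: str) -> str:
    """URL -> .md filename per the rules above (illustrative sketch)."""
    rel = url[len(base_url):] if url.startswith(base_url) else url
    rel = rel.strip("/").replace("/", "_")
    rel = re.sub(r'[\\:*?"<>|]', "", rel)  # drop characters illegal on common filesystems
    return (rel or "index") + ".md"

print(page_filename("https://company.com/docs/api/auth", "https://company.com/docs/"))
# api_auth.md
```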
## ⚙️ Configuration

### Built-in Settings (in `web2md/cli.py`)

#### Playwright Configuration

```python
PLAYWRIGHT_CONFIG = {
    "headless": False,               # Set to True for background crawling
    "timeout": 60000,                # Page load timeout (ms)
    "wait_for_load": "networkidle",  # Wait strategy
    "sleep_after_load": 2,           # Additional wait time (seconds)
    "user_agent": "Mozilla/5.0..."   # Custom user agent
}
```

#### Media Configuration

```python
MEDIA_CONFIG = {
    "timeout": 30000,        # Media download timeout (ms)
    "image_dir": "images",   # Image save subdirectory
    "video_dir": "videos",   # Video save subdirectory
    "allowed_img_ext": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp"],
    "allowed_vid_ext": [".mp4", ".avi", ".mov", ".webm", ".flv", ".mkv"]
}
```

#### Content Filtering

```python
REMOVE_TAGS = ["nav", "header", "footer", "aside", "script", "style", "iframe", "sidebar"]
CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "article-content"}),
    ("article", {})
]
```

#### Crawl Defaults

```python
DEFAULT_CRAWL_CONFIG = {
    "max_depth": 5,      # Default max depth
    "max_count": 999,    # Default max pages
    "allowed_schemes": ["http", "https"],
    "exclude_patterns": [r"\.pdf$", r"\.zip$", r"\.exe$"]
}
```
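A plausible way these `exclude_patterns` get applied is a simple regex filter over candidate links before they are queued. How the tool applies them internally is an assumption; this sketch only shows the technique:

```python
import re

# Mirrors the exclude_patterns entry in DEFAULT_CRAWL_CONFIG above
EXCLUDE_PATTERNS = [r"\.pdf$", r"\.zip$", r"\.exe$"]

def is_crawlable(url: str) -> bool:
    """Skip URLs matching any exclude pattern (hypothetical helper)."""
    return not any(re.search(pattern, url) for pattern in EXCLUDE_PATTERNS)

print(is_crawlable("https://example.com/docs/guide"))        # True
print(is_crawlable("https://example.com/files/manual.pdf"))  # False
```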
## 🔧 Advanced Usage

### Debug Mode (Show Browser)

Edit `web2md/cli.py` and set:

```python
PLAYWRIGHT_CONFIG = {
    "headless": False,  # Shows the browser window
    ...
}
```

### Custom Content Selectors

Add site-specific selectors to `CORE_CONTENT_SELECTORS`:

```python
CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "documentation-content"}),  # Custom selector
    ("article", {})
]
```
### Anti-Bot Detection

Install and use `playwright-stealth`:

```bash
pip3 install playwright-stealth
```

Add it to `get_dynamic_html()` in `web2md/cli.py`:

```python
from playwright_stealth import stealth_sync

page = context.new_page()
stealth_sync(page)  # Add this line
page.goto(url, ...)
```

### Authentication

Add login logic in `get_dynamic_html()` before the main `page.goto()` call:

```python
import time

page.goto("https://example.com/login")
page.fill("#username", "your-username")
page.fill("#password", "your-password")
page.click("#login-button")
time.sleep(2)  # Give the login redirect time to complete
```
## 🐛 Troubleshooting

### SSL Certificate Errors

The tool automatically disables SSL verification for downloads. If you still encounter issues, check your network or firewall settings.

### Timeout Errors

Increase the timeout in `PLAYWRIGHT_CONFIG`:

```python
"timeout": 120000,  # 2 minutes
```

### Missing Content

- Check whether the content lives in `<main>` or another common content tag
- Add custom selectors to `CORE_CONTENT_SELECTORS`
- Run with `headless: False` to debug visually

### Image Download Failures

- Verify the image URLs are accessible
- Check whether the images require authentication
- Some CDNs may block automated downloads
## 📋 Dependencies

Automatically installed via pip:

- `playwright` - Browser automation and JS rendering
- `beautifulsoup4` - HTML parsing and manipulation
- `lxml` - Fast XML/HTML parser
- `markdownify` - HTML-to-Markdown conversion
- `urllib3` - HTTP client utilities
## 🤝 Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests (if available)
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

### Development Setup

```bash
git clone https://github.com/floatinghotpot/web2md.git
cd web2md
python3 -m pip install -e .
python3 -m playwright install chromium
```
## 📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Playwright for powerful browser automation
- BeautifulSoup for HTML parsing
- markdownify for clean Markdown conversion
Made with ❤️ for developers, researchers, and documentation enthusiasts.
If you find this tool useful, please consider giving it a ⭐ on GitHub!