web2md

License: MIT · Python 3.8+

A powerful, intelligent CLI tool to crawl dynamic and static websites with full JavaScript rendering support and convert them to clean, well-formatted Markdown files. Perfect for archiving documentation, creating offline knowledge bases, and preserving web content.

✨ Key Features

  • 🚀 Dynamic Site Support: Full JavaScript rendering via Playwright (Vue/React/Angular/Next.js)
  • 🎯 Smart Content Extraction: Automatically identifies and extracts core content, removing navigation, ads, and sidebars
  • 🔗 Recursive Crawling: Intelligently crawls subpages with configurable depth and count limits
  • 🖼️ Media Downloads: Optional image and video downloading with lazy-loading support
  • 📐 Base URL Intelligence: Uses browser's document.baseURI for accurate relative path resolution
  • 🔄 Local Link Conversion: Automatically converts HTML links to local Markdown relative paths
  • 🧹 Clean Output: Preserves tables, code blocks, images, links, and heading hierarchies
  • 🔒 SSL Flexibility: Handles sites with certificate issues gracefully
  • 🌍 Cross-Platform: Works on Windows, macOS, and Linux (Python 3.8+)
  • 📋 Universal Compatibility: Generated Markdown works with Typora, Obsidian, VS Code, and more

📦 Installation

Option 1: Install from PyPI (Recommended)

pip3 install web2md

Option 2: Install from Source (For Development)

git clone https://github.com/floatinghotpot/web2md.git
cd web2md
python3 -m pip install -e .

Required: Install Playwright Browser

# Install Chromium driver (required for JavaScript rendering)
python3 -m playwright install chromium

# Linux only: Install system dependencies
python3 -m playwright install-deps chromium

🚀 Quick Start

Basic Usage

# Crawl a single page (auto-generated save directory)
web2md https://docs.python.org/3/tutorial/

# Specify custom save directory
web2md https://docs.python.org/3/tutorial/ ./python-docs

# Crawl with images
web2md https://example.com/docs --picture

# Limit crawl depth and count
web2md https://example.com/docs --depth 2 --count 10

# Crawl with images and videos
web2md https://example.com/docs --picture --video --depth 3

Show Help

web2md -h

📖 Usage

Command Syntax

web2md URL [SAVE_DIR] [OPTIONS]

Arguments

Argument      Required   Description
web_url       ✅ Yes     Target webpage URL (must start with http/https)
save_folder   ❌ No      Local save directory (auto-generated from URL if omitted)

Options

Option        Default   Description
--depth N     5         Maximum crawl depth relative to the base URL
--count N     999       Maximum number of pages to crawl (0 = unlimited)
--picture     False     Download and save images to a local images/ directory
--video       False     Download and save videos to a local videos/ directory
-h, --help    -         Show help message and exit

Examples

1. Unlimited Crawl with Depth Limit

web2md https://company.com/docs/home company-docs --depth 2
  • Crawls all pages within 2 levels of /docs/
  • Saves to ./company-docs/
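
The depth scoping in this example can be sketched with a small helper (relative_depth is a hypothetical name for illustration, not part of web2md's API):

```python
from urllib.parse import urlsplit


def relative_depth(url: str, base_url: str) -> int:
    """Count path segments below the crawl base URL.

    Illustrative sketch: pages outside the base URL return -1
    and would never be crawled; --depth N keeps pages whose
    depth is at most N.
    """
    if not url.startswith(base_url):
        return -1
    rel = url[len(base_url):].strip("/")
    return 0 if not rel else rel.count("/") + 1
```

With --depth 2, a page like /docs/a/b (depth 2) is crawled, while /docs/a/b/c (depth 3) would be skipped.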

2. Limited Page Count

web2md https://company.com/docs/home company-docs --depth 2 --count 5
  • Stops after crawling 5 pages
  • Useful for testing or sampling large sites

3. Crawl with Images

web2md https://company.com/docs/home --picture --count 3
  • Downloads images to images/ subdirectory
  • Converts image URLs to local relative paths in Markdown

4. Auto-Generated Save Directory

web2md https://company.com/docs/home --depth 1 --count 10
  • Auto-creates directory: company_com_docs/

🎯 How It Works

1. Base URL Calculation

The tool automatically determines a base URL from your target URL:

  • Target: https://company.com/docs/home → Base: https://company.com/docs/
  • All crawling is scoped to pages under this base URL
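
A minimal sketch of this derivation with the standard library (derive_base_url is an illustrative name, not the tool's actual function):

```python
from urllib.parse import urlsplit


def derive_base_url(target_url: str) -> str:
    """Drop the last path segment to get the crawl scope.

    The final component ('home') is treated as a page; everything
    up to and including the enclosing directory becomes the base.
    """
    parts = urlsplit(target_url)
    path = parts.path
    # Keep everything up to and including the last '/'
    base_path = path[: path.rfind("/") + 1] if "/" in path else "/"
    return f"{parts.scheme}://{parts.netloc}{base_path}"
```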

2. Intelligent Path Resolution

Uses the browser's document.baseURI to correctly resolve relative URLs:

  • Handles <base> tags in HTML
  • Respects redirects and trailing slashes
  • Resolves lazy-loaded images with data-src, srcset, etc.
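
The resolution step can be reproduced with urllib.parse.urljoin; here base_uri is a stand-in value, whereas in the tool the base comes from the browser-evaluated document.baseURI (which already reflects any <base> tag, redirects, and trailing slashes):

```python
from urllib.parse import urljoin


def resolve(base_uri: str, raw_src: str) -> str:
    """Resolve a possibly relative src/href against the page's base URI.

    urljoin handles '../' traversal, root-relative paths, and
    already-absolute URLs uniformly.
    """
    return urljoin(base_uri, raw_src.strip())
```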

3. Smart Content Extraction

Automatically identifies core content using priority selectors:

  1. <main> tag
  2. .article-content or .article_content
  3. #main-content
  4. .content
  5. <article> tag
  6. Fallback to <body> (with cleanup)
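
A sketch of this fallback chain using BeautifulSoup (the selector list and extract_core below are illustrative; the tool's actual implementation lives in web2md/cli.py and may differ):

```python
from bs4 import BeautifulSoup

# Priority order mirrors the list above; the first match wins.
CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "article-content"}),
    ("div", {"class_": "article_content"}),
    ("div", {"id": "main-content"}),
    ("div", {"class_": "content"}),
    ("article", {}),
]


def extract_core(html: str):
    """Return the first matching core-content element, else <body>."""
    soup = BeautifulSoup(html, "html.parser")
    for name, attrs in CORE_CONTENT_SELECTORS:
        node = soup.find(name, **attrs)
        if node is not None:
            return node
    return soup.body or soup
```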

4. Media Handling

When --picture or --video is enabled:

  • Downloads media files to images/ or videos/ subdirectories
  • Generates unique filenames with MD5 hash to prevent duplicates
  • Converts URLs to local relative paths in Markdown
  • Supports lazy-loading attributes: data-src, data-original, srcset
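
The hash-based naming could look like this sketch (media_filename is a hypothetical helper; the actual digest length and scheme in web2md may differ):

```python
import hashlib
import os
from urllib.parse import urlsplit


def media_filename(media_url: str) -> str:
    """Build a collision-resistant local filename for a media URL.

    Combining an MD5 digest of the full URL with the original
    extension means two different URLs that both end in 'logo.png'
    cannot overwrite each other.
    """
    ext = os.path.splitext(urlsplit(media_url).path)[1] or ".bin"
    digest = hashlib.md5(media_url.encode("utf-8")).hexdigest()[:12]
    return f"{digest}{ext}"
```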

5. Filename Generation

MD filenames are generated from URLs:

  • Remove base URL prefix
  • Replace / with _
  • Filter illegal characters
  • Example: https://company.com/docs/api/auth → api_auth.md
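
These rules can be sketched as follows (page_filename is an illustrative helper name, not web2md's actual function):

```python
import re


def page_filename(page_url: str, base_url: str) -> str:
    """Derive a Markdown filename from a page URL.

    Strips the base URL prefix, replaces '/' with '_', and
    filters characters that are illegal in filenames.
    """
    rel = page_url[len(base_url):] if page_url.startswith(base_url) else page_url
    rel = rel.strip("/").replace("/", "_")
    rel = re.sub(r'[\\:*?"<>|]', "_", rel)  # characters illegal on Windows
    return (rel or "index") + ".md"
```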

⚙️ Configuration

Built-in Settings (in web2md/cli.py)

Playwright Configuration

PLAYWRIGHT_CONFIG = {
    "headless": False,           # Set to True for background crawling
    "timeout": 60000,            # Page load timeout (ms)
    "wait_for_load": "networkidle",  # Wait strategy
    "sleep_after_load": 2,       # Additional wait time (seconds)
    "user_agent": "Mozilla/5.0..." # Custom user agent
}

Media Configuration

MEDIA_CONFIG = {
    "timeout": 30000,            # Media download timeout (ms)
    "image_dir": "images",       # Image save subdirectory
    "video_dir": "videos",       # Video save subdirectory
    "allowed_img_ext": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp"],
    "allowed_vid_ext": [".mp4", ".avi", ".mov", ".webm", ".flv", ".mkv"]
}

Content Filtering

REMOVE_TAGS = ["nav", "header", "footer", "aside", "script", "style", "iframe", "sidebar"]

CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "article-content"}),
    ("article", {})
]

Crawl Defaults

DEFAULT_CRAWL_CONFIG = {
    "max_depth": 5,              # Default max depth
    "max_count": 999,            # Default max pages
    "allowed_schemes": ["http", "https"],
    "exclude_patterns": [r"\.pdf$", r"\.zip$", r"\.exe$"]
}
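
A sketch of how these defaults would filter candidate URLs during a crawl (should_crawl is a hypothetical helper mirroring the config above):

```python
import re
from urllib.parse import urlsplit

ALLOWED_SCHEMES = ["http", "https"]
EXCLUDE_PATTERNS = [r"\.pdf$", r"\.zip$", r"\.exe$"]


def should_crawl(url: str) -> bool:
    """Reject non-HTTP(S) schemes and excluded file types."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    return not any(re.search(p, parts.path) for p in EXCLUDE_PATTERNS)
```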

🔧 Advanced Usage

Debug Mode (Show Browser)

Edit web2md/cli.py and set:

PLAYWRIGHT_CONFIG = {
    "headless": False,  # Shows browser window
    ...
}

Custom Content Selectors

Add site-specific selectors to CORE_CONTENT_SELECTORS:

CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "documentation-content"}),  # Custom selector
    ("article", {})
]

Anti-Bot Detection

Install and use playwright-stealth:

pip3 install playwright-stealth

Add to get_dynamic_html() in web2md/cli.py:

from playwright_stealth import stealth_sync

page = context.new_page()
stealth_sync(page)  # Add this line
page.goto(url, ...)

Authentication

Add login logic in get_dynamic_html() before page.goto():

page.goto("https://example.com/login")
page.fill("#username", "your-username")
page.fill("#password", "your-password")
page.click("#login-button")
time.sleep(2)

🐛 Troubleshooting

SSL Certificate Errors

The tool automatically disables SSL verification for downloads. If you encounter issues, check your network/firewall settings.

Timeout Errors

Increase timeout in PLAYWRIGHT_CONFIG:

"timeout": 120000,  # 2 minutes

Missing Content

  1. Check if content is in <main> or common content tags
  2. Add custom selectors to CORE_CONTENT_SELECTORS
  3. Run with headless: False to debug visually

Image Download Failures

  • Verify image URLs are accessible
  • Check if images require authentication
  • Some CDNs may block automated downloads

📋 Dependencies

Automatically installed via pip:

  • playwright - Browser automation and JS rendering
  • beautifulsoup4 - HTML parsing and manipulation
  • lxml - Fast XML/HTML parser
  • markdownify - HTML to Markdown conversion
  • urllib3 - HTTP client utilities

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests (if available)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Development Setup

git clone https://github.com/floatinghotpot/web2md.git
cd web2md
python3 -m pip install -e .
python3 -m playwright install chromium

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments


Made with ❤️ for developers, researchers, and documentation enthusiasts.

If you find this tool useful, please consider giving it a ⭐ on GitHub!
