web2md

License: MIT · Python 3.8+

A powerful, intelligent CLI tool to crawl dynamic and static websites with full JavaScript rendering support and convert them to clean, well-formatted Markdown files. Perfect for archiving documentation, creating offline knowledge bases, and preserving web content.

✨ Key Features

  • 🚀 Dynamic Site Support: Full JavaScript rendering via Playwright (Vue/React/Angular/Next.js)
  • 🎯 Smart Content Extraction: Automatically identifies and extracts core content, removing navigation, ads, and sidebars
  • 🔗 Recursive Crawling: Intelligently crawls subpages with configurable depth and count limits
  • 🖼️ Media Downloads: Optional image and video downloading with lazy-loading support
  • 📐 Base URL Intelligence: Uses browser's document.baseURI for accurate relative path resolution
  • 🔄 Local Link Conversion: Automatically converts HTML links to local Markdown relative paths
  • 🧹 Clean Output: Preserves tables, code blocks, images, links, and heading hierarchies
  • 🔒 SSL Flexibility: Handles sites with certificate issues gracefully
  • 🌍 Cross-Platform: Works on Windows, macOS, and Linux (Python 3.8+)
  • 📋 Universal Compatibility: Generated Markdown works with Typora, Obsidian, VS Code, and more

📦 Installation

Option 1: Install from PyPI (Recommended)

pip3 install web2md

Option 2: Install from Source (For Development)

git clone https://github.com/floatinghotpot/web2md.git
cd web2md
python3 -m pip install -e .

Required: Install Playwright Browser

# Install Chromium driver (required for JavaScript rendering)
python3 -m playwright install chromium

# Linux only: Install system dependencies
python3 -m playwright install-deps chromium

🚀 Quick Start

Basic Usage

# Crawl a single page (auto-generated save directory)
web2md https://docs.python.org/3/tutorial/

# Specify custom save directory
web2md https://docs.python.org/3/tutorial/ ./python-docs

# Crawl with images
web2md https://example.com/docs --picture

# Limit crawl depth and count
web2md https://example.com/docs --depth 2 --count 10

# Crawl with images and videos
web2md https://example.com/docs --picture --video --depth 3

Show Help

web2md -h

📖 Usage

Command Syntax

web2md URL [SAVE_DIR] [OPTIONS]

Arguments

Argument      Required   Description
web_url       ✅ Yes     Target webpage URL (must start with http/https)
save_folder   ❌ No      Local save directory (auto-generated from URL if omitted)

Options

Option        Default   Description
--depth N     5         Maximum crawl depth relative to the base URL
--count N     999       Maximum number of pages to crawl (0 = unlimited)
--picture     False     Download and save images to a local images/ directory
--video       False     Download and save videos to a local videos/ directory
-h, --help    -         Show help message and exit

Examples

1. Unlimited Crawl with Depth Limit

web2md https://company.com/docs/home company-docs --depth 2
  • Crawls all pages within 2 levels of /docs/
  • Saves to ./company-docs/
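
The depth scoping in this example can be sketched with a small helper (relative_depth is a hypothetical name for illustration, not part of web2md's API):

```python
from urllib.parse import urlsplit


def relative_depth(url: str, base_url: str) -> int:
    """Count path segments below the crawl base URL.

    Illustrative sketch: pages outside the base URL return -1
    and would never be crawled; --depth N keeps pages whose
    depth is at most N.
    """
    if not url.startswith(base_url):
        return -1
    rel = url[len(base_url):].strip("/")
    return 0 if not rel else rel.count("/") + 1
```

With --depth 2, a page like /docs/a/b (depth 2) is crawled, while /docs/a/b/c (depth 3) would be skipped.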

2. Limited Page Count

web2md https://company.com/docs/home company-docs --depth 2 --count 5
  • Stops after crawling 5 pages
  • Useful for testing or sampling large sites

3. Crawl with Images

web2md https://company.com/docs/home --picture --count 3
  • Downloads images to images/ subdirectory
  • Converts image URLs to local relative paths in Markdown

4. Auto-Generated Save Directory

web2md https://company.com/docs/home --depth 1 --count 10
  • Auto-creates directory: company_com_docs/

🎯 How It Works

1. Base URL Calculation

The tool automatically determines a base URL from your target URL:

  • Target: https://company.com/docs/home → Base: https://company.com/docs/
  • All crawling is scoped to pages under this base URL
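
A minimal sketch of this derivation with the standard library (derive_base_url is an illustrative name, not the tool's actual function):

```python
from urllib.parse import urlsplit


def derive_base_url(target_url: str) -> str:
    """Drop the last path segment to get the crawl scope.

    The final component ('home') is treated as a page; everything
    up to and including the enclosing directory becomes the base.
    """
    parts = urlsplit(target_url)
    path = parts.path
    # Keep everything up to and including the last '/'
    base_path = path[: path.rfind("/") + 1] if "/" in path else "/"
    return f"{parts.scheme}://{parts.netloc}{base_path}"
```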

2. Intelligent Path Resolution

Uses the browser's document.baseURI to correctly resolve relative URLs:

  • Handles <base> tags in HTML
  • Respects redirects and trailing slashes
  • Resolves lazy-loaded images with data-src, srcset, etc.
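
The resolution step can be reproduced with urllib.parse.urljoin; here base_uri is a stand-in value, whereas in the tool the base comes from the browser-evaluated document.baseURI (which already reflects any <base> tag, redirects, and trailing slashes):

```python
from urllib.parse import urljoin


def resolve(base_uri: str, raw_src: str) -> str:
    """Resolve a possibly relative src/href against the page's base URI.

    urljoin handles '../' traversal, root-relative paths, and
    already-absolute URLs uniformly.
    """
    return urljoin(base_uri, raw_src.strip())
```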

3. Smart Content Extraction

Automatically identifies core content using priority selectors:

  1. <main> tag
  2. .article-content or .article_content
  3. #main-content
  4. .content
  5. <article> tag
  6. Fallback to <body> (with cleanup)
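
A sketch of this fallback chain using BeautifulSoup (the selector list and extract_core below are illustrative; the tool's actual implementation lives in web2md/cli.py and may differ):

```python
from bs4 import BeautifulSoup

# Priority order mirrors the list above; the first match wins.
CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "article-content"}),
    ("div", {"class_": "article_content"}),
    ("div", {"id": "main-content"}),
    ("div", {"class_": "content"}),
    ("article", {}),
]


def extract_core(html: str):
    """Return the first matching core-content element, else <body>."""
    soup = BeautifulSoup(html, "html.parser")
    for name, attrs in CORE_CONTENT_SELECTORS:
        node = soup.find(name, **attrs)
        if node is not None:
            return node
    return soup.body or soup
```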

4. Media Handling

When --picture or --video is enabled:

  • Downloads media files to images/ or videos/ subdirectories
  • Generates unique filenames with MD5 hash to prevent duplicates
  • Converts URLs to local relative paths in Markdown
  • Supports lazy-loading attributes: data-src, data-original, srcset
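
The hash-based naming could look like this sketch (media_filename is a hypothetical helper; the actual digest length and scheme in web2md may differ):

```python
import hashlib
import os
from urllib.parse import urlsplit


def media_filename(media_url: str) -> str:
    """Build a collision-resistant local filename for a media URL.

    Combining an MD5 digest of the full URL with the original
    extension means two different URLs that both end in 'logo.png'
    cannot overwrite each other.
    """
    ext = os.path.splitext(urlsplit(media_url).path)[1] or ".bin"
    digest = hashlib.md5(media_url.encode("utf-8")).hexdigest()[:12]
    return f"{digest}{ext}"
```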

5. Filename Generation

MD filenames are generated from URLs:

  • Remove base URL prefix
  • Replace / with _
  • Filter illegal characters
  • Example: https://company.com/docs/api/auth → api_auth.md
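
These rules can be sketched as follows (page_filename is an illustrative helper name, not web2md's actual function):

```python
import re


def page_filename(page_url: str, base_url: str) -> str:
    """Derive a Markdown filename from a page URL.

    Strips the base URL prefix, replaces '/' with '_', and
    filters characters that are illegal in filenames.
    """
    rel = page_url[len(base_url):] if page_url.startswith(base_url) else page_url
    rel = rel.strip("/").replace("/", "_")
    rel = re.sub(r'[\\:*?"<>|]', "_", rel)  # characters illegal on Windows
    return (rel or "index") + ".md"
```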

⚙️ Configuration

Built-in Settings (in web2md/cli.py)

Playwright Configuration

PLAYWRIGHT_CONFIG = {
    "headless": False,           # Set to True for background crawling
    "timeout": 60000,            # Page load timeout (ms)
    "wait_for_load": "networkidle",  # Wait strategy
    "sleep_after_load": 2,       # Additional wait time (seconds)
    "user_agent": "Mozilla/5.0..." # Custom user agent
}

Media Configuration

MEDIA_CONFIG = {
    "timeout": 30000,            # Media download timeout (ms)
    "image_dir": "images",       # Image save subdirectory
    "video_dir": "videos",       # Video save subdirectory
    "allowed_img_ext": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp"],
    "allowed_vid_ext": [".mp4", ".avi", ".mov", ".webm", ".flv", ".mkv"]
}

Content Filtering

REMOVE_TAGS = ["nav", "header", "footer", "aside", "script", "style", "iframe", "sidebar"]

CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "article-content"}),
    ("article", {})
]

Crawl Defaults

DEFAULT_CRAWL_CONFIG = {
    "max_depth": 5,              # Default max depth
    "max_count": 999,            # Default max pages
    "allowed_schemes": ["http", "https"],
    "exclude_patterns": [r"\.pdf$", r"\.zip$", r"\.exe$"]
}
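
A sketch of how these defaults would filter candidate URLs during a crawl (should_crawl is a hypothetical helper mirroring the config above):

```python
import re
from urllib.parse import urlsplit

ALLOWED_SCHEMES = ["http", "https"]
EXCLUDE_PATTERNS = [r"\.pdf$", r"\.zip$", r"\.exe$"]


def should_crawl(url: str) -> bool:
    """Reject non-HTTP(S) schemes and excluded file types."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    return not any(re.search(p, parts.path) for p in EXCLUDE_PATTERNS)
```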

🔧 Advanced Usage

Debug Mode (Show Browser)

Edit web2md/cli.py and set:

PLAYWRIGHT_CONFIG = {
    "headless": False,  # Shows browser window
    ...
}

Custom Content Selectors

Add site-specific selectors to CORE_CONTENT_SELECTORS:

CORE_CONTENT_SELECTORS = [
    ("main", {}),
    ("div", {"class_": "documentation-content"}),  # Custom selector
    ("article", {})
]

Anti-Bot Detection

Install and use playwright-stealth:

pip3 install playwright-stealth

Add to get_dynamic_html() in web2md/cli.py:

from playwright_stealth import stealth_sync

page = context.new_page()
stealth_sync(page)  # Add this line
page.goto(url, ...)

Authentication

Add login logic in get_dynamic_html() before page.goto():

page.goto("https://example.com/login")
page.fill("#username", "your-username")
page.fill("#password", "your-password")
page.click("#login-button")
time.sleep(2)

🐛 Troubleshooting

SSL Certificate Errors

The tool automatically disables SSL verification for downloads. If you encounter issues, check your network/firewall settings.

Timeout Errors

Increase timeout in PLAYWRIGHT_CONFIG:

"timeout": 120000,  # 2 minutes

Missing Content

  1. Check if content is in <main> or common content tags
  2. Add custom selectors to CORE_CONTENT_SELECTORS
  3. Run with headless: False to debug visually

Image Download Failures

  • Verify image URLs are accessible
  • Check if images require authentication
  • Some CDNs may block automated downloads

📋 Dependencies

Automatically installed via pip:

  • playwright - Browser automation and JS rendering
  • beautifulsoup4 - HTML parsing and manipulation
  • lxml - Fast XML/HTML parser
  • markdownify - HTML to Markdown conversion
  • urllib3 - HTTP client utilities

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests (if available)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Development Setup

git clone https://github.com/floatinghotpot/web2md.git
cd web2md
python3 -m pip install -e .
python3 -m playwright install chromium

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments


Made with ❤️ for developers, researchers, and documentation enthusiasts.

If you find this tool useful, please consider giving it a ⭐ on GitHub!
