Web2Markdown

By Roger Sindreu

A Python library for crawling web pages and converting them to markdown format. This tool is particularly useful for creating context files for Large Language Models (LLMs), especially when working with new AI frameworks or technologies where documentation is constantly evolving. It helps you easily convert web documentation, blogs, or any web content into clean markdown files while preserving the content structure.

Features

  • Depth-limited web crawling
  • Respect for domain and path boundaries while crawling
  • Smart content extraction using trafilatura
  • Clean HTML to Markdown conversion
  • Metadata extraction
  • Command-line interface with configurable options
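
The first two features can be illustrated with a minimal sketch. It is not the library's actual implementation: the helper names `in_scope` and `crawl_order`, and the in-memory `LINKS` graph that stands in for real HTTP fetches, are all hypothetical. It assumes "boundary" means the same host as the base URL and a path at or below the base path.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url: str, base_url: str) -> bool:
    """Return True if url is on the same host as base_url and under its path."""
    base, target = urlparse(base_url), urlparse(url)
    base_path = base.path.rstrip("/")
    return target.netloc == base.netloc and (
        target.path == base_path or target.path.startswith(base_path + "/")
    )


def crawl_order(base_url: str, links: dict[str, list[str]], max_depth: int) -> list[str]:
    """Breadth-first walk over `links`, limited by depth and by scope."""
    seen = {base_url}
    queue = deque([(base_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen and in_scope(nxt, base_url):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Hypothetical link graph; a real crawler would fetch and parse each page.
LINKS = {
    "https://example.com/docs": [
        "https://example.com/docs/intro",
        "https://example.com/blog",          # same host, outside /docs
        "https://other.example.org/docs/x",  # different host
    ],
    "https://example.com/docs/intro": ["https://example.com/docs/intro/setup"],
}

print(crawl_order("https://example.com/docs", LINKS, max_depth=1))
```

With `max_depth=1` only the base page and `/docs/intro` are visited; raising the limit to 2 also reaches `/docs/intro/setup`, while the blog and the other host are skipped at any depth.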

Installation

pip install web2markdown

Usage

Command Line Interface

# Basic usage
web2markdown -u https://example.com/docs -o output.md

# Specify crawl depth
web2markdown -u https://example.com/docs -d 2 -o output.md

# Enable verbose logging
web2markdown -u https://example.com/docs -v

Python API

from web2markdown import WebCrawler, MarkdownConverter

# Initialize crawler
crawler = WebCrawler(base_url="https://example.com/docs", max_depth=3)

# Crawl pages
pages = crawler.crawl()

# Convert to markdown
converter = MarkdownConverter()
markdown_content = converter.convert_to_markdown(pages)

# Save the result
converter.save_markdown(markdown_content, "output.md")

Using with LLMs (Through the UI)

You can also use Web2Markdown without writing any code: save the documentation pages as a markdown file with the command-line tool, then attach that file to your conversation in the LLM's chat interface.

Using with LLMs (Programmatically)

Web2Markdown is particularly valuable when you need to provide up-to-date context to LLMs about new frameworks or technologies. Here's how you can use it with popular LLM APIs:

from openai import OpenAI
# For Gemini instead: import google.generativeai as genai

from web2markdown import WebCrawler, MarkdownConverter

# Download the latest documentation
crawler = WebCrawler(base_url="https://python.langchain.com/docs/expression_language/", max_depth=2)
pages = crawler.crawl()

# Convert the crawled pages and save them as a context file
converter = MarkdownConverter()
markdown_content = converter.convert_to_markdown(pages)
converter.save_markdown(markdown_content, "langgraph.md")

# Use with OpenAI (the client reads OPENAI_API_KEY from the environment)
client = OpenAI()
with open("langgraph.md", "r") as f:
    context = f.read()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI expert. Use the context provided to answer questions."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: How do I create a simple LangGraph?"},
    ],
)
print(response.choices[0].message.content)

# Or use with Google's Gemini (uncomment, and import genai as noted above)
# genai.configure(api_key="YOUR_API_KEY")
# model = genai.GenerativeModel("gemini-pro")
# response = model.generate_content(
#     f"Context: {context}\n\nQuestion: How do I create a simple LangGraph?"
# )
# print(response.text)
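
A crawl can produce more text than a model accepts in a single request, so it may help to cap the context before building the prompt. The sketch below is only an illustration: `build_prompt` is a hypothetical helper that is not part of Web2Markdown, the 20,000-character default is an arbitrary assumption, and it counts characters rather than tokens.

```python
def build_prompt(context: str, question: str, max_context_chars: int = 20_000) -> str:
    """Build a 'Context / Question' prompt, truncating overly long context."""
    if len(context) > max_context_chars:
        # Keep the beginning of the document and mark the cut explicitly.
        context = context[:max_context_chars] + "\n\n[context truncated]"
    return f"Context: {context}\n\nQuestion: {question}"


# Example with a tiny budget so the truncation is visible.
print(build_prompt("abcdef", "What comes next?", max_context_chars=3))
```

The returned string can be used as the user message in either of the API calls above. A token-based limit would be more precise, but it depends on the tokenizer of the model in use.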

Command Line Options

  • -u, --url: Base URL to crawl (required)
  • -d, --depth: Maximum crawl depth (default: 3)
  • -o, --output: Output markdown file path (default: output.md)
  • -v, --verbose: Enable verbose logging
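
For readers who want to wrap or reproduce this interface, the options above map naturally onto `argparse`. This is a sketch under that assumption, not the package's actual CLI code; `build_parser` is a hypothetical name, and only the flags and defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="web2markdown",
        description="Crawl web pages and convert them to markdown.",
    )
    parser.add_argument("-u", "--url", required=True, help="Base URL to crawl")
    parser.add_argument("-d", "--depth", type=int, default=3, help="Maximum crawl depth")
    parser.add_argument("-o", "--output", default="output.md", help="Output markdown file path")
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose logging")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-u", "https://example.com/docs", "-d", "2"])
print(args.url, args.depth, args.output, args.verbose)
```

Omitting `-d` and `-o` falls back to the documented defaults of 3 and `output.md`.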

Dependencies

  • beautifulsoup4>=4.12.0
  • requests>=2.31.0
  • urllib3>=2.1.0
  • markdown>=3.5.0
  • trafilatura>=0.8.1

License

MIT License
