A Python library for crawling web pages and converting them to markdown format

Web2Markdown

By Roger Sindreu

A Python library for crawling web pages and converting them to markdown format. This tool is particularly useful for creating context files for Large Language Models (LLMs), especially when working with new AI frameworks or technologies where documentation is constantly evolving. It helps you easily convert web documentation, blogs, or any web content into clean markdown files while preserving the content structure.

Features

Depth-limited web crawling
Domain/path boundary respect
Smart content extraction using trafilatura
Clean HTML to Markdown conversion
Metadata extraction
Command-line interface with configurable options
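The first two features can be illustrated with a short sketch. This is not the library's implementation: `in_scope` and `crawl_order` are hypothetical helpers, and a toy in-memory link graph stands in for real HTTP fetches. It only shows one plausible way a breadth-first crawl can stop at a maximum depth while staying under the base URL's host and path.

```python
from collections import deque
from urllib.parse import urlparse


def in_scope(url: str, base_url: str) -> bool:
    """Return True if url is on the same host as base_url and under its path."""
    u, b = urlparse(url), urlparse(base_url)
    base_path = b.path.rstrip("/")
    same_host = u.netloc == b.netloc
    under_path = u.path == base_path or u.path.startswith(base_path + "/")
    return same_host and under_path


def crawl_order(base_url: str, links: dict, max_depth: int) -> list:
    """Breadth-first walk over `links` (url -> outgoing urls), limited to max_depth."""
    seen = {base_url}
    queue = deque([(base_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen and in_scope(nxt, base_url):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Toy link graph with hypothetical URLs; a real crawler would fetch and parse pages.
LINKS = {
    "https://example.com/docs": [
        "https://example.com/docs/a",
        "https://example.com/blog",      # same host, outside /docs
        "https://other.com/docs/x",      # different host
    ],
    "https://example.com/docs/a": ["https://example.com/docs/a/b"],
    "https://example.com/docs/a/b": ["https://example.com/docs/a/b/c"],
}

visited = crawl_order("https://example.com/docs", LINKS, max_depth=2)
for url, depth in visited:
    print(depth, url)
```

With `max_depth=2`, the page three links away from the base URL is never visited, and the out-of-scope links are skipped regardless of depth.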

Installation

pip install web2markdown

Usage

Command Line Interface

# Basic usage
web2markdown -u https://example.com/docs -o output.md

# Specify crawl depth
web2markdown -u https://example.com/docs -d 2 -o output.md

# Enable verbose logging
web2markdown -u https://example.com/docs -v

Python API

from web2markdown import WebCrawler, MarkdownConverter

# Initialize crawler
crawler = WebCrawler(base_url="https://example.com/docs", max_depth=3)

# Crawl pages
pages = crawler.crawl()

# Convert to markdown
converter = MarkdownConverter()
markdown_content = converter.convert_to_markdown(pages)

# Save the result
converter.save_markdown(markdown_content, "output.md")

Using with LLMs through the UI

Run Web2Markdown to save the documentation pages as a markdown file, then attach that file to your conversation in the LLM's chat interface.

Using with LLMs (Programmatically)

Web2Markdown is particularly valuable when you need to provide up-to-date context to LLMs about new frameworks or technologies. Here's how you can use it with popular LLM APIs:

from web2markdown import WebCrawler, MarkdownConverter
from openai import OpenAI  # or import google.generativeai as genai

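# Download the latest documentation.
# Point base_url at the docs of the framework you will ask about; the question
# below concerns LangGraph, so substitute that project's docs URL as needed.
crawler = WebCrawler(base_url="https://python.langchain.com/docs/expression_language/", max_depth=2)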
pages = crawler.crawl()

converter = MarkdownConverter()
markdown_content = converter.convert_to_markdown(pages)
converter.save_markdown(markdown_content, "langgraph.md")

# Use with OpenAI
client = OpenAI()
with open("langgraph.md", "r") as f:
    context = f.read()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an AI expert. Use the context provided to answer questions."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: How do I create a simple LangGraph?"},
    ],
)

# Print the model's answer
print(response.choices[0].message.content)

# Or use with Google's Gemini
'''
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro")

with open("langgraph.md", "r") as f:
    context = f.read()

response = model.generate_content(
    f"Context: {context}\n\nQuestion: How do I create a simple LangGraph?"
)
'''
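Crawled documentation can easily exceed a model's context window. The sketch below caps the context before building the same "Context / Question" prompt used above. `build_prompt` and the `max_context_chars` default are assumptions made for illustration, not part of Web2Markdown, and a character count is only a crude stand-in for a token budget.

```python
def build_prompt(context: str, question: str, max_context_chars: int = 20000) -> str:
    """Build a prompt, truncating the context to a rough character budget."""
    if len(context) > max_context_chars:
        # Keep the start of the document and flag that the rest was dropped.
        context = context[:max_context_chars] + "\n\n[context truncated]"
    return f"Context: {context}\n\nQuestion: {question}"


# Example with a tiny budget so the truncation is visible.
prompt = build_prompt("abcdef", "What is this?", max_context_chars=3)
print(prompt)
```

A more careful version would count tokens with the tokenizer of the target model, or split the markdown by heading and include only the relevant sections.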

Command Line Options

-u, --url: Base URL to crawl (required)
-d, --depth: Maximum crawl depth (default: 3)
-o, --output: Output markdown file path (default: output.md)
-v, --verbose: Enable verbose logging
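For reference, the options listed above could be declared with `argparse` roughly as follows. This is a sketch built only from the documented flags and defaults; the package's actual parser may be written differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the four documented options with their documented defaults."""
    parser = argparse.ArgumentParser(
        prog="web2markdown",
        description="Crawl web pages and convert them to markdown.",
    )
    parser.add_argument("-u", "--url", required=True, help="Base URL to crawl")
    parser.add_argument("-d", "--depth", type=int, default=3, help="Maximum crawl depth")
    parser.add_argument("-o", "--output", default="output.md", help="Output markdown file path")
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose logging")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["-u", "https://example.com/docs", "-d", "2"])
print(args.url, args.depth, args.output, args.verbose)
```

Omitting `-o` and `-v` here falls back to the documented defaults: `output.md` and no verbose logging.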

Dependencies

beautifulsoup4>=4.12.0
requests>=2.31.0
urllib3>=2.1.0
markdown>=3.5.0
trafilatura>=0.8.1

License

MIT License

Release history

0.1.1 (this version): Mar 18, 2025
0.1.0: Mar 18, 2025

Download files

Source Distribution

web2markdown-0.1.1.tar.gz (10.3 kB)

Uploaded Mar 18, 2025 Source

Built Distribution

web2markdown-0.1.1-py3-none-any.whl (10.6 kB)

Uploaded Mar 18, 2025 Python 3

File details

Details for the file web2markdown-0.1.1.tar.gz.

File metadata

Download URL: web2markdown-0.1.1.tar.gz
Upload date: Mar 18, 2025
Size: 10.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for web2markdown-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`781b1aa5d5aa4b6c626236a3ace504e3b77b17d7af088675e2604d88cd08dbec`
MD5	`5db0098e4a6503987d8bedb2ec02ac20`
BLAKE2b-256	`3b7f70c9d362448e1b37cf31a24e0fb6c0fb52451fc402bc2439b76f1c4e2cb8`

File details

Details for the file web2markdown-0.1.1-py3-none-any.whl.

File metadata

Download URL: web2markdown-0.1.1-py3-none-any.whl
Upload date: Mar 18, 2025
Size: 10.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for web2markdown-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`626fe506bc699b7df7b3b0d0767a502a5bd13343ff448c7fd450afd4a23dfa08`
MD5	`97f0458b4ba9f8a6c8ab5ab7f1536037`
BLAKE2b-256	`b35525cc62da655220f26efaa013ff96312c4be3dc0835310bcdbc5a1c3d6d86`
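The published digests can be used to check a downloaded file before installing it. The sketch below uses only the standard library; `sha256_of` and `matches_published_digest` are hypothetical helper names, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the source distribution was downloaded to the working directory):
# ok = matches_published_digest(
#     "web2markdown-0.1.1.tar.gz",
#     "781b1aa5d5aa4b6c626236a3ace504e3b77b17d7af088675e2604d88cd08dbec",
# )
```

The same check works for the wheel by passing its file name and the SHA256 value listed for it.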
