🚀 Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.

Project description

🚀 url2md4ai

🎯 A lean Python tool for extracting clean, LLM-optimized markdown from web pages.

Perfect for AI applications that need high-quality text extraction from both static and dynamic web content. It combines Playwright for JavaScript rendering with Trafilatura for intelligent content extraction, delivering clean markdown ready for LLM processing.

🎯 Why url2md4ai?

Traditional tools extract everything: ads, cookie banners, navigation menus, social media widgets...
url2md4ai extracts only what matters: clean, structured content ready for LLM processing.

# Example: Extract job posting from Satispay careers page
url2md4ai convert "https://www.satispay.com/careers/job-posting" --show-metadata

# Result: output shrinks from 51KB to 9KB (about 82% smaller)
# ✅ Clean job title, description, requirements, benefits
# ❌ No cookie banners, ads, or navigation clutter

Perfect for:

  • 🤖 AI content analysis workflows
  • 📊 LLM-based information extraction
  • 🔍 Web scraping for research and analysis
  • 📝 Content preprocessing for RAG systems
  • 🎯 Automated content monitoring

✨ Features

  • 🧠 Smart Content Extraction: Powered by trafilatura for intelligent text extraction from HTML.
  • 🚀 Dynamic Content Support: Uses playwright to render JavaScript on web pages, ensuring content from SPAs and dynamic sites is captured.
  • 🧹 Clean Output: Removes ads, cookie banners, navigation, and other noise for a cleaner final output.
  • 🐍 Simple API: A straightforward Python API and CLI for easy integration into your workflows.
  • 📁 Deterministic Filenames: Generates unique, hash-based filenames from URLs for consistent output.
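
The hash-based filename idea can be illustrated with a short sketch. This is not url2md4ai's actual naming scheme, which this document does not specify: `url_to_filename`, the choice of SHA-256, and the 16-character truncation are all assumptions made for illustration.

```python
import hashlib


def url_to_filename(url: str, length: int = 16, ext: str = ".md") -> str:
    """Map a URL to a stable filename (hypothetical helper, not the library's API)."""
    # Hash the UTF-8 bytes of the URL and keep a short hex prefix.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest[:length] + ext


# The same URL always yields the same name; these two URLs yield different names.
name_a = url_to_filename("https://example.com")
name_b = url_to_filename("https://example.com/other")
print(name_a, name_b)
```

Because the name depends only on the URL, re-running a conversion overwrites the earlier file instead of creating a duplicate.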

Lean & Efficient

  • 🎯 Focused Purpose: Built specifically for AI/LLM text extraction workflows
  • ⚡ Fast Processing: Optional non-JavaScript mode that skips browser rendering for static content (reported as roughly 3x faster than the JavaScript-rendering path)
  • 🔧 CLI-First: Simple command-line interface for batch processing and automation
  • 🐍 Python API: Clean programmatic access for integration into AI pipelines

🛠️ Production Ready

  • 📁 Smart Filenames: Generate unique, deterministic filenames using URL hashes
  • 🔄 Batch Processing: Parallel processing support for multiple URLs
  • 🎛️ Configurable: Extensive configuration options for different content types
  • 📈 Reliable: Built-in retry logic and error handling
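
Parallel processing with retries, as listed above, could look roughly like the sketch below. It is a generic asyncio pattern, not the library's internal implementation: `fetch_with_retry`, `convert_many`, and their parameters are hypothetical names, and `fetch` stands for any async callable that takes a URL.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def fetch_with_retry(
    fetch: Callable[[str], Awaitable[Optional[dict]]],
    url: str,
    retries: int = 3,
    base_delay: float = 0.5,
) -> Optional[dict]:
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                return None  # give up after the last attempt
            await asyncio.sleep(base_delay * 2 ** attempt)
    return None


async def convert_many(fetch, urls, max_concurrency: int = 4, **retry_kwargs) -> dict:
    """Process URLs in parallel, never running more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            return url, await fetch_with_retry(fetch, url, **retry_kwargs)

    # gather preserves input order; the result maps each URL to its outcome or None.
    return dict(await asyncio.gather(*(worker(u) for u in urls)))
```

With the package installed, the async `extract_markdown` method shown in the Python API section could be passed as `fetch`, for example `asyncio.run(convert_many(ContentExtractor().extract_markdown, urls))`.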

🚀 Quick Start

Using uv (Recommended)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
uv sync

# Install Playwright browsers
uv run playwright install chromium

# Convert your first URL
uv run url2md4ai convert "https://example.com"

Using pip

pip install url2md4ai
playwright install chromium
url2md4ai convert "https://example.com"

Using Docker

See DOCKER_USAGE.md for instructions on how to use the provided Docker setup.

📖 Usage

Command-Line Interface (CLI)

The CLI provides a simple way to convert URLs to markdown or extract raw HTML.

Convert a URL to Markdown

# Convert a single URL and print to console
url2md4ai convert "https://example.com" --no-save

# Save the markdown to the default 'output' directory
url2md4ai convert "https://example.com"

# Specify a custom output directory
url2md4ai convert "https://example.com" --output-dir my_markdown

Extract Raw HTML from a URL

# Get the raw HTML of a page and print it to the console
url2md4ai extract-html "https://example.com"

Convert a Local HTML File

# Convert a local HTML file to markdown
url2md4ai convert-html my_page.html

For more options, use the --help flag with any command:

url2md4ai convert --help
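
For scripted batch runs, the CLI invocations above can be assembled from Python. The sketch below only builds the argument list and uses just the `convert` subcommand and the `--output-dir` and `--no-save` options shown in this section; `build_convert_command` is a hypothetical helper, not part of the package.

```python
import shlex
from typing import List, Optional


def build_convert_command(
    url: str, output_dir: Optional[str] = None, no_save: bool = False
) -> List[str]:
    """Build the argv list for `url2md4ai convert` (hypothetical helper)."""
    cmd = ["url2md4ai", "convert", url]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if no_save:
        cmd.append("--no-save")
    return cmd


# Print the shell-quoted commands without executing them.
for url in ["https://example.com", "https://example.org"]:
    print(shlex.join(build_convert_command(url, output_dir="my_markdown")))
```

Once url2md4ai is installed, each list can be passed to `subprocess.run(cmd, check=True)`; passing a list avoids shell-quoting problems with URLs that contain `&` or `?`.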

Python API

The Python API provides programmatic access to the content extraction functionality.

import asyncio
from url2md4ai import ContentExtractor

# Initialize the extractor
extractor = ContentExtractor()

async def main():
    url = "https://example.com"

    # Extract clean markdown from a URL
    markdown_result = await extractor.extract_markdown(url)
    if markdown_result:
        print("--- MARKDOWN ---")
        print(markdown_result["markdown"])
        print(f"\nSaved to: {markdown_result['output_path']}")

    # Extract raw HTML from a URL
    html_content = await extractor.extract_html(url)
    if html_content:
        print("\n--- HTML ---")
        print(html_content[:200] + "...")  # Print first 200 characters

asyncio.run(main())

Synchronous Usage

For use cases where you can't use asyncio, synchronous wrappers are available:

from url2md4ai import ContentExtractor

extractor = ContentExtractor()
url = "https://example.com"

# Synchronously extract markdown
markdown_result = extractor.extract_markdown_sync(url)
if markdown_result:
    print(markdown_result["markdown"])

# Synchronously extract HTML
html_content = extractor.extract_html_sync(url)
if html_content:
    print(html_content[:200] + "...")
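
A synchronous wrapper around an async method is commonly built on `asyncio.run`. The sketch below shows that pattern with a stand-in class; it is an assumption about how such wrappers can work in general, not a description of how url2md4ai implements `extract_markdown_sync`.

```python
import asyncio


class SketchExtractor:
    """Stand-in class for illustration; not url2md4ai's ContentExtractor."""

    async def extract_markdown(self, url: str) -> dict:
        await asyncio.sleep(0)  # placeholder for real async I/O
        return {"markdown": f"# Content of {url}"}

    def extract_markdown_sync(self, url: str) -> dict:
        # asyncio.run starts a fresh event loop, runs the coroutine, then closes the loop.
        return asyncio.run(self.extract_markdown(url))


print(SketchExtractor().extract_markdown_sync("https://example.com")["markdown"])
```

Note that `asyncio.run` raises `RuntimeError` when called from inside an already running event loop (for example in a Jupyter notebook), so in that setting the async method should be awaited directly.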

🛠️ Configuration

The behavior of the ContentExtractor can be customized through a Config object or environment variables.

Example: Custom Configuration

from url2md4ai import ContentExtractor, Config

# Customize configuration
config = Config(
    timeout=60,                  # Page load timeout in seconds
    user_agent="MyTestAgent/1.0", # Custom User-Agent
    output_dir="custom_output",  # Default output directory
    browser_headless=True,       # Run Playwright in headless mode
    wait_for_network_idle=True,  # Wait for network to be idle
    page_wait_timeout=2000       # Additional wait time in ms
)

extractor = ContentExtractor(config=config)

# This will use the custom configuration
extractor.extract_markdown_sync("https://example.com")

See src/url2md4ai/config.py for all available configuration options and their corresponding environment variables.
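
The general pattern of a config object whose fields can be overridden by environment variables can be sketched as follows. Everything here is illustrative: `SketchConfig` is a stand-in rather than the library's `Config`, the `SKETCH_` prefix and variable names are invented, and the default values are placeholders. The real option names and environment variables are defined in `src/url2md4ai/config.py`.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class SketchConfig:
    """Stand-in for illustration only; not the library's Config class."""

    timeout: int = 30              # placeholder default
    output_dir: str = "output"
    browser_headless: bool = True  # placeholder default

    @classmethod
    def from_env(cls, prefix: str = "SKETCH_", environ=None) -> "SketchConfig":
        """Build a config, overriding fields from PREFIX + FIELD_NAME variables."""
        environ = os.environ if environ is None else environ
        overrides = {}
        for f in fields(cls):
            raw = environ.get(prefix + f.name.upper())
            if raw is None:
                continue  # keep the dataclass default
            if f.type is bool:
                overrides[f.name] = raw.strip().lower() in {"1", "true", "yes"}
            else:
                overrides[f.name] = f.type(raw)  # int("60"), str("dir"), ...
        return cls(**overrides)


cfg = SketchConfig.from_env(environ={"SKETCH_TIMEOUT": "60"})
print(cfg)
```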

🤝 Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

📊 Extraction Quality Examples

Before vs After: Real-World Results

# Complex job posting with cookie banners and ads
url2md4ai convert "https://company.com/careers/position" --show-metadata

Before (Raw HTML): 51KB, 797 lines

  • ❌ Cookie consent banners
  • ❌ Website navigation
  • ❌ Social media widgets
  • ❌ Advertising content
  • ❌ Footer links and legal text

After (url2md4ai): 9KB, 69 lines

  • ✅ Job title and description
  • ✅ Key requirements
  • ✅ Company benefits
  • ✅ Application process
  • Roughly 82% smaller by size and 91% fewer lines
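
The before and after figures above can be converted into a reduction percentage. What counts as noise is not defined in this document, so the sketch below measures only raw size and line count; `reduction_percent` is a hypothetical helper.

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (1 - after / before) * 100


print(round(reduction_percent(51, 9), 1))    # 82.4 (size in KB)
print(round(reduction_percent(797, 69), 1))  # 91.3 (line count)
```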

Content Types Optimized for LLM

Content Type | Extraction Quality | Best Settings
News Articles | ⭐⭐⭐⭐⭐ | --no-js (faster)
Job Postings | ⭐⭐⭐⭐⭐ | --force-js (complete)
Product Pages | ⭐⭐⭐⭐ | --clean (essential)
Documentation | ⭐⭐⭐⭐⭐ | --raw (preserve structure)
Blog Posts | ⭐⭐⭐⭐⭐ | default settings
Social Media | ⭐⭐⭐ | --force-js required
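
In an automated pipeline, the recommendations in this table can be kept as a small lookup. The flags are copied from the table; the dictionary keys and the `flags_for` helper are hypothetical names chosen for this sketch.

```python
# Flags taken from the table above; keys are illustrative labels.
RECOMMENDED_FLAGS = {
    "news": ["--no-js"],
    "job_posting": ["--force-js"],
    "product_page": ["--clean"],
    "documentation": ["--raw"],
    "blog_post": [],  # default settings
    "social_media": ["--force-js"],
}


def flags_for(content_type: str) -> list:
    """Return a copy of the recommended flags, or no flags for unknown types."""
    return list(RECOMMENDED_FLAGS.get(content_type, []))


print(["url2md4ai", "convert", "https://example.com", *flags_for("news")])
```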

📈 Roadmap

  • Support for more output formats (PDF, DOCX)
  • Custom CSS selector filtering
  • Integration with popular LLM APIs
  • Web UI
  • Plugin system for custom processors
  • Support for authentication-required pages

Made with ❤️ by Saverio Mazza

Download files

Source Distribution

url2md4ai-0.1.2.tar.gz (255.3 kB)

Uploaded Source

Built Distribution

url2md4ai-0.1.2-py3-none-any.whl (15.6 kB)

Uploaded Python 3

File details

Details for the file url2md4ai-0.1.2.tar.gz.

File metadata

  • Download URL: url2md4ai-0.1.2.tar.gz
  • Upload date:
  • Size: 255.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for url2md4ai-0.1.2.tar.gz
Algorithm | Hash digest
SHA256 | 4f6ef57a7eb911c22497ad8d5fb4f1b932738740f00b8a6d3e1785d751094120
MD5 | 27cc1678ae10b2c5c3624954f6058fd3
BLAKE2b-256 | 51924a5cbaf0d0e097e7ac91459d38adc504b1ed5113033c41beaf099f2fc2c8

Provenance

The following attestation bundles were made for url2md4ai-0.1.2.tar.gz:

Publisher: release.yml on mazzasaverio/url2md4ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file url2md4ai-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: url2md4ai-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 15.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for url2md4ai-0.1.2-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 48fa63d5bf6e877ea1870de1c7596730f31412e7c4f3dbea7324c8e056f85786
MD5 | 260777f014bd1b5df04d54c90d64f71a
BLAKE2b-256 | bb9987bcaf751a6c3e1f78ea672af3844768d7f3c3d823bd79f1de44bf50310f

Provenance

The following attestation bundles were made for url2md4ai-0.1.2-py3-none-any.whl:

Publisher: release.yml on mazzasaverio/url2md4ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
