🚀 Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.

🚀 url2md4ai

🎯 A lean Python tool for extracting clean, LLM-optimized markdown from web pages.

Perfect for AI applications that need high-quality text extraction from both static and dynamic web content. It combines Playwright for JavaScript rendering with Trafilatura for intelligent content extraction, delivering clean markdown ready for LLM processing.
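The two-stage idea described above (render with Playwright, then extract with Trafilatura) can be sketched roughly as follows. This is not url2md4ai's own source: the function names `render_html`, `html_to_markdown`, and `url_to_markdown` are hypothetical, and the sketch assumes a Trafilatura version that supports `output_format="markdown"`.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def render_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the module can be loaded without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # Wait until network activity settles so JS-injected content is present.
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()


def html_to_markdown(html: str) -> Optional[str]:
    """Keep the main content only; returns None if nothing could be extracted."""
    import trafilatura

    return trafilatura.extract(html, output_format="markdown")


async def url_to_markdown(
    url: str,
    render: Callable[[str], Awaitable[str]] = render_html,
    extract: Callable[[str], Optional[str]] = html_to_markdown,
) -> Optional[str]:
    """Render first, then extract. Both stages are injectable for testing."""
    html = await render(url)
    return extract(html)
```

With Playwright and Trafilatura installed, `asyncio.run(url_to_markdown("https://example.com"))` would run both stages end to end.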

🎯 Why url2md4ai?

Traditional tools extract everything: ads, cookie banners, navigation menus, social media widgets...
url2md4ai extracts only what matters: clean, structured content ready for LLM processing.

```bash
# Example: Extract job posting from Satispay careers page
url2md4ai convert "https://www.satispay.com/careers/job-posting" --show-metadata

# Result: 51KB of raw HTML reduced to 9KB of markdown (reported as 97% noise reduction)
# ✅ Clean job title, description, requirements, benefits
# ❌ No cookie banners, ads, or navigation clutter
```

Perfect for:

🤖 AI content analysis workflows
📊 LLM-based information extraction
🔍 Web scraping for research and analysis
📝 Content preprocessing for RAG systems
🎯 Automated content monitoring

✨ Features

🧠 Smart Content Extraction: Powered by trafilatura for intelligent text extraction from HTML.
🚀 Dynamic Content Support: Uses playwright to render JavaScript on web pages, ensuring content from SPAs and dynamic sites is captured.
🧹 Clean Output: Removes ads, cookie banners, navigation, and other noise for a cleaner final output.
🐍 Simple API: A straightforward Python API and CLI for easy integration into your workflows.
📁 Deterministic Filenames: Generates unique, hash-based filenames from URLs for consistent output.
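To illustrate what a deterministic, hash-based filename can look like, here is a minimal sketch. The README does not document the actual scheme, so the choice of SHA-256, the 16-character prefix, the host prefix, and the name `url_to_filename` are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def url_to_filename(url: str, length: int = 16, ext: str = ".md") -> str:
    """Map a URL to a stable filename: same URL in, same name out."""
    # Hash the full URL so that different paths on one host do not collide.
    digest = hashlib.sha256(url.strip().encode("utf-8")).hexdigest()[:length]
    # Prefix with the host purely for readability (hypothetical convention).
    host = urlsplit(url).netloc.replace(":", "_") or "page"
    return f"{host}-{digest}{ext}"


print(url_to_filename("https://example.com/careers/job-posting"))
```

Because the name depends only on the URL, re-running a conversion overwrites the earlier file instead of creating a duplicate.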

⚡ Lean & Efficient

🎯 Focused Purpose: Built specifically for AI/LLM text extraction workflows
⚡ Fast Processing: Optional non-JavaScript mode for static content (reported as about 3x faster than rendering with Playwright)
🔧 CLI-First: Simple command-line interface for batch processing and automation
🐍 Python API: Clean programmatic access for integration into AI pipelines

🛠️ Production Ready

📁 Smart Filenames: Generate unique, deterministic filenames using URL hashes
🔄 Batch Processing: Parallel processing support for multiple URLs
🎛️ Configurable: Extensive configuration options for different content types
📈 Reliable: Built-in retry logic and error handling
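Parallel batch processing with retries, as listed above, is commonly built from a semaphore plus exponential backoff. The sketch below shows that general pattern, not the library's internals: `convert_many`, `with_retries`, and their parameters are hypothetical, and `task` stands in for any async function that converts one URL.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional

Task = Callable[[str], Awaitable[str]]


async def with_retries(task: Task, url: str, attempts: int = 3, base_delay: float = 0.5) -> str:
    """Call task(url), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await task(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


async def convert_many(
    urls: Iterable[str],
    task: Task,
    max_concurrency: int = 4,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> Dict[str, Optional[str]]:
    """Convert URLs concurrently; a URL that keeps failing maps to None."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str):
        async with semaphore:  # cap the number of pages in flight
            try:
                return url, await with_retries(task, url, attempts, base_delay)
            except Exception:
                return url, None

    return dict(await asyncio.gather(*(worker(u) for u in urls)))
```

Returning `None` for failed URLs keeps one bad page from aborting the whole batch; a stricter pipeline might collect the exceptions instead.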

🚀 Quick Start

Using `uv` (Recommended)

```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
uv sync

# Install Playwright browsers
uv run playwright install chromium

# Convert your first URL
uv run url2md4ai convert "https://example.com"
```

Using `pip`

```bash
pip install url2md4ai
playwright install chromium
url2md4ai convert "https://example.com"
```

Using Docker

See DOCKER_USAGE.md for instructions on how to use the provided Docker setup.

📖 Usage

Command-Line Interface (CLI)

The CLI provides a simple way to convert URLs to markdown or extract raw HTML.

Convert a URL to Markdown

```bash
# Convert a single URL and print to console
url2md4ai convert "https://example.com" --no-save

# Save the markdown to the default 'output' directory
url2md4ai convert "https://example.com"

# Specify a custom output directory
url2md4ai convert "https://example.com" --output-dir my_markdown
```

Extract Raw HTML from a URL

```bash
# Get the raw HTML of a page and print it to the console
url2md4ai extract-html "https://example.com"
```

Convert a Local HTML File

```bash
# Convert a local HTML file to markdown
url2md4ai convert-html my_page.html
```

For more options, use the --help flag with any command:

```bash
url2md4ai convert --help
```
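For scripting, the CLI commands above can also be driven from Python. The following is a sketch that uses only the `convert` subcommand and the `--output-dir` and `--no-save` options shown in this section; the helper names are hypothetical, and it assumes the CLI signals failure with a non-zero exit code.

```python
import subprocess
from typing import Dict, Iterable, List, Optional


def build_convert_command(url: str, output_dir: Optional[str] = None, save: bool = True) -> List[str]:
    """Build the argument list for one `url2md4ai convert` call."""
    cmd = ["url2md4ai", "convert", url]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if not save:
        cmd.append("--no-save")
    return cmd


def convert_all(urls: Iterable[str], output_dir: str = "output") -> Dict[str, int]:
    """Run the CLI once per URL and record each exit code (0 assumed to mean success)."""
    codes: Dict[str, int] = {}
    for url in urls:
        completed = subprocess.run(
            build_convert_command(url, output_dir),
            capture_output=True,
            text=True,
        )
        codes[url] = completed.returncode
    return codes
```

Passing arguments as a list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.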

Python API

The Python API provides programmatic access to the content extraction functionality.

```python
import asyncio
from url2md4ai import ContentExtractor

# Initialize the extractor
extractor = ContentExtractor()

async def main():
    url = "https://example.com"

    # Extract clean markdown from a URL
    markdown_result = await extractor.extract_markdown(url)
    if markdown_result:
        print("--- MARKDOWN ---")
        print(markdown_result["markdown"])
        print(f"\nSaved to: {markdown_result['output_path']}")

    # Extract raw HTML from a URL
    html_content = await extractor.extract_html(url)
    if html_content:
        print("\n--- HTML ---")
        print(html_content[:200] + "...")  # Print first 200 characters

asyncio.run(main())
```

Synchronous Usage

For use cases where you can't use asyncio, synchronous wrappers are available:

```python
from url2md4ai import ContentExtractor

extractor = ContentExtractor()
url = "https://example.com"

# Synchronously extract markdown
markdown_result = extractor.extract_markdown_sync(url)
if markdown_result:
    print(markdown_result["markdown"])

# Synchronously extract HTML
html_content = extractor.extract_html_sync(url)
if html_content:
    print(html_content[:200] + "...")
```
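For readers curious how a synchronous wrapper around an async method can work in general, here is one common pattern. It is a generic sketch, not the code behind `extract_markdown_sync`; the decorator name `make_sync` is hypothetical.

```python
import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor


def make_sync(async_fn):
    """Return a blocking version of an async function."""

    @functools.wraps(async_fn)
    def wrapper(*args, **kwargs):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No event loop in this thread: run the coroutine directly.
            return asyncio.run(async_fn(*args, **kwargs))
        # A loop is already running (for example in a notebook), and
        # asyncio.run() would raise here, so use a worker thread with its own loop.
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(asyncio.run, async_fn(*args, **kwargs)).result()

    return wrapper


async def fetch_title(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for real I/O
    return f"title of {url}"


fetch_title_sync = make_sync(fetch_title)
print(fetch_title_sync("https://example.com"))  # prints "title of https://example.com"
```

The thread fallback blocks the calling loop until the result is ready, which is acceptable for scripts and notebooks but not for latency-sensitive async servers.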

🛠️ Configuration

The behavior of the ContentExtractor can be customized through a Config object or environment variables.

Example: Custom Configuration

```python
from url2md4ai import ContentExtractor, Config

# Customize configuration
config = Config(
    timeout=60,                    # Page load timeout in seconds
    user_agent="MyTestAgent/1.0",  # Custom User-Agent
    output_dir="custom_output",    # Default output directory
    browser_headless=True,         # Run Playwright in headless mode
    wait_for_network_idle=True,    # Wait for network to be idle
    page_wait_timeout=2000,        # Additional wait time in ms
)

extractor = ContentExtractor(config=config)

# This will use the custom configuration
extractor.extract_markdown_sync("https://example.com")
```

See src/url2md4ai/config.py for all available configuration options and their corresponding environment variables.
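As an illustration of how a config object can fall back to environment variables, consider the sketch below. The field names `timeout`, `output_dir`, and `browser_headless` come from the example above, but the class `ExampleConfig`, the `EXAMPLE_` prefix, the variable names, and the default values are hypothetical; the real names are in `src/url2md4ai/config.py`.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Read a boolean flag from the environment ("1", "true", "yes", "on")."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class ExampleConfig:
    timeout: int = 30             # seconds (assumed default)
    output_dir: str = "output"    # matches the CLI's default directory
    browser_headless: bool = True

    @classmethod
    def from_env(cls, prefix: str = "EXAMPLE_") -> "ExampleConfig":
        """Build a config where each field may be overridden by PREFIX + FIELD."""
        return cls(
            timeout=int(os.environ.get(prefix + "TIMEOUT", cls.timeout)),
            output_dir=os.environ.get(prefix + "OUTPUT_DIR", cls.output_dir),
            browser_headless=_env_bool(prefix + "BROWSER_HEADLESS", cls.browser_headless),
        )
```

Explicit keyword arguments, as in the `Config(...)` call above, would normally take precedence over values read from the environment.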

🤝 Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

📊 Extraction Quality Examples

Before vs After: Real-World Results

```bash
# Complex job posting with cookie banners and ads
url2md4ai convert "https://company.com/careers/position" --show-metadata
```

Before (Raw HTML): 51KB, 797 lines

❌ Cookie consent banners
❌ Website navigation
❌ Social media widgets
❌ Advertising content
❌ Footer links and legal text

After (url2md4ai): 9KB, 69 lines

✅ Job title and description
✅ Key requirements
✅ Company benefits
✅ Application process
✅ 97% noise reduction!
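For reference, the before and after figures quoted above can be turned into plain percentage reductions. Neither measure equals the quoted 97%, which presumably counts something else (the README does not say how noise reduction is measured), so treat this only as arithmetic on the stated numbers.

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


size_reduction = reduction_percent(51, 9)     # KB, from the figures above
line_reduction = reduction_percent(797, 69)   # lines, from the figures above

print(f"Size:  {size_reduction:.1f}% smaller")   # Size:  82.4% smaller
print(f"Lines: {line_reduction:.1f}% fewer")     # Lines: 91.3% fewer
```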

Content Types Optimized for LLM

| Content Type | Extraction Quality | Best Settings |
| --- | --- | --- |
| News Articles | ⭐⭐⭐⭐⭐ | `--no-js` (faster) |
| Job Postings | ⭐⭐⭐⭐⭐ | `--force-js` (complete) |
| Product Pages | ⭐⭐⭐⭐ | `--clean` (essential) |
| Documentation | ⭐⭐⭐⭐⭐ | `--raw` (preserve structure) |
| Blog Posts | ⭐⭐⭐⭐⭐ | default settings |
| Social Media | ⭐⭐⭐ | `--force-js` required |
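The recommendations in the table can be captured in a small lookup for batch scripts. This is a sketch: the flags are the ones listed in the table, while the mapping `RECOMMENDED_FLAGS` and the helper `convert_command` are hypothetical names, not part of the package.

```python
from typing import Dict, List

# Content type -> extra CLI flags, taken from the table above.
RECOMMENDED_FLAGS: Dict[str, List[str]] = {
    "news articles": ["--no-js"],
    "job postings": ["--force-js"],
    "product pages": ["--clean"],
    "documentation": ["--raw"],
    "blog posts": [],  # default settings
    "social media": ["--force-js"],
}


def convert_command(url: str, content_type: str) -> List[str]:
    """Build a `url2md4ai convert` command with the flags suggested for a content type."""
    # Unknown content types fall back to the default settings.
    flags = RECOMMENDED_FLAGS.get(content_type.strip().lower(), [])
    return ["url2md4ai", "convert", url, *flags]


print(convert_command("https://example.com/post", "News Articles"))
```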

📈 Roadmap

- Support for more output formats (PDF, DOCX)
- Custom CSS selector filtering
- Integration with popular LLM APIs
- Web UI
- Plugin system for custom processors
- Support for authentication-required pages

Made with ❤️ by Saverio Mazza

Release history

0.1.2: Jul 6, 2025
0.1.1: Jul 2, 2025 (this version)
0.0.4: Jul 1, 2025
0.0.3: Jul 1, 2025
0.0.2: Jul 1, 2025
0.0.1: Jun 29, 2025

Download files

Source Distribution

url2md4ai-0.1.1.tar.gz (202.9 kB)

Uploaded Jul 2, 2025 Source

Built Distribution

url2md4ai-0.1.1-py3-none-any.whl (13.0 kB)

Uploaded Jul 2, 2025 Python 3

File details

Details for the file url2md4ai-0.1.1.tar.gz.

File metadata

Download URL: url2md4ai-0.1.1.tar.gz
Upload date: Jul 2, 2025
Size: 202.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for url2md4ai-0.1.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `deca2359d6d710018501e0987c7b2e22c55c1528d5bfd5736a4015605043bf3c` |
| MD5 | `2f3e829c6be2ff5e24621beb283994b8` |
| BLAKE2b-256 | `ef159fc212b5e796ff3118a6be710b2a7e085f69d08b14de232d8d5e5c309d28` |

Provenance

The following attestation bundles were made for url2md4ai-0.1.1.tar.gz:

Publisher: release.yml on mazzasaverio/url2md4ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: url2md4ai-0.1.1.tar.gz
- Subject digest: deca2359d6d710018501e0987c7b2e22c55c1528d5bfd5736a4015605043bf3c
- Sigstore transparency entry: 259281731
- Sigstore integration time: Jul 2, 2025
Source repository:
- Permalink: mazzasaverio/url2md4ai@b0b721d2604813640f4e95a54b8520be99e0fa12
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/mazzasaverio
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b0b721d2604813640f4e95a54b8520be99e0fa12
- Trigger Event: push

File details

Details for the file url2md4ai-0.1.1-py3-none-any.whl.

File metadata

Download URL: url2md4ai-0.1.1-py3-none-any.whl
Upload date: Jul 2, 2025
Size: 13.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for url2md4ai-0.1.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `8cc08a57085bc82aa3d4fead2cd18cfe1a818bcf90d364f3740d8c95ef0c24f5` |
| MD5 | `fe2cfa7697b95844718c8af01b23f429` |
| BLAKE2b-256 | `428e2ec718b0de5680844891c94c30a5161af3803ca097e6ad3ceb4d6b6f6d8b` |

Provenance

The following attestation bundles were made for url2md4ai-0.1.1-py3-none-any.whl:

Publisher: release.yml on mazzasaverio/url2md4ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: url2md4ai-0.1.1-py3-none-any.whl
- Subject digest: 8cc08a57085bc82aa3d4fead2cd18cfe1a818bcf90d364f3740d8c95ef0c24f5
- Sigstore transparency entry: 259281737
- Sigstore integration time: Jul 2, 2025
Source repository:
- Permalink: mazzasaverio/url2md4ai@b0b721d2604813640f4e95a54b8520be99e0fa12
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/mazzasaverio
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b0b721d2604813640f4e95a54b8520be99e0fa12
- Trigger Event: push
