
🚀 Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.

Project description

🚀 url2md4ai


🎯 Lean Python tool for extracting clean, LLM-optimized markdown from web pages

Perfect for AI applications that need high-quality text extraction from both static and dynamic web content. Combines Playwright for JavaScript rendering with Trafilatura for intelligent content extraction, delivering markdown specifically optimized for LLM processing and information extraction.

🎯 Why url2md4ai?

Traditional tools extract everything: ads, cookie banners, navigation menus, social media widgets...
url2md4ai extracts only what matters: clean, structured content ready for LLM processing.

# Example: Extract job posting from Satispay careers page
url2md4ai convert "https://www.satispay.com/careers/job-posting" --show-metadata

# Result: ~82% smaller output (from 51KB to 9KB)
# ✅ Clean job title, description, requirements, benefits
# ❌ No cookie banners, ads, or navigation clutter

Perfect for:

  • 🤖 AI content analysis workflows
  • 📊 LLM-based information extraction
  • 🔍 Web scraping for research and analysis
  • 📝 Content preprocessing for RAG systems
  • 🎯 Automated content monitoring

✨ Features

🎯 LLM-Optimized Text Extraction

  • 🧠 Smart Content Extraction: Powered by Trafilatura for intelligent text extraction
  • 🚀 Dynamic Content Support: Full JavaScript rendering with Playwright for SPAs and dynamic sites
  • 🧹 Clean Output: Removes ads, cookie banners, navigation, and other noise for pure content
  • 📊 Maximum Information Density: Optimized markdown specifically designed for LLM processing

⚡ Lean & Efficient

  • 🎯 Focused Purpose: Built specifically for AI/LLM text extraction workflows
  • ⚡ Fast Processing: Optional non-JavaScript mode for static content (3x faster)
  • 🔧 CLI-First: Simple command-line interface for batch processing and automation
  • 🐍 Python API: Clean programmatic access for integration into AI pipelines

๐Ÿ› ๏ธ Production Ready

  • 📁 Smart Filenames: Generate unique, deterministic filenames using URL hashes
  • 🔄 Batch Processing: Parallel processing support for multiple URLs
  • 🎛️ Configurable: Extensive configuration options for different content types
  • 📈 Reliable: Built-in retry logic and error handling
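Hash-based filenames mean the same URL always maps to the same output file, so re-running a batch overwrites instead of duplicating. The exact scheme url2md4ai's URLHasher uses isn't documented here; a minimal stdlib sketch of the idea:

```python
import hashlib

def url_to_filename(url: str, extension: str = "md") -> str:
    """Derive a deterministic, collision-resistant filename from a URL.

    Illustrative only: url2md4ai's actual URLHasher may truncate differently
    or include a slugified title.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{digest}.{extension}"

# Deterministic: the same URL yields the same name on every run
name = url_to_filename("https://example.com")
```

Truncating the digest keeps filenames short while leaving collisions vanishingly unlikely for realistic URL counts.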

🚀 Quick Start

Using uv (Recommended)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
uv sync

# Install Playwright browsers
uv run playwright install chromium

# Convert your first URL
uv run url2md4ai convert "https://example.com"

Using pip

pip install url2md4ai
playwright install chromium
url2md4ai convert "https://example.com"

Using Docker

# Build the image
docker build -t url2md4ai .

# Run with URL conversion
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://example.com"

📖 Usage

CLI Commands

Basic Conversion

# Convert a single URL (with metadata)
url2md4ai convert "https://example.com" --show-metadata

# Convert with custom output file
url2md4ai convert "https://example.com" -o my_page.md

# Convert without JavaScript (3x faster for static content)
url2md4ai convert "https://example.com" --no-js

# Raw extraction (no LLM optimization)
url2md4ai convert "https://example.com" --raw

# Get both HTML and Markdown
url2md4ai convert "https://example.com" --raw --save-html --output-dir raw_content  # Get raw HTML
url2md4ai convert "https://example.com" --clean --output-dir clean_content  # Get clean markdown

Batch Processing

# Convert multiple URLs with parallel processing
url2md4ai batch "https://site1.com" "https://site2.com" "https://site3.com" --concurrency 5

# Continue processing even if some URLs fail
url2md4ai batch "https://site1.com" "https://site2.com" --continue-on-error

# Custom output directory
url2md4ai batch "https://example.com" -d /path/to/output

Preview and Utilities

# Preview conversion without saving
url2md4ai preview "https://example.com" --show-content

# Test different extraction methods
url2md4ai test-extraction "https://example.com" --method both --show-diff

# Generate hash filename for URL
url2md4ai hash "https://example.com"

# Show current configuration
url2md4ai config-info --format json

Python API

from url2md4ai import URLToMarkdownConverter, Config

# Initialize converter
config = Config.from_env()
converter = URLToMarkdownConverter(config)

# Convert URL synchronously (perfect for LLM pipelines)
result = converter.convert_url_sync("https://example.com")

if result.success:
    print(f"📄 Title: {result.title}")
    print(f"📁 Saved as: {result.filename}")
    print(f"📊 Size: {result.file_size:,} characters")
    print(f"⚡ Method: {result.extraction_method}")
    print(f"⏱️  Processing time: {result.processing_time:.2f}s")

    # Use extracted content for LLM processing
    llm_ready_content = result.markdown
    print("🧠 LLM-ready content extracted successfully!")
else:
    print(f"❌ Error: {result.error}")

# Convert URL asynchronously
import asyncio

async def convert_url():
    result = await converter.convert_url("https://example.com")
    return result

result = asyncio.run(convert_url())

# Get both HTML and Markdown from a URL
async def get_html_and_markdown():
    # Initialize converter with raw HTML option
    config = Config(
        clean_content=False,  # Get raw HTML
        llm_optimized=False,  # No extra processing
        wait_for_network_idle=True,  # Wait for dynamic content
        page_wait_timeout=2000  # Wait 2s for dynamic content
    )
    converter = URLToMarkdownConverter(config)
    
    # Get raw HTML first
    result = await converter.convert_url(
        "https://example.com",
        save_to_file=False  # Don't save to file
    )
    raw_html = result.html
    
    # Now get clean markdown with optimizations
    config.clean_content = True
    config.llm_optimized = True
    converter = URLToMarkdownConverter(config)
    
    result = await converter.convert_url(
        "https://example.com",
        save_to_file=True  # Save markdown to file
    )
    clean_markdown = result.markdown
    
    return {
        "html": raw_html,
        "markdown": clean_markdown,
        "title": result.title,
        "metadata": result.metadata
    }

# Use the function
result = asyncio.run(get_html_and_markdown())
print(f"📄 HTML size: {len(result['html']):,} characters")
print(f"📝 Markdown size: {len(result['markdown']):,} characters")
print(f"🏷️  Title: {result['title']}")
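The CLI's batch --concurrency behavior can be reproduced from the Python API with asyncio. In this sketch, convert_one is a stand-in for converter.convert_url (swap in the real call); the semaphore caps how many conversions run at once, and return_exceptions=True mirrors --continue-on-error:

```python
import asyncio

async def convert_one(url: str) -> str:
    # Stand-in for: result = await converter.convert_url(url)
    await asyncio.sleep(0.01)
    return f"# markdown for {url}"

async def convert_many(urls: list[str], concurrency: int = 5) -> list:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        async with sem:  # at most `concurrency` conversions in flight
            return await convert_one(url)

    # return_exceptions=True: a failed URL yields an exception object in the
    # results list instead of aborting the whole batch
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)

results = asyncio.run(convert_many(["https://site1.com", "https://site2.com"]))
```

Results come back in the same order as the input URLs, so pairing them back up for reporting is straightforward.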

Advanced Usage
from url2md4ai import URLToMarkdownConverter, Config, URLHasher

# Custom configuration for specific content types
config = Config(
    timeout=60,
    wait_for_network_idle=True,  # Wait for dynamic content
    page_wait_timeout=2000,  # Wait 2s for dynamic content
    clean_content=True,       # Remove ads/banners
    llm_optimized=True,       # Optimize for LLM processing
    remove_cookie_banners=True,
    remove_navigation=True,
    remove_ads=True,
    remove_social_media=True,
    remove_comments=True,
    output_dir="ai_content",
    user_agent="MyAI/1.0"
)

converter = URLToMarkdownConverter(config)

# Convert with maximum cleaning for LLM processing
import asyncio

result = asyncio.run(converter.convert_url(
    url="https://example.com",
    use_trafilatura=True,     # Use intelligent extraction
    use_javascript=True,      # Handle dynamic content
    favor_precision=True,     # Prefer precision over recall
    include_tables=True,      # Include table content
    include_images=False,     # Exclude image references
    include_formatting=True   # Preserve text formatting
))

if result.success:
    # Perfect for feeding into LLMs
    clean_content = result.markdown
    metadata = result.metadata
    
    print(f"🎯 Extraction quality: {result.extraction_method}")
    print(f"📊 Content size: {result.file_size:,} chars")
    print("🧹 Cleaned and ready for LLM processing!")

# Generate deterministic filenames
hash_value = URLHasher.generate_hash("https://example.com")
filename = URLHasher.generate_filename("https://example.com")
print(f"🔑 Hash: {hash_value}, 📁 Filename: {filename}")

📊 Extraction Quality Examples

Before vs After: Real-World Results

# Complex job posting with cookie banners and ads
url2md4ai convert "https://company.com/careers/position" --show-metadata

Before (Raw HTML): 51KB, 797 lines

  • โŒ Cookie consent banners
  • โŒ Website navigation
  • โŒ Social media widgets
  • โŒ Advertising content
  • โŒ Footer links and legal text

After (url2md4ai): 9KB, 69 lines

  • ✅ Job title and description
  • ✅ Key requirements
  • ✅ Company benefits
  • ✅ Application process
  • ✅ ~82% smaller, 91% fewer lines!

Content Types Optimized for LLM

| Content Type  | Extraction Quality | Best Settings              |
|---------------|--------------------|----------------------------|
| News Articles | ⭐⭐⭐⭐⭐         | --no-js (faster)           |
| Job Postings  | ⭐⭐⭐⭐⭐         | --force-js (complete)      |
| Product Pages | ⭐⭐⭐⭐           | --clean (essential)        |
| Documentation | ⭐⭐⭐⭐⭐         | --raw (preserve structure) |
| Blog Posts    | ⭐⭐⭐⭐⭐         | default settings           |
| Social Media  | ⭐⭐⭐             | --force-js required        |

โš™๏ธ Configuration

Environment Variables

# Content Extraction Settings
export URL2MD_CLEAN_CONTENT=true
export URL2MD_LLM_OPTIMIZED=true
export URL2MD_USE_TRAFILATURA=true

# Dynamic Content Settings
export URL2MD_WAIT_NETWORK=true
export URL2MD_PAGE_TIMEOUT=2000
export URL2MD_HEADLESS=true

# Content Filtering
export URL2MD_REMOVE_COOKIES=true
export URL2MD_REMOVE_NAV=true
export URL2MD_REMOVE_ADS=true
export URL2MD_REMOVE_SOCIAL=true
export URL2MD_REMOVE_COMMENTS=true

# Advanced Settings
export URL2MD_FAVOR_PRECISION=true
export URL2MD_INCLUDE_TABLES=true
export URL2MD_INCLUDE_IMAGES=false
export URL2MD_INCLUDE_FORMATTING=true

# Output Settings
export URL2MD_OUTPUT_DIR="output"
export URL2MD_USE_HASH_FILENAMES=true

# Performance & Reliability
export URL2MD_TIMEOUT=30
export URL2MD_MAX_RETRIES=3
export URL2MD_USER_AGENT="url2md4ai/1.0"
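URL2MD_MAX_RETRIES and URL2MD_TIMEOUT imply a retry-with-backoff loop around each fetch. url2md4ai's actual implementation isn't shown here; the pattern looks roughly like this, with fetch as a hypothetical stand-in for the page-loading call:

```python
import time

def fetch_with_retries(url: str, fetch, max_retries: int = 3, base_delay: float = 0.5):
    """Retry a flaky fetch up to max_retries times with exponential backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as err:  # in practice, catch narrower network errors
            last_error = err
            if attempt < max_retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise last_error

# Simulate a fetch that fails twice, then succeeds:
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "<html>ok</html>"

html = fetch_with_retries("https://example.com", flaky, base_delay=0)
```

Exponential backoff keeps transient failures (rate limits, slow renders) from turning into hard errors without hammering the target server.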

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| **Content Extraction** | | |
| clean_content | true | Remove ads, banners, navigation |
| llm_optimized | true | Post-process for LLM consumption |
| use_trafilatura | true | Use intelligent text extraction |
| **Dynamic Content** | | |
| wait_for_network_idle | true | Wait for network activity to finish |
| page_wait_timeout | 2000 | Wait time for dynamic content (ms) |
| browser_headless | true | Run browser in headless mode |
| **Content Filtering** | | |
| remove_cookie_banners | true | Remove cookie consent UI |
| remove_navigation | true | Remove nav menus and headers |
| remove_ads | true | Remove advertising content |
| remove_social_media | true | Remove social sharing widgets |
| remove_comments | true | Remove user comments |
| **Advanced Settings** | | |
| favor_precision | true | Prefer precision over recall |
| include_tables | true | Include table content |
| include_images | false | Include image references |
| include_formatting | true | Preserve text formatting |
| **Output Settings** | | |
| output_dir | "output" | Default output directory |
| use_hash_filenames | true | Generate deterministic filenames |
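Config.from_env() presumably maps the URL2MD_* variables onto these options. A minimal sketch of the boolean parsing involved (env_bool is illustrative, not url2md4ai's actual helper):

```python
import os

TRUTHY = {"1", "true", "yes", "on"}

def env_bool(name: str, default: bool) -> bool:
    """Read a boolean flag from the environment, tolerating common spellings."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY

os.environ["URL2MD_CLEAN_CONTENT"] = "true"
os.environ["URL2MD_INCLUDE_IMAGES"] = "false"

clean_content = env_bool("URL2MD_CLEAN_CONTENT", True)
include_images = env_bool("URL2MD_INCLUDE_IMAGES", False)
# Numeric settings fall back to the documented defaults when unset:
page_timeout = int(os.environ.get("URL2MD_PAGE_TIMEOUT", "2000"))
```

Treating anything outside the truthy set as false keeps "false", "0", and unset-but-empty values from silently enabling a flag.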

๐Ÿณ Docker Usage

📖 See DOCKER_USAGE.md for comprehensive Docker usage examples and troubleshooting.

Quick Start with Docker

# Build the image
docker build -t url2md4ai .

# Convert single URL with LLM optimization
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://example.com" --show-metadata

# Convert dynamic content with JavaScript rendering
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://spa-app.com" --force-js --show-metadata

# Batch processing with parallel workers
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  batch "https://site1.com" "https://site2.com" --concurrency 5 --show-metadata

Using Docker Compose (Recommended)

# Start with compose for easier management
docker compose run --rm url2md4ai convert "https://example.com" --show-metadata

# Development mode with full environment
docker compose run --rm dev

# Batch processing example
docker compose run --rm url2md4ai \
  batch "https://news.site.com/article1" "https://blog.site.com/post2" \
  --concurrency 3 --continue-on-error --show-metadata

Custom Configuration

# Override LLM optimization settings
docker run --rm \
  -v $(pwd)/output:/app/output \
  -e URL2MD_CLEAN_CONTENT=false \
  -e URL2MD_LLM_OPTIMIZED=false \
  url2md4ai \
  convert "https://example.com" --raw

# Disable JavaScript for faster processing
docker run --rm \
  -v $(pwd)/output:/app/output \
  -e URL2MD_JAVASCRIPT=false \
  url2md4ai \
  convert "https://static-site.com" --no-js

๐Ÿ› ๏ธ Development

Setup Development Environment

# Clone repository
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai

# Install with uv
uv sync

# Install Playwright browsers
uv run playwright install

# Run tests
uv run pytest

# Run linting
uv run ruff check
uv run black --check .

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/url2md4ai

# Run specific test
uv run pytest tests/test_converter.py

📊 Output Format

The tool generates clean, LLM-optimized markdown with:

  • ✅ Preserved heading structure
  • ✅ Clean link formatting
  • ✅ Removed navigation, footer, and sidebar content
  • ✅ Optimized whitespace and line breaks
  • ✅ Title and metadata preservation
  • ✅ Support for complex layouts
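"Optimized whitespace and line breaks" typically means collapsing the blank-line runs left behind after boilerplate is stripped. A small sketch of that post-processing step (an assumption about the approach, not url2md4ai's actual code):

```python
import re

def tidy_markdown(md: str) -> str:
    """Collapse runs of blank lines and trim trailing whitespace."""
    md = re.sub(r"[ \t]+$", "", md, flags=re.MULTILINE)  # strip trailing spaces
    md = re.sub(r"\n{3,}", "\n\n", md)                   # at most one blank line
    return md.strip() + "\n"

messy = "# Title   \n\n\n\nBody text.\n\n\n"
tidy = tidy_markdown(messy)
```

Fewer blank lines means fewer wasted tokens when the markdown is fed to an LLM, without changing the visible structure.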

Example Output

# Page Title

Main content paragraph with [links](https://example.com) preserved.

## Section Heading

- List items preserved
- Proper formatting maintained

**Bold text** and *italic text* converted correctly.

> Blockquotes maintained

```code blocks preserved```

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Guidelines

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code Quality

  • Use black for code formatting
  • Use ruff for linting
  • Add type hints for all functions
  • Write tests for new features
  • Update documentation as needed

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Trafilatura for intelligent content extraction and web scraping
  • Playwright for JavaScript rendering and dynamic content handling
  • html2text for HTML to Markdown conversion
  • Beautiful Soup for HTML parsing and content cleaning
  • Click for the powerful CLI interface
  • Loguru for elegant logging

📈 Roadmap

  • Support for more output formats (PDF, DOCX)
  • Custom CSS selector filtering
  • Integration with popular LLM APIs
  • Web UI interface
  • Plugin system for custom processors
  • Support for authentication-required pages

Made with ❤️ by Saverio Mazza

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

url2md4ai-0.0.4.tar.gz (194.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

url2md4ai-0.0.4-py3-none-any.whl (16.8 kB view details)

Uploaded Python 3

File details

Details for the file url2md4ai-0.0.4.tar.gz.

File metadata

  • Download URL: url2md4ai-0.0.4.tar.gz
  • Upload date:
  • Size: 194.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for url2md4ai-0.0.4.tar.gz
Algorithm Hash digest
SHA256 ff111ff4f09f1fec1d098c6af420aa021c121a6a7c35f57bce4411ebc615c548
MD5 177aa4743328d10bcfb72c1a95f608b1
BLAKE2b-256 6650ae76fa01607cd047548e5faddf3081557a0d7950fbaadc6433b4af29ef98

See more details on using hashes here.

Provenance

The following attestation bundles were made for url2md4ai-0.0.4.tar.gz:

Publisher: release.yml on mazzasaverio/url2md4ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file url2md4ai-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: url2md4ai-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 16.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for url2md4ai-0.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 982dde729a70dde820b78a374ec3153b352350802f0ec5ff56a5b58a7a1b9eb2
MD5 ecf3f898f72d6654976385b4b41011f3
BLAKE2b-256 8d0ce6e9db6741c8374f204814febbf0dd9e9cb51ea16bfad4d83ffc7d11bcc3

See more details on using hashes here.

Provenance

The following attestation bundles were made for url2md4ai-0.0.4-py3-none-any.whl:

Publisher: release.yml on mazzasaverio/url2md4ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
