# url2md4ai

**Lean Python tool for extracting clean, LLM-optimized markdown from web pages.**

Perfect for AI applications that need high-quality text extraction from both static and dynamic web content. url2md4ai combines Playwright for JavaScript rendering with Trafilatura for intelligent content extraction, delivering markdown specifically optimized for LLM processing and information extraction.
## Why url2md4ai?
Traditional tools extract everything: ads, cookie banners, navigation menus, social media widgets...
url2md4ai extracts only what matters: clean, structured content ready for LLM processing.
```bash
# Example: extract a job posting from the Satispay careers page
url2md4ai convert "https://www.satispay.com/careers/job-posting" --show-metadata

# Result: 97% noise reduction (from 51KB to 9KB)
# ✅ Clean job title, description, requirements, and benefits
# ❌ No cookie banners, ads, or navigation clutter
```
Perfect for:

- AI content analysis workflows
- LLM-based information extraction
- Web scraping for research and analysis
- Content preprocessing for RAG systems
- Automated content monitoring
## Features

### LLM-Optimized Text Extraction

- **Smart Content Extraction**: powered by Trafilatura for intelligent text extraction
- **Dynamic Content Support**: full JavaScript rendering with Playwright for SPAs and dynamic sites
- **Clean Output**: removes ads, cookie banners, navigation, and other noise, leaving pure content
- **Maximum Information Density**: markdown specifically designed for LLM processing

### Lean & Efficient

- **Focused Purpose**: built specifically for AI/LLM text extraction workflows
- **Fast Processing**: optional non-JavaScript mode for static content (3x faster)
- **CLI-First**: simple command-line interface for batch processing and automation
- **Python API**: clean programmatic access for integration into AI pipelines

### Production Ready

- **Smart Filenames**: unique, deterministic filenames generated from URL hashes
- **Batch Processing**: parallel processing support for multiple URLs
- **Configurable**: extensive configuration options for different content types
- **Reliable**: built-in retry logic and error handling
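Deterministic hash-based filenames make re-runs idempotent: the same URL always maps to the same output file. A minimal sketch of the idea (the function name, hash algorithm, and digest length here are illustrative assumptions, not necessarily what `URLHasher` actually uses):

```python
import hashlib

def url_to_filename(url: str, length: int = 16, ext: str = ".md") -> str:
    """Map a URL to a short, deterministic filename via SHA-256."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest[:length] + ext
```

Because the mapping is a pure function of the URL, a batch run can safely skip URLs whose output file already exists.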
## Quick Start

### Using uv (Recommended)

```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
uv sync

# Install Playwright browsers
uv run playwright install chromium

# Convert your first URL
uv run url2md4ai convert "https://example.com"
```
### Using pip

```bash
pip install url2md4ai
playwright install chromium
url2md4ai convert "https://example.com"
```
### Using Docker

```bash
# Build the image
docker build -t url2md4ai .

# Run with URL conversion
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://example.com"
```
## Usage

### CLI Commands

#### Basic Conversion

```bash
# Convert a single URL (with metadata)
url2md4ai convert "https://example.com" --show-metadata

# Convert with a custom output file
url2md4ai convert "https://example.com" -o my_page.md

# Convert without JavaScript (3x faster for static content)
url2md4ai convert "https://example.com" --no-js

# Raw extraction (no LLM optimization)
url2md4ai convert "https://example.com" --raw

# Get both HTML and Markdown
url2md4ai convert "https://example.com" --raw --save-html --output-dir raw_content  # raw HTML
url2md4ai convert "https://example.com" --clean --output-dir clean_content          # clean markdown
```
#### Batch Processing

```bash
# Convert multiple URLs with parallel processing
url2md4ai batch "https://site1.com" "https://site2.com" "https://site3.com" --concurrency 5

# Continue processing even if some URLs fail
url2md4ai batch "https://site1.com" "https://site2.com" --continue-on-error

# Custom output directory
url2md4ai batch "https://example.com" -d /path/to/output
```
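Under the hood, `--concurrency` caps how many URLs are in flight at once. The usual pattern is a semaphore-bounded `asyncio.gather`; here is a hedged sketch of that pattern (the function name `gather_bounded` and the error-handling details are illustrative, not the tool's actual internals):

```python
import asyncio

async def gather_bounded(coros, concurrency: int = 5, continue_on_error: bool = True):
    """Run coroutines with at most `concurrency` executing at any moment."""
    sem = asyncio.Semaphore(concurrency)

    async def run(coro):
        async with sem:  # blocks while `concurrency` tasks are already running
            return await coro

    # return_exceptions=True mirrors --continue-on-error: a failed fetch becomes
    # an exception object in the result list instead of aborting the whole batch.
    return await asyncio.gather(
        *(run(c) for c in coros),
        return_exceptions=continue_on_error,
    )
```

Results come back in input order, so failures can be matched to their URLs by index afterwards.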
#### Preview and Utilities

```bash
# Preview conversion without saving
url2md4ai preview "https://example.com" --show-content

# Test different extraction methods
url2md4ai test-extraction "https://example.com" --method both --show-diff

# Generate the hash filename for a URL
url2md4ai hash "https://example.com"

# Show current configuration
url2md4ai config-info --format json
```
### Python API

```python
from url2md4ai import URLToMarkdownConverter, Config

# Initialize converter
config = Config.from_env()
converter = URLToMarkdownConverter(config)

# Convert URL synchronously (perfect for LLM pipelines)
result = converter.convert_url_sync("https://example.com")

if result.success:
    print(f"Title: {result.title}")
    print(f"Saved as: {result.filename}")
    print(f"Size: {result.file_size:,} characters")
    print(f"Method: {result.extraction_method}")
    print(f"Processing time: {result.processing_time:.2f}s")

    # Use the extracted content for LLM processing
    llm_ready_content = result.markdown
    print("LLM-ready content extracted successfully!")
else:
    print(f"Error: {result.error}")

# Convert URL asynchronously
import asyncio

async def convert_url():
    result = await converter.convert_url("https://example.com")
    return result

result = asyncio.run(convert_url())
```
```python
# Get both HTML and Markdown from a URL
import asyncio

from url2md4ai import URLToMarkdownConverter, Config

async def get_html_and_markdown():
    # Initialize converter with raw HTML options
    config = Config(
        clean_content=False,          # Get raw HTML
        llm_optimized=False,          # No extra processing
        wait_for_network_idle=True,   # Wait for dynamic content
        page_wait_timeout=2000,       # Wait 2s for dynamic content
    )
    converter = URLToMarkdownConverter(config)

    # Get the raw HTML first
    result = await converter.convert_url(
        "https://example.com",
        save_to_file=False,  # Don't save to file
    )
    raw_html = result.html

    # Now get clean markdown with optimizations
    config.clean_content = True
    config.llm_optimized = True
    converter = URLToMarkdownConverter(config)
    result = await converter.convert_url(
        "https://example.com",
        save_to_file=True,  # Save markdown to file
    )
    clean_markdown = result.markdown

    return {
        "html": raw_html,
        "markdown": clean_markdown,
        "title": result.title,
        "metadata": result.metadata,
    }

# Use the function
result = asyncio.run(get_html_and_markdown())
print(f"HTML size: {len(result['html']):,} characters")
print(f"Markdown size: {len(result['markdown']):,} characters")
print(f"Title: {result['title']}")
```
#### Advanced Usage

```python
import asyncio

from url2md4ai import URLToMarkdownConverter, Config, URLHasher

# Custom configuration for specific content types
config = Config(
    timeout=60,
    wait_for_network_idle=True,  # Wait for dynamic content
    page_wait_timeout=2000,      # Wait 2s for dynamic content
    clean_content=True,          # Remove ads/banners
    llm_optimized=True,          # Optimize for LLM processing
    remove_cookie_banners=True,
    remove_navigation=True,
    remove_ads=True,
    remove_social_media=True,
    remove_comments=True,
    output_dir="ai_content",
    user_agent="MyAI/1.0",
)
converter = URLToMarkdownConverter(config)

async def extract():
    # Convert with maximum cleaning for LLM processing
    result = await converter.convert_url(
        url="https://example.com",
        use_trafilatura=True,     # Use intelligent extraction
        use_javascript=True,      # Handle dynamic content
        favor_precision=True,     # Prefer precision over recall
        include_tables=True,      # Include table content
        include_images=False,     # Exclude image references
        include_formatting=True,  # Preserve text formatting
    )
    if result.success:
        # Ready for feeding into LLMs
        clean_content = result.markdown
        metadata = result.metadata
        print(f"Extraction quality: {result.extraction_method}")
        print(f"Content size: {result.file_size:,} chars")
        print("Cleaned and ready for LLM processing!")

asyncio.run(extract())

# Generate deterministic filenames
hash_value = URLHasher.generate_hash("https://example.com")
filename = URLHasher.generate_filename("https://example.com")
print(f"Hash: {hash_value}, Filename: {filename}")
```
## Extraction Quality Examples

### Before vs After: Real-World Results

```bash
# Complex job posting with cookie banners and ads
url2md4ai convert "https://company.com/careers/position" --show-metadata
```

**Before (raw HTML): 51KB, 797 lines**

- ❌ Cookie consent banners
- ❌ Website navigation
- ❌ Social media widgets
- ❌ Advertising content
- ❌ Footer links and legal text

**After (url2md4ai): 9KB, 69 lines**

- ✅ Job title and description
- ✅ Key requirements
- ✅ Company benefits
- ✅ Application process
- ✅ 97% noise reduction!
### Content Types Optimized for LLM

| Content Type | Extraction Quality | Best Settings |
|---|---|---|
| News Articles | ⭐⭐⭐⭐⭐ | `--no-js` (faster) |
| Job Postings | ⭐⭐⭐⭐⭐ | `--force-js` (complete) |
| Product Pages | ⭐⭐⭐⭐ | `--clean` (essential) |
| Documentation | ⭐⭐⭐⭐⭐ | `--raw` (preserve structure) |
| Blog Posts | ⭐⭐⭐⭐⭐ | default settings |
| Social Media | ⭐⭐⭐ | `--force-js` required |
## Configuration

### Environment Variables

```bash
# Content extraction settings
export URL2MD_CLEAN_CONTENT=true
export URL2MD_LLM_OPTIMIZED=true
export URL2MD_USE_TRAFILATURA=true

# Dynamic content settings
export URL2MD_WAIT_NETWORK=true
export URL2MD_PAGE_TIMEOUT=2000
export URL2MD_HEADLESS=true

# Content filtering
export URL2MD_REMOVE_COOKIES=true
export URL2MD_REMOVE_NAV=true
export URL2MD_REMOVE_ADS=true
export URL2MD_REMOVE_SOCIAL=true
export URL2MD_REMOVE_COMMENTS=true

# Advanced settings
export URL2MD_FAVOR_PRECISION=true
export URL2MD_INCLUDE_TABLES=true
export URL2MD_INCLUDE_IMAGES=false
export URL2MD_INCLUDE_FORMATTING=true

# Output settings
export URL2MD_OUTPUT_DIR="output"
export URL2MD_USE_HASH_FILENAMES=true

# Performance & reliability
export URL2MD_TIMEOUT=30
export URL2MD_MAX_RETRIES=3
export URL2MD_USER_AGENT="url2md4ai/1.0"
```
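These variables are what `Config.from_env()` reads. For intuition, here is a minimal sketch of how this kind of env-to-config mapping typically works (the helper names and the subset of fields shown are illustrative assumptions, not the library's actual code):

```python
import os
from dataclasses import dataclass

def _env_bool(name: str, default: bool) -> bool:
    # Treat "true"/"1"/"yes" (case-insensitive) as truthy.
    return os.environ.get(name, str(default)).strip().lower() in {"true", "1", "yes"}

@dataclass
class EnvConfig:
    clean_content: bool = True
    llm_optimized: bool = True
    timeout: int = 30
    output_dir: str = "output"

    @classmethod
    def from_env(cls) -> "EnvConfig":
        # Every field falls back to its default when the variable is unset.
        return cls(
            clean_content=_env_bool("URL2MD_CLEAN_CONTENT", True),
            llm_optimized=_env_bool("URL2MD_LLM_OPTIMIZED", True),
            timeout=int(os.environ.get("URL2MD_TIMEOUT", "30")),
            output_dir=os.environ.get("URL2MD_OUTPUT_DIR", "output"),
        )
```

Because every field has a default, an empty environment still yields a usable configuration.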
### Configuration Options

| Option | Default | Description |
|---|---|---|
| **Content Extraction** | | |
| `clean_content` | `true` | Remove ads, banners, navigation |
| `llm_optimized` | `true` | Post-process for LLM consumption |
| `use_trafilatura` | `true` | Use intelligent text extraction |
| **Dynamic Content** | | |
| `wait_for_network_idle` | `true` | Wait for network activity to finish |
| `page_wait_timeout` | `2000` | Wait time for dynamic content (ms) |
| `browser_headless` | `true` | Run browser in headless mode |
| **Content Filtering** | | |
| `remove_cookie_banners` | `true` | Remove cookie consent UI |
| `remove_navigation` | `true` | Remove nav menus and headers |
| `remove_ads` | `true` | Remove advertising content |
| `remove_social_media` | `true` | Remove social sharing widgets |
| `remove_comments` | `true` | Remove user comments |
| **Advanced Settings** | | |
| `favor_precision` | `true` | Prefer precision over recall |
| `include_tables` | `true` | Include table content |
| `include_images` | `false` | Include image references |
| `include_formatting` | `true` | Preserve text formatting |
| **Output Settings** | | |
| `output_dir` | `"output"` | Default output directory |
| `use_hash_filenames` | `true` | Generate deterministic filenames |
## Docker Usage

See DOCKER_USAGE.md for comprehensive Docker usage examples and troubleshooting.
### Quick Start with Docker

```bash
# Build the image
docker build -t url2md4ai .

# Convert a single URL with LLM optimization
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://example.com" --show-metadata

# Convert dynamic content with JavaScript rendering
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://spa-app.com" --force-js --show-metadata

# Batch processing with parallel workers
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  batch "https://site1.com" "https://site2.com" --concurrency 5 --show-metadata
```
### Using Docker Compose (Recommended)

```bash
# Start with compose for easier management
docker compose run --rm url2md4ai convert "https://example.com" --show-metadata

# Development mode with full environment
docker compose run --rm dev

# Batch processing example
docker compose run --rm url2md4ai \
  batch "https://news.site.com/article1" "https://blog.site.com/post2" \
  --concurrency 3 --continue-on-error --show-metadata
```
### Custom Configuration

```bash
# Override LLM optimization settings
docker run --rm \
  -v $(pwd)/output:/app/output \
  -e URL2MD_CLEAN_CONTENT=false \
  -e URL2MD_LLM_OPTIMIZED=false \
  url2md4ai \
  convert "https://example.com" --raw

# Disable JavaScript for faster processing
docker run --rm \
  -v $(pwd)/output:/app/output \
  -e URL2MD_JAVASCRIPT=false \
  url2md4ai \
  convert "https://static-site.com" --no-js
```
## Development

### Setup Development Environment

```bash
# Clone the repository
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai

# Install with uv
uv sync

# Install Playwright browsers
uv run playwright install

# Run tests
uv run pytest

# Run linting
uv run ruff check
uv run black --check .
```
### Running Tests

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/url2md4ai

# Run a specific test
uv run pytest tests/test_converter.py
```
## Output Format

The tool generates clean, LLM-optimized markdown with:

- ✅ Preserved heading structure
- ✅ Clean link formatting
- ✅ Navigation, footer, and sidebar content removed
- ✅ Optimized whitespace and line breaks
- ✅ Title and metadata preservation
- ✅ Support for complex layouts
### Example Output

```markdown
# Page Title

Main content paragraph with [links](https://example.com) preserved.

## Section Heading

- List items preserved
- Proper formatting maintained

**Bold text** and *italic text* converted correctly.

> Blockquotes maintained

`code blocks preserved`
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

### Development Guidelines

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Code Quality

- Use `black` for code formatting
- Use `ruff` for linting
- Add type hints for all functions
- Write tests for new features
- Update documentation as needed
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Trafilatura for intelligent content extraction and web scraping
- Playwright for JavaScript rendering and dynamic content handling
- html2text for HTML to Markdown conversion
- Beautiful Soup for HTML parsing and content cleaning
- Click for the powerful CLI interface
- Loguru for elegant logging
## Roadmap
- Support for more output formats (PDF, DOCX)
- Custom CSS selector filtering
- Integration with popular LLM APIs
- Web UI interface
- Plugin system for custom processors
- Support for authentication-required pages