
🚀 url2md4ai


🎯 Lean Python tool for extracting clean, LLM-optimized markdown from web pages

Perfect for AI applications that need high-quality text extraction from both static and dynamic web content. Combines Playwright for JavaScript rendering with Trafilatura for intelligent content extraction, delivering markdown specifically optimized for LLM processing and information extraction.

🎯 Why url2md4ai?

Traditional tools extract everything: ads, cookie banners, navigation menus, social media widgets...
url2md4ai extracts only what matters: clean, structured content ready for LLM processing.

# Example: Extract job posting from Satispay careers page
url2md4ai convert "https://www.satispay.com/careers/job-posting" --show-metadata

# Result: 97% noise reduction (from 51KB to 9KB)
# ✅ Clean job title, description, requirements, benefits
# ❌ No cookie banners, ads, or navigation clutter

Perfect for:

  • 🤖 AI content analysis workflows
  • 📊 LLM-based information extraction
  • 🔍 Web scraping for research and analysis
  • 📝 Content preprocessing for RAG systems
  • 🎯 Automated content monitoring

✨ Features

🎯 LLM-Optimized Text Extraction

  • 🧠 Smart Content Extraction: Powered by Trafilatura for intelligent text extraction
  • 🚀 Dynamic Content Support: Full JavaScript rendering with Playwright for SPAs and dynamic sites
  • 🧹 Clean Output: Removes ads, cookie banners, navigation, and other noise for pure content
  • 📊 Maximum Information Density: Optimized markdown specifically designed for LLM processing

⚡ Lean & Efficient

  • 🎯 Focused Purpose: Built specifically for AI/LLM text extraction workflows
  • ⚡ Fast Processing: Optional non-JavaScript mode for static content (3x faster)
  • 🔧 CLI-First: Simple command-line interface for batch processing and automation
  • 🐍 Python API: Clean programmatic access for integration into AI pipelines

๐Ÿ› ๏ธ Production Ready

  • ๐Ÿ“ Smart Filenames: Generate unique, deterministic filenames using URL hashes
  • ๐Ÿ”„ Batch Processing: Parallel processing support for multiple URLs
  • ๐ŸŽ›๏ธ Configurable: Extensive configuration options for different content types
  • ๐Ÿ“ˆ Reliable: Built-in retry logic and error handling
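The hash-based filename scheme can be sketched in a few lines (an illustrative approximation assuming a SHA-256 digest; the library's own URLHasher may differ in detail):

```python
import hashlib

def hash_filename(url: str, length: int = 16) -> str:
    """Map a URL to a deterministic .md filename (illustrative sketch)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{digest[:length]}.md"

# The same URL always yields the same name, so repeated runs
# overwrite a single file instead of accumulating duplicates.
name = hash_filename("https://example.com")
```

Deterministic names make batch runs idempotent: re-converting a URL replaces its previous output rather than creating a new file.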

🚀 Quick Start

Using uv (Recommended)

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
uv sync

# Install Playwright browsers
uv run playwright install chromium

# Convert your first URL
uv run url2md4ai convert "https://example.com"

Using pip

pip install url2md4ai
playwright install chromium
url2md4ai convert "https://example.com"

Using Docker

# Build the image
docker build -t url2md4ai .

# Run with URL conversion
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://example.com"

📖 Usage

CLI Commands

Basic Conversion

# Convert a single URL (with metadata)
url2md4ai convert "https://example.com" --show-metadata

# Convert with custom output file
url2md4ai convert "https://example.com" -o my_page.md

# Convert without JavaScript (3x faster for static content)
url2md4ai convert "https://example.com" --no-js

# Raw extraction (no LLM optimization)
url2md4ai convert "https://example.com" --raw

Batch Processing

# Convert multiple URLs with parallel processing
url2md4ai batch "https://site1.com" "https://site2.com" "https://site3.com" --concurrency 5

# Continue processing even if some URLs fail
url2md4ai batch "https://site1.com" "https://site2.com" --continue-on-error

# Custom output directory
url2md4ai batch "https://example.com" -d /path/to/output

Preview and Utilities

# Preview conversion without saving
url2md4ai preview "https://example.com" --show-content

# Test different extraction methods
url2md4ai test-extraction "https://example.com" --method both --show-diff

# Generate hash filename for URL
url2md4ai hash "https://example.com"

# Show current configuration
url2md4ai config-info --format json

Python API

from url2md4ai import URLToMarkdownConverter, Config

# Initialize converter
config = Config.from_env()
converter = URLToMarkdownConverter(config)

# Convert URL synchronously (perfect for LLM pipelines)
result = converter.convert_url_sync("https://example.com")

if result.success:
    print(f"📄 Title: {result.title}")
    print(f"📁 Saved as: {result.filename}")
    print(f"📊 Size: {result.file_size:,} characters")
    print(f"⚡ Method: {result.extraction_method}")
    print(f"⏱️ Processing time: {result.processing_time:.2f}s")

    # Use extracted content for LLM processing
    llm_ready_content = result.markdown
    print("🧠 LLM-ready content extracted successfully!")
else:
    print(f"❌ Error: {result.error}")

# Convert URL asynchronously
import asyncio

async def convert_url():
    result = await converter.convert_url("https://example.com")
    return result

result = asyncio.run(convert_url())
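For batch workloads, the CLI's --concurrency behavior can be reproduced on top of the async API with a semaphore. A minimal sketch, where fetch_markdown is a hypothetical stand-in for converter.convert_url:

```python
import asyncio

async def fetch_markdown(url: str) -> str:
    # Hypothetical stand-in for converter.convert_url(url);
    # swap in the real call inside your pipeline.
    await asyncio.sleep(0.01)
    return f"# Content of {url}"

async def convert_many(urls: list[str], concurrency: int = 5) -> list:
    """Convert URLs in parallel with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str):
        async with sem:
            return await fetch_markdown(url)

    # return_exceptions=True mirrors --continue-on-error: a failing URL
    # yields an exception object instead of aborting the whole batch.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)

results = asyncio.run(convert_many(["https://site1.com", "https://site2.com"]))
```

asyncio.gather preserves input order, so results line up with the URL list even when conversions finish out of order.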

Advanced Usage

from url2md4ai import URLToMarkdownConverter, Config, URLHasher

# Custom configuration for specific content types
config = Config(
    timeout=60,
    javascript_enabled=True,  # Essential for SPAs
    clean_content=True,       # Remove ads/banners
    llm_optimized=True,       # Optimize for LLM processing
    remove_cookie_banners=True,
    remove_navigation=True,
    remove_ads=True,
    output_dir="ai_content",
    user_agent="MyAI/1.0"
)

converter = URLToMarkdownConverter(config)

# Convert with maximum cleaning for LLM processing (run inside an async function)
result = await converter.convert_url(
    url="https://example.com",
    use_javascript=True,      # Handle dynamic content
    use_trafilatura=True      # Use intelligent extraction
)

if result.success:
    # Perfect for feeding into LLMs
    clean_content = result.markdown
    metadata = result.metadata

    print(f"🎯 Extraction quality: {result.extraction_method}")
    print(f"📊 Content size: {result.file_size:,} chars")
    print("🧹 Cleaned and ready for LLM processing!")

# Generate deterministic filenames
hash_value = URLHasher.generate_hash("https://example.com")
filename = URLHasher.generate_filename("https://example.com")
print(f"🔑 Hash: {hash_value}, 📁 Filename: {filename}")

📊 Extraction Quality Examples

Before vs After: Real-World Results

# Complex job posting with cookie banners and ads
url2md4ai convert "https://company.com/careers/position" --show-metadata

Before (Raw HTML): 51KB, 797 lines

  • โŒ Cookie consent banners
  • โŒ Website navigation
  • โŒ Social media widgets
  • โŒ Advertising content
  • โŒ Footer links and legal text

After (url2md4ai): 9KB, 69 lines

  • ✅ Job title and description
  • ✅ Key requirements
  • ✅ Company benefits
  • ✅ Application process
  • ✅ 97% noise reduction!

Content Types Optimized for LLM

Content Type     Extraction Quality   Best Settings
News Articles    ⭐⭐⭐⭐⭐           --no-js (faster)
Job Postings     ⭐⭐⭐⭐⭐           --force-js (complete)
Product Pages    ⭐⭐⭐⭐             --clean (essential)
Documentation    ⭐⭐⭐⭐⭐           --raw (preserve structure)
Blog Posts       ⭐⭐⭐⭐⭐           default settings
Social Media     ⭐⭐⭐               --force-js required

โš™๏ธ Configuration

Environment Variables

# LLM-Optimized Extraction Settings
export URL2MD_CLEAN_CONTENT=true
export URL2MD_LLM_OPTIMIZED=true
export URL2MD_USE_TRAFILATURA=true

# Content Filtering (Noise Removal)
export URL2MD_REMOVE_COOKIES=true
export URL2MD_REMOVE_NAV=true
export URL2MD_REMOVE_ADS=true
export URL2MD_REMOVE_SOCIAL=true

# JavaScript Rendering
export URL2MD_JAVASCRIPT=true
export URL2MD_HEADLESS=true
export URL2MD_PAGE_TIMEOUT=2000

# Output Settings
export URL2MD_OUTPUT_DIR="output"
export URL2MD_USE_HASH_FILENAMES=true

# Performance & Reliability
export URL2MD_TIMEOUT=30
export URL2MD_MAX_RETRIES=3
export URL2MD_USER_AGENT="url2md4ai/1.0"

Configuration Options

Option                  Default    Description

LLM Optimization
  clean_content         true       Remove ads, banners, navigation
  llm_optimized         true       Post-process for LLM consumption
  use_trafilatura       true       Use intelligent text extraction

Content Filtering
  remove_cookie_banners true       Remove cookie consent UI
  remove_navigation     true       Remove nav menus and headers
  remove_ads            true       Remove advertising content
  remove_social_media   true       Remove social sharing widgets

JavaScript Rendering
  javascript_enabled    true       Enable dynamic content rendering
  browser_headless      true       Run browser in headless mode
  page_wait_timeout     2000       Wait time for page loading (ms)

Output Settings
  output_dir            "output"   Default output directory
  use_hash_filenames    true       Generate deterministic filenames
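How an env-driven loader like Config.from_env() might map URL2MD_* variables can be sketched with a small dataclass. This is a simplified illustration covering three of the options above, not the library's actual Config class:

```python
import os
from dataclasses import dataclass

@dataclass
class ExtractionConfig:
    clean_content: bool = True
    javascript_enabled: bool = True
    output_dir: str = "output"

    @classmethod
    def from_env(cls) -> "ExtractionConfig":
        # Treat "1", "true", "yes" (any case) as truthy; everything else as false.
        def flag(name: str, default: bool) -> bool:
            return os.environ.get(name, str(default)).lower() in ("1", "true", "yes")
        return cls(
            clean_content=flag("URL2MD_CLEAN_CONTENT", True),
            javascript_enabled=flag("URL2MD_JAVASCRIPT", True),
            output_dir=os.environ.get("URL2MD_OUTPUT_DIR", "output"),
        )

os.environ["URL2MD_JAVASCRIPT"] = "false"
config = ExtractionConfig.from_env()
```

Unset variables fall back to the defaults in the table, so a bare environment yields the same behavior as no configuration at all.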

๐Ÿณ Docker Usage

📖 See DOCKER_USAGE.md for comprehensive Docker usage examples and troubleshooting.

Quick Start with Docker

# Build the image
docker build -t url2md4ai .

# Convert single URL with LLM optimization
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://example.com" --show-metadata

# Convert dynamic content with JavaScript rendering
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://spa-app.com" --force-js --show-metadata

# Batch processing with parallel workers
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  batch "https://site1.com" "https://site2.com" --concurrency 5 --show-metadata

Using Docker Compose (Recommended)

# Start with compose for easier management
docker compose run --rm url2md4ai convert "https://example.com" --show-metadata

# Development mode with full environment
docker compose run --rm dev

# Batch processing example
docker compose run --rm url2md4ai \
  batch "https://news.site.com/article1" "https://blog.site.com/post2" \
  --concurrency 3 --continue-on-error --show-metadata

Custom Configuration

# Override LLM optimization settings
docker run --rm \
  -v $(pwd)/output:/app/output \
  -e URL2MD_CLEAN_CONTENT=false \
  -e URL2MD_LLM_OPTIMIZED=false \
  url2md4ai \
  convert "https://example.com" --raw

# Disable JavaScript for faster processing
docker run --rm \
  -v $(pwd)/output:/app/output \
  -e URL2MD_JAVASCRIPT=false \
  url2md4ai \
  convert "https://static-site.com" --no-js

๐Ÿ› ๏ธ Development

Setup Development Environment

# Clone repository
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai

# Install with uv
uv sync

# Install Playwright browsers
uv run playwright install

# Run tests
uv run pytest

# Run linting
uv run ruff check
uv run black --check .

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/url2md4ai

# Run specific test
uv run pytest tests/test_converter.py

📊 Output Format

The tool generates clean, LLM-optimized markdown with:

  • ✅ Preserved heading structure
  • ✅ Clean link formatting
  • ✅ Removed navigation, footer, and sidebar content
  • ✅ Optimized whitespace and line breaks
  • ✅ Title and metadata preservation
  • ✅ Support for complex layouts
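The whitespace optimization can be approximated by a short post-processing pass (an illustrative sketch, not the tool's internal cleaning pipeline): strip trailing spaces and collapse runs of blank lines.

```python
import re

def tidy_markdown(text: str) -> str:
    """Strip trailing spaces and collapse 2+ consecutive blank lines into one."""
    lines = [line.rstrip() for line in text.splitlines()]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip() + "\n"

tidied = tidy_markdown("# Title   \n\n\n\nBody text\n")
```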

Example Output

# Page Title

Main content paragraph with [links](https://example.com) preserved.

## Section Heading

- List items preserved
- Proper formatting maintained

**Bold text** and *italic text* converted correctly.

> Blockquotes maintained

```code blocks preserved```

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Guidelines

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code Quality

  • Use black for code formatting
  • Use ruff for linting
  • Add type hints for all functions
  • Write tests for new features
  • Update documentation as needed

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Trafilatura for intelligent content extraction and web scraping
  • Playwright for JavaScript rendering and dynamic content handling
  • html2text for HTML to Markdown conversion
  • Beautiful Soup for HTML parsing and content cleaning
  • Click for the powerful CLI interface
  • Loguru for elegant logging

📈 Roadmap

  • Support for more output formats (PDF, DOCX)
  • Custom CSS selector filtering
  • Integration with popular LLM APIs
  • Web UI interface
  • Plugin system for custom processors
  • Support for authentication-required pages

Made with ❤️ by Saverio Mazza
