Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.
url2md4ai
Lean Python tool for extracting clean, LLM-optimized markdown from web pages
Perfect for AI applications that need high-quality text extraction from both static and dynamic web content. Combines Playwright for JavaScript rendering with Trafilatura for intelligent content extraction, delivering markdown specifically optimized for LLM processing and information extraction.
Why url2md4ai?
Traditional tools extract everything: ads, cookie banners, navigation menus, social media widgets...
url2md4ai extracts only what matters: clean, structured content ready for LLM processing.
# Example: Extract job posting from Satispay careers page
url2md4ai convert "https://www.satispay.com/careers/job-posting" --show-metadata
# Result: output shrinks from 51KB to 9KB
# Kept: clean job title, description, requirements, benefits
# Removed: cookie banners, ads, and navigation clutter
Perfect for:
- AI content analysis workflows
- LLM-based information extraction
- Web scraping for research and analysis
- Content preprocessing for RAG systems
- Automated content monitoring
Features
LLM-Optimized Text Extraction
- Smart Content Extraction: Powered by Trafilatura for intelligent text extraction
- Dynamic Content Support: Full JavaScript rendering with Playwright for SPAs and dynamic sites
- Clean Output: Removes ads, cookie banners, navigation, and other noise for pure content
- Maximum Information Density: Optimized markdown specifically designed for LLM processing
Lean & Efficient
- Focused Purpose: Built specifically for AI/LLM text extraction workflows
- Fast Processing: Optional non-JavaScript mode for static content (3x faster)
- CLI-First: Simple command-line interface for batch processing and automation
- Python API: Clean programmatic access for integration into AI pipelines
Production Ready
- Smart Filenames: Generate unique, deterministic filenames using URL hashes
- Batch Processing: Parallel processing support for multiple URLs
- Configurable: Extensive configuration options for different content types
- Reliable: Built-in retry logic and error handling
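The retry behaviour listed under Production Ready is not documented in detail here. Purely as an illustration, retrying with exponential backoff could look like the sketch below. The helper name `with_retries`, the `base_delay` parameter, and the doubling schedule are assumptions made for this example, not url2md4ai's actual implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying up to max_retries extra times on any exception.

    Hypothetical helper: the wait doubles after each failed attempt
    (base_delay, 2 * base_delay, 4 * base_delay, ...).
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.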
Quick Start
Using uv (Recommended)
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and install
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
uv sync
# Install Playwright browsers
uv run playwright install chromium
# Convert your first URL
uv run url2md4ai convert "https://example.com"
Using pip
pip install url2md4ai
playwright install chromium
url2md4ai convert "https://example.com"
Using Docker
# Build the image
docker build -t url2md4ai .
# Run with URL conversion
docker run --rm \
-v $(pwd)/output:/app/output \
url2md4ai \
convert "https://example.com"
Usage
CLI Commands
Basic Conversion
# Convert a single URL (with metadata)
url2md4ai convert "https://example.com" --show-metadata
# Convert with custom output file
url2md4ai convert "https://example.com" -o my_page.md
# Convert without JavaScript (3x faster for static content)
url2md4ai convert "https://example.com" --no-js
# Raw extraction (no LLM optimization)
url2md4ai convert "https://example.com" --raw
Batch Processing
# Convert multiple URLs with parallel processing
url2md4ai batch "https://site1.com" "https://site2.com" "https://site3.com" --concurrency 5
# Continue processing even if some URLs fail
url2md4ai batch "https://site1.com" "https://site2.com" --continue-on-error
# Custom output directory
url2md4ai batch "https://example.com" -d /path/to/output
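To make the `--concurrency` and `--continue-on-error` options more concrete, here is a minimal sketch of bounded parallel conversion with `asyncio`. It is not the tool's own batch code: `convert_batch` and the `convert` callable are hypothetical stand-ins, and in real use `convert` would wrap something like `converter.convert_url`.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable


async def convert_batch(
    urls: list[str],
    convert: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
    continue_on_error: bool = True,
) -> dict[str, str | BaseException]:
    """Run convert(url) for every URL with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url: str) -> str:
        async with semaphore:  # waits while `concurrency` conversions are running
            return await convert(url)

    # With return_exceptions=True a failure is returned as a value
    # instead of aborting the whole batch.
    results = await asyncio.gather(
        *(worker(url) for url in urls),
        return_exceptions=continue_on_error,
    )
    return dict(zip(urls, results))
```

Failed URLs then appear in the result mapping as exception objects, so the caller can report them after the successful conversions are saved.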
Preview and Utilities
# Preview conversion without saving
url2md4ai preview "https://example.com" --show-content
# Test different extraction methods
url2md4ai test-extraction "https://example.com" --method both --show-diff
# Generate hash filename for URL
url2md4ai hash "https://example.com"
# Show current configuration
url2md4ai config-info --format json
Python API
from url2md4ai import URLToMarkdownConverter, Config
# Initialize converter
config = Config.from_env()
converter = URLToMarkdownConverter(config)
# Convert URL synchronously (perfect for LLM pipelines)
result = converter.convert_url_sync("https://example.com")
if result.success:
    print(f"Title: {result.title}")
    print(f"Saved as: {result.filename}")
    print(f"Size: {result.file_size:,} characters")
    print(f"Method: {result.extraction_method}")
    print(f"Processing time: {result.processing_time:.2f}s")
    # Use extracted content for LLM processing
    llm_ready_content = result.markdown
    print("LLM-ready content extracted successfully!")
else:
    print(f"Error: {result.error}")
# Convert URL asynchronously
import asyncio
async def convert_url():
    result = await converter.convert_url("https://example.com")
    return result
result = asyncio.run(convert_url())
Advanced Usage
import asyncio
from url2md4ai import URLToMarkdownConverter, Config, URLHasher
# Custom configuration for specific content types
config = Config(
    timeout=60,
    javascript_enabled=True,  # Essential for SPAs
    clean_content=True,  # Remove ads/banners
    llm_optimized=True,  # Optimize for LLM processing
    remove_cookie_banners=True,
    remove_navigation=True,
    remove_ads=True,
    output_dir="ai_content",
    user_agent="MyAI/1.0",
)
converter = URLToMarkdownConverter(config)
async def main():
    # Convert with maximum cleaning for LLM processing
    result = await converter.convert_url(
        url="https://example.com",
        use_javascript=True,  # Handle dynamic content
        use_trafilatura=True,  # Use intelligent extraction
    )
    if result.success:
        # Perfect for feeding into LLMs
        clean_content = result.markdown
        metadata = result.metadata
        print(f"Extraction quality: {result.extraction_method}")
        print(f"Content size: {result.file_size:,} chars")
        print("Cleaned and ready for LLM processing!")
# `await` is only valid inside a coroutine, so run the conversion through asyncio.run
asyncio.run(main())
# Generate deterministic filenames
hash_value = URLHasher.generate_hash("https://example.com")
filename = URLHasher.generate_filename("https://example.com")
print(f"Hash: {hash_value}, Filename: {filename}")
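The exact scheme behind `URLHasher` is not described here. As a sketch of the general idea of deterministic filenames, the example below hashes the URL with SHA-256 and keeps a prefix of the hex digest. The function names, the choice of SHA-256, the 16-character length, and the `.md` suffix are all assumptions for illustration and may not match what the library produces.

```python
import hashlib


def url_hash(url: str, length: int = 16) -> str:
    """Return a fixed-length hex prefix of the URL's SHA-256 digest (assumed scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:length]


def url_filename(url: str, extension: str = ".md") -> str:
    """Build a filename that is always the same for the same URL."""
    return url_hash(url) + extension
```

Because the name depends only on the URL, converting the same page twice overwrites the earlier file instead of creating a duplicate.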
Extraction Quality Examples
Before vs After: Real-World Results
# Complex job posting with cookie banners and ads
url2md4ai convert "https://company.com/careers/position" --show-metadata
Before (raw HTML): 51KB, 797 lines, including:
- Cookie consent banners
- Website navigation
- Social media widgets
- Advertising content
- Footer links and legal text
After (url2md4ai): 9KB, 69 lines, keeping only:
- Job title and description
- Key requirements
- Company benefits
- Application process
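The before and after figures can be turned into reduction percentages with a line of arithmetic. The snippet below is only a worked calculation on the sizes and line counts quoted above, not a measurement produced by the tool; `reduction_percent` is a helper defined just for this example.

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


size_reduction = reduction_percent(51, 9)    # kilobytes
line_reduction = reduction_percent(797, 69)  # lines

print(f"Size: {size_reduction:.1f}% smaller")    # Size: 82.4% smaller
print(f"Lines: {line_reduction:.1f}% fewer")     # Lines: 91.3% fewer
```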
Content Types Optimized for LLM
| Content Type | Extraction Quality | Best Settings |
|---|---|---|
| News Articles | ⭐⭐⭐⭐⭐ | --no-js (faster) |
| Job Postings | ⭐⭐⭐⭐⭐ | --force-js (complete) |
| Product Pages | ⭐⭐⭐⭐ | --clean (essential) |
| Documentation | ⭐⭐⭐⭐⭐ | --raw (preserve structure) |
| Blog Posts | ⭐⭐⭐⭐⭐ | default settings |
| Social Media | ⭐⭐⭐ | --force-js required |
Configuration
Environment Variables
# LLM-Optimized Extraction Settings
export URL2MD_CLEAN_CONTENT=true
export URL2MD_LLM_OPTIMIZED=true
export URL2MD_USE_TRAFILATURA=true
# Content Filtering (Noise Removal)
export URL2MD_REMOVE_COOKIES=true
export URL2MD_REMOVE_NAV=true
export URL2MD_REMOVE_ADS=true
export URL2MD_REMOVE_SOCIAL=true
# JavaScript Rendering
export URL2MD_JAVASCRIPT=true
export URL2MD_HEADLESS=true
export URL2MD_PAGE_TIMEOUT=2000
# Output Settings
export URL2MD_OUTPUT_DIR="output"
export URL2MD_USE_HASH_FILENAMES=true
# Performance & Reliability
export URL2MD_TIMEOUT=30
export URL2MD_MAX_RETRIES=3
export URL2MD_USER_AGENT="url2md4ai/1.0"
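How these variables are parsed internally is not shown here; the supported entry point is `Config.from_env()`. As a rough sketch of the usual pattern, the example below reads a few of them from a mapping and falls back to the values used in the examples above. `EnvSettings`, `load_settings`, and the set of strings accepted as "true" are assumptions for illustration only.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Interpret common truthy strings; anything else counts as False."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _env_int(env: Mapping[str, str], name: str, default: int) -> int:
    raw = env.get(name)
    return default if raw is None else int(raw)


@dataclass
class EnvSettings:  # hypothetical container, not the library's Config class
    javascript: bool
    page_timeout_ms: int
    max_retries: int
    output_dir: str


def load_settings(env: Mapping[str, str] = os.environ) -> EnvSettings:
    """Read a subset of the URL2MD_* variables with example defaults."""
    return EnvSettings(
        javascript=_env_bool(env, "URL2MD_JAVASCRIPT", True),
        page_timeout_ms=_env_int(env, "URL2MD_PAGE_TIMEOUT", 2000),
        max_retries=_env_int(env, "URL2MD_MAX_RETRIES", 3),
        output_dir=env.get("URL2MD_OUTPUT_DIR", "output"),
    )
```

Accepting the environment as a parameter makes the parsing testable with a plain dictionary instead of the real process environment.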
Configuration Options
| Option | Default | Description |
|---|---|---|
| LLM Optimization | | |
| `clean_content` | `true` | Remove ads, banners, navigation |
| `llm_optimized` | `true` | Post-process for LLM consumption |
| `use_trafilatura` | `true` | Use intelligent text extraction |
| Content Filtering | | |
| `remove_cookie_banners` | `true` | Remove cookie consent UI |
| `remove_navigation` | `true` | Remove nav menus and headers |
| `remove_ads` | `true` | Remove advertising content |
| `remove_social_media` | `true` | Remove social sharing widgets |
| JavaScript Rendering | | |
| `javascript_enabled` | `true` | Enable dynamic content rendering |
| `browser_headless` | `true` | Run browser in headless mode |
| `page_wait_timeout` | `2000` | Wait time for page loading (ms) |
| Output Settings | | |
| `output_dir` | `"output"` | Default output directory |
| `use_hash_filenames` | `true` | Generate deterministic filenames |
Docker Usage
See DOCKER_USAGE.md for comprehensive Docker usage examples and troubleshooting.
Quick Start with Docker
# Build the image
docker build -t url2md4ai .
# Convert single URL with LLM optimization
docker run --rm \
-v $(pwd)/output:/app/output \
url2md4ai \
convert "https://example.com" --show-metadata
# Convert dynamic content with JavaScript rendering
docker run --rm \
-v $(pwd)/output:/app/output \
url2md4ai \
convert "https://spa-app.com" --force-js --show-metadata
# Batch processing with parallel workers
docker run --rm \
-v $(pwd)/output:/app/output \
url2md4ai \
batch "https://site1.com" "https://site2.com" --concurrency 5 --show-metadata
Using Docker Compose (Recommended)
# Start with compose for easier management
docker compose run --rm url2md4ai convert "https://example.com" --show-metadata
# Development mode with full environment
docker compose run --rm dev
# Batch processing example
docker compose run --rm url2md4ai \
batch "https://news.site.com/article1" "https://blog.site.com/post2" \
--concurrency 3 --continue-on-error --show-metadata
Custom Configuration
# Override LLM optimization settings
docker run --rm \
-v $(pwd)/output:/app/output \
-e URL2MD_CLEAN_CONTENT=false \
-e URL2MD_LLM_OPTIMIZED=false \
url2md4ai \
convert "https://example.com" --raw
# Disable JavaScript for faster processing
docker run --rm \
-v $(pwd)/output:/app/output \
-e URL2MD_JAVASCRIPT=false \
url2md4ai \
convert "https://static-site.com" --no-js
Development
Setup Development Environment
# Clone repository
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
# Install with uv
uv sync
# Install Playwright browsers
uv run playwright install
# Run tests
uv run pytest
# Run linting
uv run ruff check
uv run black --check .
Running Tests
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=src/url2md4ai
# Run specific test
uv run pytest tests/test_converter.py
Output Format
The tool generates clean, LLM-optimized markdown with:
- Preserved heading structure
- Clean link formatting
- Removed navigation, footer, and sidebar content
- Optimized whitespace and line breaks
- Title and metadata preservation
- Support for complex layouts
Example Output
# Page Title
Main content paragraph with [links](https://example.com) preserved.
## Section Heading
- List items preserved
- Proper formatting maintained
**Bold text** and *italic text* converted correctly.
> Blockquotes maintained
```code blocks preserved```
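Because heading structure is preserved, the output can be split into sections before it is passed to an LLM or indexed for retrieval. The following is a minimal sketch of that idea, not part of url2md4ai: `split_sections` is a hypothetical helper, and it treats any line starting with `#` characters as a heading, so headings inside fenced code blocks would be split too.

```python
from __future__ import annotations

import re

# ATX-style markdown heading: one to six '#' characters, then the title text
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split markdown into (heading, body) pairs in document order."""
    sections: list[tuple[str, str]] = []
    title = ""
    body: list[str] = []

    def flush() -> None:
        # Skip the empty preamble before the first heading
        if title or any(line.strip() for line in body):
            sections.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

Each pair can then be embedded or summarised on its own, with the heading kept as context for the body text.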
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Development Guidelines
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Code Quality
- Use `black` for code formatting
- Use `ruff` for linting
- Add type hints for all functions
- Write tests for new features
- Update documentation as needed
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Trafilatura for intelligent content extraction and web scraping
- Playwright for JavaScript rendering and dynamic content handling
- html2text for HTML to Markdown conversion
- Beautiful Soup for HTML parsing and content cleaning
- Click for the powerful CLI interface
- Loguru for elegant logging
Roadmap
- Support for more output formats (PDF, DOCX)
- Custom CSS selector filtering
- Integration with popular LLM APIs
- Web UI interface
- Plugin system for custom processors
- Support for authentication-required pages