# url2md4ai

**Lean Python tool for extracting clean, LLM-optimized markdown from web pages.**

Perfect for AI applications that need high-quality text extraction from both static and dynamic web content. url2md4ai combines Playwright for JavaScript rendering with Trafilatura for intelligent content extraction, delivering markdown specifically optimized for LLM processing and information extraction.
## Why url2md4ai?
Traditional tools extract everything: ads, cookie banners, navigation menus, social media widgets...
url2md4ai extracts only what matters: clean, structured content ready for LLM processing.
```bash
# Example: extract a job posting from the Satispay careers page
url2md4ai convert "https://www.satispay.com/careers/job-posting" --show-metadata

# Result: 97% noise reduction (from 51KB to 9KB)
# ✅ Clean job title, description, requirements, and benefits
# ❌ No cookie banners, ads, or navigation clutter
```
Perfect for:

- AI content analysis workflows
- LLM-based information extraction
- Web scraping for research and analysis
- Content preprocessing for RAG systems
- Automated content monitoring
## Features

### LLM-Optimized Text Extraction

- **Smart Content Extraction**: powered by Trafilatura for intelligent text extraction
- **Dynamic Content Support**: full JavaScript rendering with Playwright for SPAs and dynamic sites
- **Clean Output**: removes ads, cookie banners, navigation, and other noise, leaving pure content
- **Maximum Information Density**: markdown specifically designed for LLM processing

### Lean & Efficient

- **Focused Purpose**: built specifically for AI/LLM text extraction workflows
- **Fast Processing**: optional non-JavaScript mode for static content (3x faster)
- **CLI-First**: simple command-line interface for batch processing and automation
- **Python API**: clean programmatic access for integration into AI pipelines

### Production Ready

- **Smart Filenames**: unique, deterministic filenames generated from URL hashes
- **Batch Processing**: parallel processing support for multiple URLs
- **Configurable**: extensive configuration options for different content types
- **Reliable**: built-in retry logic and error handling
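Deterministic hash-based filenames make re-runs idempotent: the same URL always maps to the same output file. A minimal sketch of the idea (the function name, hash algorithm, and digest length here are illustrative assumptions, not necessarily what `URLHasher` actually uses):

```python
import hashlib

def url_to_filename(url: str, length: int = 16, ext: str = ".md") -> str:
    """Map a URL to a short, deterministic filename via SHA-256."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest[:length] + ext
```

Because the mapping is a pure function of the URL, a batch run can safely skip URLs whose output file already exists.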
## Quick Start

### Using uv (Recommended)

```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai
uv sync

# Install Playwright browsers
uv run playwright install chromium

# Convert your first URL
uv run url2md4ai convert "https://example.com"
```
### Using pip

```bash
pip install url2md4ai
playwright install chromium
url2md4ai convert "https://example.com"
```
### Using Docker

```bash
# Build the image
docker build -t url2md4ai .

# Run with URL conversion
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://example.com"
```
## Usage

### CLI Commands

#### Basic Conversion

```bash
# Convert a single URL (with metadata)
url2md4ai convert "https://example.com" --show-metadata

# Convert with a custom output file
url2md4ai convert "https://example.com" -o my_page.md

# Convert without JavaScript (3x faster for static content)
url2md4ai convert "https://example.com" --no-js

# Raw extraction (no LLM optimization)
url2md4ai convert "https://example.com" --raw

# Get both HTML and Markdown
url2md4ai convert "https://example.com" --raw --save-html --output-dir raw_content  # raw HTML
url2md4ai convert "https://example.com" --clean --output-dir clean_content          # clean markdown
```
#### Batch Processing

```bash
# Convert multiple URLs with parallel processing
url2md4ai batch "https://site1.com" "https://site2.com" "https://site3.com" --concurrency 5

# Continue processing even if some URLs fail
url2md4ai batch "https://site1.com" "https://site2.com" --continue-on-error

# Custom output directory
url2md4ai batch "https://example.com" -d /path/to/output
```
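Under the hood, `--concurrency` caps how many URLs are in flight at once. The usual pattern is a semaphore-bounded `asyncio.gather`; here is a hedged sketch of that pattern (the function name `gather_bounded` and the error-handling details are illustrative, not the tool's actual internals):

```python
import asyncio

async def gather_bounded(coros, concurrency: int = 5, continue_on_error: bool = True):
    """Run coroutines with at most `concurrency` executing at any moment."""
    sem = asyncio.Semaphore(concurrency)

    async def run(coro):
        async with sem:  # blocks while `concurrency` tasks are already running
            return await coro

    # return_exceptions=True mirrors --continue-on-error: a failed fetch becomes
    # an exception object in the result list instead of aborting the whole batch.
    return await asyncio.gather(
        *(run(c) for c in coros),
        return_exceptions=continue_on_error,
    )
```

Results come back in input order, so failures can be matched to their URLs by index afterwards.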
#### Preview and Utilities

```bash
# Preview conversion without saving
url2md4ai preview "https://example.com" --show-content

# Test different extraction methods
url2md4ai test-extraction "https://example.com" --method both --show-diff

# Generate the hash filename for a URL
url2md4ai hash "https://example.com"

# Show current configuration
url2md4ai config-info --format json
```
### Python API

```python
from url2md4ai import URLToMarkdownConverter, Config

# Initialize converter
config = Config.from_env()
converter = URLToMarkdownConverter(config)

# Convert URL synchronously (perfect for LLM pipelines)
result = converter.convert_url_sync("https://example.com")

if result.success:
    print(f"Title: {result.title}")
    print(f"Saved as: {result.filename}")
    print(f"Size: {result.file_size:,} characters")
    print(f"Method: {result.extraction_method}")
    print(f"Processing time: {result.processing_time:.2f}s")

    # Use the extracted content for LLM processing
    llm_ready_content = result.markdown
    print("LLM-ready content extracted successfully!")
else:
    print(f"Error: {result.error}")

# Convert URL asynchronously
import asyncio

async def convert_url():
    result = await converter.convert_url("https://example.com")
    return result

result = asyncio.run(convert_url())
```
```python
# Get both HTML and Markdown from a URL
import asyncio

from url2md4ai import URLToMarkdownConverter, Config

async def get_html_and_markdown():
    # Initialize converter with raw HTML options
    config = Config(
        clean_content=False,          # Get raw HTML
        llm_optimized=False,          # No extra processing
        wait_for_network_idle=True,   # Wait for dynamic content
        page_wait_timeout=2000,       # Wait 2s for dynamic content
    )
    converter = URLToMarkdownConverter(config)

    # Get the raw HTML first
    result = await converter.convert_url(
        "https://example.com",
        save_to_file=False,  # Don't save to file
    )
    raw_html = result.html

    # Now get clean markdown with optimizations
    config.clean_content = True
    config.llm_optimized = True
    converter = URLToMarkdownConverter(config)
    result = await converter.convert_url(
        "https://example.com",
        save_to_file=True,  # Save markdown to file
    )
    clean_markdown = result.markdown

    return {
        "html": raw_html,
        "markdown": clean_markdown,
        "title": result.title,
        "metadata": result.metadata,
    }

# Use the function
result = asyncio.run(get_html_and_markdown())
print(f"HTML size: {len(result['html']):,} characters")
print(f"Markdown size: {len(result['markdown']):,} characters")
print(f"Title: {result['title']}")
```
#### Advanced Usage

```python
import asyncio

from url2md4ai import URLToMarkdownConverter, Config, URLHasher

# Custom configuration for specific content types
config = Config(
    timeout=60,
    wait_for_network_idle=True,  # Wait for dynamic content
    page_wait_timeout=2000,      # Wait 2s for dynamic content
    clean_content=True,          # Remove ads/banners
    llm_optimized=True,          # Optimize for LLM processing
    remove_cookie_banners=True,
    remove_navigation=True,
    remove_ads=True,
    remove_social_media=True,
    remove_comments=True,
    output_dir="ai_content",
    user_agent="MyAI/1.0",
)
converter = URLToMarkdownConverter(config)

async def extract():
    # Convert with maximum cleaning for LLM processing
    result = await converter.convert_url(
        url="https://example.com",
        use_trafilatura=True,     # Use intelligent extraction
        use_javascript=True,      # Handle dynamic content
        favor_precision=True,     # Prefer precision over recall
        include_tables=True,      # Include table content
        include_images=False,     # Exclude image references
        include_formatting=True,  # Preserve text formatting
    )
    if result.success:
        # Ready for feeding into LLMs
        clean_content = result.markdown
        metadata = result.metadata
        print(f"Extraction quality: {result.extraction_method}")
        print(f"Content size: {result.file_size:,} chars")
        print("Cleaned and ready for LLM processing!")

asyncio.run(extract())

# Generate deterministic filenames
hash_value = URLHasher.generate_hash("https://example.com")
filename = URLHasher.generate_filename("https://example.com")
print(f"Hash: {hash_value}, Filename: {filename}")
```
## Extraction Quality Examples

### Before vs After: Real-World Results

```bash
# Complex job posting with cookie banners and ads
url2md4ai convert "https://company.com/careers/position" --show-metadata
```

**Before (raw HTML): 51KB, 797 lines**

- ❌ Cookie consent banners
- ❌ Website navigation
- ❌ Social media widgets
- ❌ Advertising content
- ❌ Footer links and legal text

**After (url2md4ai): 9KB, 69 lines**

- ✅ Job title and description
- ✅ Key requirements
- ✅ Company benefits
- ✅ Application process
- ✅ 97% noise reduction!
### Content Types Optimized for LLM

| Content Type | Extraction Quality | Best Settings |
|---|---|---|
| News Articles | ⭐⭐⭐⭐⭐ | `--no-js` (faster) |
| Job Postings | ⭐⭐⭐⭐⭐ | `--force-js` (complete) |
| Product Pages | ⭐⭐⭐⭐ | `--clean` (essential) |
| Documentation | ⭐⭐⭐⭐⭐ | `--raw` (preserve structure) |
| Blog Posts | ⭐⭐⭐⭐⭐ | default settings |
| Social Media | ⭐⭐⭐ | `--force-js` required |
## Configuration

### Environment Variables

```bash
# Content extraction settings
export URL2MD_CLEAN_CONTENT=true
export URL2MD_LLM_OPTIMIZED=true
export URL2MD_USE_TRAFILATURA=true

# Dynamic content settings
export URL2MD_WAIT_NETWORK=true
export URL2MD_PAGE_TIMEOUT=2000
export URL2MD_HEADLESS=true

# Content filtering
export URL2MD_REMOVE_COOKIES=true
export URL2MD_REMOVE_NAV=true
export URL2MD_REMOVE_ADS=true
export URL2MD_REMOVE_SOCIAL=true
export URL2MD_REMOVE_COMMENTS=true

# Advanced settings
export URL2MD_FAVOR_PRECISION=true
export URL2MD_INCLUDE_TABLES=true
export URL2MD_INCLUDE_IMAGES=false
export URL2MD_INCLUDE_FORMATTING=true

# Output settings
export URL2MD_OUTPUT_DIR="output"
export URL2MD_USE_HASH_FILENAMES=true

# Performance & reliability
export URL2MD_TIMEOUT=30
export URL2MD_MAX_RETRIES=3
export URL2MD_USER_AGENT="url2md4ai/1.0"
```
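These variables are what `Config.from_env()` reads. For intuition, here is a minimal sketch of how this kind of env-to-config mapping typically works (the helper names and the subset of fields shown are illustrative assumptions, not the library's actual code):

```python
import os
from dataclasses import dataclass

def _env_bool(name: str, default: bool) -> bool:
    # Treat "true"/"1"/"yes" (case-insensitive) as truthy.
    return os.environ.get(name, str(default)).strip().lower() in {"true", "1", "yes"}

@dataclass
class EnvConfig:
    clean_content: bool = True
    llm_optimized: bool = True
    timeout: int = 30
    output_dir: str = "output"

    @classmethod
    def from_env(cls) -> "EnvConfig":
        # Every field falls back to its default when the variable is unset.
        return cls(
            clean_content=_env_bool("URL2MD_CLEAN_CONTENT", True),
            llm_optimized=_env_bool("URL2MD_LLM_OPTIMIZED", True),
            timeout=int(os.environ.get("URL2MD_TIMEOUT", "30")),
            output_dir=os.environ.get("URL2MD_OUTPUT_DIR", "output"),
        )
```

Because every field has a default, an empty environment still yields a usable configuration.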
### Configuration Options

| Option | Default | Description |
|---|---|---|
| **Content Extraction** | | |
| `clean_content` | `true` | Remove ads, banners, navigation |
| `llm_optimized` | `true` | Post-process for LLM consumption |
| `use_trafilatura` | `true` | Use intelligent text extraction |
| **Dynamic Content** | | |
| `wait_for_network_idle` | `true` | Wait for network activity to finish |
| `page_wait_timeout` | `2000` | Wait time for dynamic content (ms) |
| `browser_headless` | `true` | Run browser in headless mode |
| **Content Filtering** | | |
| `remove_cookie_banners` | `true` | Remove cookie consent UI |
| `remove_navigation` | `true` | Remove nav menus and headers |
| `remove_ads` | `true` | Remove advertising content |
| `remove_social_media` | `true` | Remove social sharing widgets |
| `remove_comments` | `true` | Remove user comments |
| **Advanced Settings** | | |
| `favor_precision` | `true` | Prefer precision over recall |
| `include_tables` | `true` | Include table content |
| `include_images` | `false` | Include image references |
| `include_formatting` | `true` | Preserve text formatting |
| **Output Settings** | | |
| `output_dir` | `"output"` | Default output directory |
| `use_hash_filenames` | `true` | Generate deterministic filenames |
## Docker Usage

See DOCKER_USAGE.md for comprehensive Docker usage examples and troubleshooting.
### Quick Start with Docker

```bash
# Build the image
docker build -t url2md4ai .

# Convert a single URL with LLM optimization
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://example.com" --show-metadata

# Convert dynamic content with JavaScript rendering
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  convert "https://spa-app.com" --force-js --show-metadata

# Batch processing with parallel workers
docker run --rm \
  -v $(pwd)/output:/app/output \
  url2md4ai \
  batch "https://site1.com" "https://site2.com" --concurrency 5 --show-metadata
```
### Using Docker Compose (Recommended)

```bash
# Start with compose for easier management
docker compose run --rm url2md4ai convert "https://example.com" --show-metadata

# Development mode with full environment
docker compose run --rm dev

# Batch processing example
docker compose run --rm url2md4ai \
  batch "https://news.site.com/article1" "https://blog.site.com/post2" \
  --concurrency 3 --continue-on-error --show-metadata
```
### Custom Configuration

```bash
# Override LLM optimization settings
docker run --rm \
  -v $(pwd)/output:/app/output \
  -e URL2MD_CLEAN_CONTENT=false \
  -e URL2MD_LLM_OPTIMIZED=false \
  url2md4ai \
  convert "https://example.com" --raw

# Disable JavaScript for faster processing
docker run --rm \
  -v $(pwd)/output:/app/output \
  -e URL2MD_JAVASCRIPT=false \
  url2md4ai \
  convert "https://static-site.com" --no-js
```
## Development

### Setup Development Environment

```bash
# Clone the repository
git clone https://github.com/mazzasaverio/url2md4ai.git
cd url2md4ai

# Install with uv
uv sync

# Install Playwright browsers
uv run playwright install

# Run tests
uv run pytest

# Run linting
uv run ruff check
uv run black --check .
```
### Running Tests

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/url2md4ai

# Run a specific test
uv run pytest tests/test_converter.py
```
## Output Format

The tool generates clean, LLM-optimized markdown with:

- ✅ Preserved heading structure
- ✅ Clean link formatting
- ✅ Navigation, footer, and sidebar content removed
- ✅ Optimized whitespace and line breaks
- ✅ Title and metadata preservation
- ✅ Support for complex layouts
### Example Output

```markdown
# Page Title

Main content paragraph with [links](https://example.com) preserved.

## Section Heading

- List items preserved
- Proper formatting maintained

**Bold text** and *italic text* converted correctly.

> Blockquotes maintained

`code blocks preserved`
```
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

### Development Guidelines

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Code Quality

- Use `black` for code formatting
- Use `ruff` for linting
- Add type hints for all functions
- Write tests for new features
- Update documentation as needed
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Trafilatura for intelligent content extraction and web scraping
- Playwright for JavaScript rendering and dynamic content handling
- html2text for HTML to Markdown conversion
- Beautiful Soup for HTML parsing and content cleaning
- Click for the powerful CLI interface
- Loguru for elegant logging
## Roadmap
- Support for more output formats (PDF, DOCX)
- Custom CSS selector filtering
- Integration with popular LLM APIs
- Web UI interface
- Plugin system for custom processors
- Support for authentication-required pages