Convert URLs to clean Markdown via a multi-tool fallback cascade
url22md
Convert HTTP(S) URLs to clean Markdown files. Tries up to six extraction tools in cascade order, scores each result, and keeps the first one that clears the quality threshold, or the best one if none does. Handles hundreds of URLs concurrently, tracks progress in a crash-safe JSONL report, and skips already-processed URLs on restart.
Install
pip install url22md
For full functionality (JS-rendered pages), also run:
playwright install chromium
crawl4ai-setup
Quick start
Single URL:
url22md --url "https://example.com/article"
Batch from file:
url22md --urls_path urls.txt --output_dir ./output
Pipe from stdin:
cat urls.txt | url22md --output_dir ./output
Force a specific tool:
url22md --url "https://spa-heavy-site.com" --tool 3
How it works
url22md tries up to six extraction tools in order. After each attempt it scores the Markdown output on length, headings, paragraph structure, links, and residual HTML. The first result scoring above the quality threshold (0.3) is accepted. If nothing meets the threshold, the best result is kept anyway.
| # | Tool | JS rendering | Speed | Needs |
|---|---|---|---|---|
| 1 | trafilatura | No | Fast | nothing |
| 2 | crawl4ai | Yes (headless browser) | Good | crawl4ai-setup |
| 3 | playwright + markdownify | Yes (real Chromium) | Good | playwright install chromium |
| 4 | firecrawl | Yes (cloud, anti-bot) | Good | FIRECRAWL_API_KEY |
| 5 | Jina Reader | Yes (cloud) | Fast | JINA_API_KEY |
| 6 | readability-lxml + markdownify | No | Fast | nothing |
Tools 1, 2, 3, and 6 run locally. Tools 4 and 5 call cloud APIs and require API keys.
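The accept-or-keep-best rule described above can be sketched roughly as follows. This is an illustration, not url22md's actual code: `run_cascade`, the `extractors` list of `(name, callable)` pairs, and the `score` callable are hypothetical names, and only the 0.3 threshold comes from the text.

```python
from typing import Callable, Optional

QUALITY_THRESHOLD = 0.3  # value stated in "How it works"

Extractor = Callable[[str], Optional[str]]  # hypothetical: URL in, Markdown or None out


def run_cascade(
    url: str,
    extractors: list[tuple[str, Extractor]],
    score: Callable[[str], float],
    threshold: float = QUALITY_THRESHOLD,
) -> Optional[tuple[str, str, float]]:
    """Return (tool_name, markdown, quality) for the chosen result, or None."""
    best: Optional[tuple[str, str, float]] = None
    for name, extract in extractors:
        try:
            markdown = extract(url)
        except Exception:
            continue  # a failing tool hands over to the next one
        if not markdown:
            continue
        quality = score(markdown)
        if quality > threshold:
            return (name, markdown, quality)  # first result above the threshold wins
        if best is None or quality > best[2]:
            best = (name, markdown, quality)  # remember the best sub-threshold result
    return best  # nothing cleared the threshold: keep the best attempt anyway
```

Returning early keeps the slower browser-based and cloud tools out of the picture whenever a fast local tool already produces acceptable output.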
CLI reference
url22md [flags]
Input (at least one required):
| Flag | Description |
|---|---|
| --url URL | Single URL to convert |
| --urls_path FILE | Text file with one URL per line |
| (stdin) | Pipe URLs, one per line |
Output:
| Flag | Default | Description |
|---|---|---|
| --output_dir DIR | . | Directory for .md files |
| --jsonl PATH | DIR/_url2md.jsonl | JSONL progress report path |
Extraction control:
| Flag | Default | Description |
|---|---|---|
| --tool N | cascade 1-6 | Force a specific tool (1-6) |
| --proxy | off | Route through Webshare proxy |
| --concurrency N | 5 | Max parallel URL conversions |
| --timeout N | 30 | Per-tool timeout in seconds |
Housekeeping:
| Flag | Description |
|---|---|
| --clean | Delete existing JSONL report before starting |
| --clean_all | Delete report and all .md files listed in it |
| --verbose | Debug-level logging to stderr |
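For readers wondering how --concurrency and --timeout might interact, here is a minimal asyncio sketch of bounded parallelism with a timeout around each call. It is an assumption about the general pattern, not the package's implementation: `convert_all` and `convert_one` are hypothetical names, and `convert_one` stands in for a single tool attempt on one URL.

```python
import asyncio


async def convert_all(urls, convert_one, concurrency=5, timeout=30.0):
    """Run convert_one(url) for every URL, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url):
        async with semaphore:  # caps the number of conversions in flight
            try:
                return url, await asyncio.wait_for(convert_one(url), timeout)
            except asyncio.TimeoutError:
                return url, None  # timed out; a cascade would move to the next tool

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(guarded(u) for u in urls))
```

The defaults of 5 and 30 mirror the flag defaults in the table above.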
JSONL report
Each processed URL appends one JSON line to the report immediately (crash-safe):
{"url": "https://example.com", "filename": "example-com.md", "tool": "trafilatura", "success": true, "quality": 0.7, "error": null, "timestamp": "2026-03-26T22:15:00+00:00"}
On the next run, URLs already in the report are skipped. Use --clean to start fresh.
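The append-then-skip behaviour can be reproduced with a few lines of standard-library code, which is handy if you want to inspect or post-process a report yourself. The helper names `load_processed_urls` and `append_record` are hypothetical; only the `url` field and the one-JSON-object-per-line layout come from the example record above.

```python
import json
from pathlib import Path


def load_processed_urls(report_path):
    """Collect the URLs already recorded in a JSONL report."""
    path = Path(report_path)
    if not path.exists():
        return set()
    seen = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                seen.add(json.loads(line)["url"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # tolerate a torn final line left by a crash
    return seen


def append_record(report_path, record):
    """Append one record as a single line and flush it right away."""
    with open(report_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
        fh.flush()
```

A restart would then filter its input with something like `todo = [u for u in urls if u not in load_processed_urls(report)]`.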
Python API
from url22md import run_conversion

summary = run_conversion(
    urls=["https://example.com", "https://docs.python.org/3/"],
    output_dir="./output",
    concurrency=10,
    tool=1,  # optional: force trafilatura only
    verbose=True,
)
print(summary)
# {"total": 2, "processed": 2, "skipped": 0, "succeeded": 2, "failed": 0}
Lower-level access:
import asyncio
from url22md.tools import extract_with_trafilatura, extract_with_playwright

async def main():
    result = await extract_with_trafilatura("https://example.com")
    if not result.success:
        result = await extract_with_playwright("https://example.com")
    print(result.markdown)

asyncio.run(main())
Quality scoring:
from url22md import assess_quality
score = assess_quality("# Title\n\nA paragraph.\n\nAnother paragraph.")
print(score) # 0.6
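To give a feel for what a score built from length, headings, paragraph structure, links, and residual HTML might look like, here is a toy heuristic. It is not `assess_quality`: the function name `sketch_quality`, the weights, and the regular expressions are all invented for illustration, so its numbers will not match the library's.

```python
import re


def sketch_quality(markdown: str) -> float:
    """Toy score in [0, 1]; weights are illustrative, not url22md's."""
    text = markdown.strip()
    if not text:
        return 0.0
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    score = 0.0
    score += 0.4 * min(len(text) / 2000, 1.0)  # length
    if re.search(r"^#{1,6} ", text, re.M):  # at least one ATX heading
        score += 0.2
    score += 0.2 * min(len(paragraphs) / 5, 1.0)  # paragraph structure
    if re.search(r"\[[^\]]+\]\([^)]+\)", text):  # at least one Markdown link
        score += 0.2
    html_tags = len(re.findall(r"</?[a-zA-Z][^>]*>", text))
    score -= min(0.3, 0.02 * html_tags)  # penalty for leftover HTML
    return max(0.0, min(1.0, score))
```

The point of the penalty term is that a long page full of unconverted tags should rank below a shorter but clean extraction.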
Proxy support
url22md supports Webshare proxies. Set these environment variables:
export WEBSHARE_PROXY_USER="your_user"
export WEBSHARE_PROXY_PASS="your_pass"
export WEBSHARE_DOMAIN_NAME="proxy.webshare.io"
export WEBSHARE_PROXY_PORT="80"
Then pass --proxy:
url22md --url "https://geo-restricted-site.com" --proxy
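If you want to reuse the same settings in your own scripts, the four variables can be combined into a standard proxy URL. This is a sketch under the assumption that the proxy speaks plain HTTP with basic credentials; `webshare_proxy_url` is a hypothetical helper, and how url22md itself assembles the value is not documented here.

```python
import os
from urllib.parse import quote

PROXY_VARS = (
    "WEBSHARE_PROXY_USER",
    "WEBSHARE_PROXY_PASS",
    "WEBSHARE_DOMAIN_NAME",
    "WEBSHARE_PROXY_PORT",
)


def webshare_proxy_url(env=None):
    """Build http://user:pass@host:port from the environment (assumed format)."""
    env = os.environ if env is None else env
    missing = [name for name in PROXY_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing proxy settings: " + ", ".join(missing))
    # Percent-encode credentials so characters like '@' or ':' stay unambiguous
    user = quote(env["WEBSHARE_PROXY_USER"], safe="")
    password = quote(env["WEBSHARE_PROXY_PASS"], safe="")
    host = env["WEBSHARE_DOMAIN_NAME"]
    port = env["WEBSHARE_PROXY_PORT"]
    return f"http://{user}:{password}@{host}:{port}"
```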
Cloud API keys
For tools 4 and 5, set the corresponding environment variable:
export FIRECRAWL_API_KEY="fc-..." # tool 4
export JINA_API_KEY="jina_..." # tool 5
Each of these tools is attempted only when its API key is set. Without the key, the cascade skips to the next tool.
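The key-gating rule amounts to filtering the tool list against the environment. The sketch below shows one way to express it; `available_tools` and `REQUIRED_ENV` are hypothetical names, while the tool numbers and variable names are taken from the table and exports above.

```python
import os

# Tool number -> environment variable it needs (from the tool table)
REQUIRED_ENV = {4: "FIRECRAWL_API_KEY", 5: "JINA_API_KEY"}


def available_tools(env=None, all_tools=range(1, 7)):
    """Return the tool numbers the cascade could attempt with this environment."""
    env = os.environ if env is None else env
    return [
        n for n in all_tools
        if n not in REQUIRED_ENV or env.get(REQUIRED_ENV[n])
    ]
```

With neither key set this yields the four local tools, so the cascade still works fully offline from the cloud services.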
Output structure
output_dir/
  example-com.md          # Markdown content
  docs-python-org-3.md    # Markdown content
  _url2md.jsonl           # Progress report
Filenames are generated from URLs using slugify + pathvalidate, producing filesystem-safe names like example-com-article-id-42.md.
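If you need to predict an output filename without installing anything, a standard-library approximation of the URL-to-slug step looks like this. It is not the package's code, which relies on slugify and pathvalidate and may differ on edge cases such as non-ASCII characters or very long URLs; `url_to_filename` and `max_length` are hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_to_filename(url: str, max_length: int = 200) -> str:
    """Approximate a filesystem-safe .md name for a URL (stdlib only)."""
    parts = urlsplit(url)
    raw = f"{parts.netloc}{parts.path}"
    if parts.query:
        raw += f"-{parts.query}"
    # Collapse every run of non-alphanumeric characters into one hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    slug = slug[:max_length].rstrip("-") or "index"
    return slug + ".md"


print(url_to_filename("https://docs.python.org/3/"))  # docs-python-org-3.md
```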
Development
git clone https://github.com/twardoch/url22md
cd url22md
uv pip install --system -e .
playwright install chromium
# Tests (48 unit tests, no network required)
uvx hatch test
# Lint
uvx ruff check src/url22md/ tests/
# Build
uvx hatch build
# Publish
uv publish
Versioning is derived from git tags via hatch-vcs. Tag v1.2.3 produces version 1.2.3.
License