Python package to convert HTTP(S) URLs to Markdown

url22md

Convert HTTP(S) URLs to clean Markdown files. url22md tries up to six extraction tools in cascade order, scores each result, and keeps the first one that clears the quality threshold, or the best one if none does. It handles hundreds of URLs concurrently, tracks progress in a crash-safe JSONL report, and skips already-processed URLs on restart.

Install

pip install url22md

For full functionality (JS-rendered pages), also run:

playwright install chromium
crawl4ai-setup

Quick start

Single URL:

url22md --url "https://example.com/article"

Batch from file:

url22md --urls_path urls.txt --output_dir ./output

Pipe from stdin:

cat urls.txt | url22md --output_dir ./output

Force a specific tool:

url22md --url "https://spa-heavy-site.com" --tool 3

How it works

url22md tries up to six extraction tools in order. After each attempt it scores the Markdown output on length, headings, paragraph structure, links, and residual HTML. The first result scoring above the quality threshold (0.3) is accepted. If nothing meets the threshold, the best result is kept anyway.

#  Tool                            JS rendering            Speed  Needs
1  trafilatura                     No                      Fast   nothing
2  crawl4ai                        Yes (headless browser)  Good   crawl4ai-setup
3  playwright + markdownify        Yes (real Chromium)     Good   playwright install chromium
4  firecrawl                       Yes (cloud, anti-bot)   Good   FIRECRAWL_API_KEY
5  Jina Reader                     Yes (cloud)             Fast   JINA_API_KEY
6  readability-lxml + markdownify  No                      Fast   nothing

Tools 1, 2, 3, and 6 run locally. Tools 4 and 5 call cloud APIs and require API keys.
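
The cascade logic described above can be sketched in a few lines. This is a minimal illustration, not url22md's internal code: `run_cascade`, the toy extractors, and `toy_score` are hypothetical names introduced here, and only the 0.3 threshold comes from the text.

```python
from typing import Callable, Optional

QUALITY_THRESHOLD = 0.3  # threshold stated in the text above

Extractor = Callable[[str], Optional[str]]


def run_cascade(
    url: str,
    extractors: list[tuple[str, Extractor]],
    score: Callable[[str], float],
    threshold: float = QUALITY_THRESHOLD,
) -> tuple[Optional[str], Optional[str], float]:
    """Return (tool_name, markdown, quality) for the accepted result."""
    best: tuple[Optional[str], Optional[str], float] = (None, None, -1.0)
    for name, extract in extractors:
        try:
            markdown = extract(url)
        except Exception:
            continue  # a failing tool hands over to the next one
        if not markdown:
            continue
        quality = score(markdown)
        if quality > threshold:
            return name, markdown, quality  # first result above the threshold wins
        if quality > best[2]:
            best = (name, markdown, quality)
    return best  # nothing passed: keep the best attempt anyway


# Toy extractors and scorer, for demonstration only (hypothetical).
def weak(url: str) -> str:
    return "short"


def strong(url: str) -> str:
    return "# Title\n\n" + "A paragraph of text. " * 20


def toy_score(markdown: str) -> float:
    return min(len(markdown) / 400, 1.0)


tool, text, quality = run_cascade(
    "https://example.com", [("weak", weak), ("strong", strong)], toy_score
)
print(tool, quality)
```

Returning early on the first passing result is what keeps the slower tools from running when a fast one already produced usable Markdown.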

CLI reference

url22md [flags]

Input (at least one required):

Flag              Description
--url URL         Single URL to convert
--urls_path FILE  Text file with one URL per line
(stdin)           Pipe URLs, one per line

Output:

Flag              Default            Description
--output_dir DIR  .                  Directory for .md files
--jsonl PATH      DIR/_url2md.jsonl  JSONL progress report path

Extraction control:

Flag             Default      Description
--tool N         cascade 1-6  Force a specific tool (1-6)
--proxy          off          Route through Webshare proxy
--concurrency N  5            Max parallel URL conversions
--timeout N      30           Per-tool timeout in seconds

Housekeeping:

Flag         Description
--clean      Delete existing JSONL report before starting
--clean_all  Delete report and all .md files listed in it
--verbose    Debug-level logging to stderr
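
To illustrate what --concurrency and --timeout control, here is a minimal asyncio sketch of bounded parallelism with a per-call timeout. It is an assumption about the general technique, not url22md's implementation; `convert_all` and `fake_convert` are hypothetical names, and only the defaults (5 and 30) come from the table above.

```python
import asyncio


async def convert_all(urls, convert_one, concurrency=5, timeout=30):
    """Run convert_one(url) for every URL, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url):
        async with semaphore:  # blocks while `concurrency` calls are in flight
            try:
                return url, await asyncio.wait_for(convert_one(url), timeout)
            except asyncio.TimeoutError:
                return url, None  # the call exceeded the timeout

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_convert(url):
    # Stand-in for a real extraction call (hypothetical).
    await asyncio.sleep(0.01)
    return f"# {url}"


results = asyncio.run(
    convert_all(
        ["https://a.example", "https://b.example"],
        fake_convert,
        concurrency=2,
        timeout=1,
    )
)
print(results)
```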

JSONL report

Each processed URL appends one JSON line to the report immediately (crash-safe):

{"url": "https://example.com", "filename": "example-com.md", "tool": "trafilatura", "success": true, "quality": 0.7, "error": null, "timestamp": "2026-03-26T22:15:00+00:00"}

On the next run, URLs already in the report are skipped. Use --clean to start fresh.
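
The append-and-skip behaviour can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `load_processed_urls` and `append_record` are hypothetical helpers, and tolerating a truncated final line is a design choice made here for crash safety.

```python
import json
import tempfile
from pathlib import Path


def load_processed_urls(report_path):
    """Collect the URLs already recorded in a JSONL report."""
    path = Path(report_path)
    if not path.exists():
        return set()
    seen = set()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                seen.add(json.loads(line)["url"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # tolerate a half-written line left by a crash
    return seen


def append_record(report_path, record):
    """Append one record as a single line and flush it right away."""
    with Path(report_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
        handle.flush()


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "_url2md.jsonl"
    append_record(report, {"url": "https://example.com", "success": True})
    done = load_processed_urls(report)
    todo = [
        u
        for u in ["https://example.com", "https://docs.python.org/3/"]
        if u not in done
    ]
    print(todo)
```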

Python API

from url22md import run_conversion

summary = run_conversion(
    urls=["https://example.com", "https://docs.python.org/3/"],
    output_dir="./output",
    concurrency=10,
    tool=1,          # optional: force trafilatura only
    verbose=True,
)
print(summary)
# {"total": 2, "processed": 2, "skipped": 0, "succeeded": 2, "failed": 0}

Lower-level access:

import asyncio
from url22md.tools import extract_with_trafilatura, extract_with_playwright

async def main():
    result = await extract_with_trafilatura("https://example.com")
    if not result.success:
        result = await extract_with_playwright("https://example.com")
    print(result.markdown)

asyncio.run(main())

Quality scoring:

from url22md import assess_quality

score = assess_quality("# Title\n\nA paragraph.\n\nAnother paragraph.")
print(score)  # 0.6
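
For intuition only, a scorer over the five signals named earlier (length, headings, paragraph structure, links, residual HTML) might look like the sketch below. The equal weights, the cut-offs, and the name `toy_quality` are all assumptions made for this example; the actual formula used by assess_quality is not documented here and may differ.

```python
import re


def toy_quality(markdown: str) -> float:
    """Score Markdown from 0.0 to 1.0 on five equally weighted signals."""
    text = markdown.strip()
    if not text:
        return 0.0
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    signals = [
        # Length: saturates at 500 characters (arbitrary cut-off).
        min(len(text) / 500, 1.0),
        # Headings: at least one ATX heading line.
        1.0 if re.search(r"^#{1,6} ", text, re.M) else 0.0,
        # Paragraph structure: saturates at three blank-line-separated blocks.
        min(len(paragraphs) / 3, 1.0),
        # Links: at least one inline Markdown link.
        1.0 if re.search(r"\[[^\]]+\]\([^)]+\)", text) else 0.0,
        # Residual HTML: penalise any leftover tag.
        0.0 if re.search(r"</?[a-zA-Z][^>]*>", text) else 1.0,
    ]
    return round(sum(signals) / len(signals), 2)


clean = "# Title\n\nSee [the docs](https://example.com).\n\nAnother paragraph."
print(toy_quality(clean) > toy_quality("<div>leftover markup</div>"))
```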

Proxy support

url22md supports Webshare proxies. Set these environment variables:

export WEBSHARE_PROXY_USER="your_user"
export WEBSHARE_PROXY_PASS="your_pass"
export WEBSHARE_DOMAIN_NAME="proxy.webshare.io"
export WEBSHARE_PROXY_PORT="80"

Then pass --proxy:

url22md --url "https://geo-restricted-site.com" --proxy
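
How url22md assembles the proxy address from these variables is not stated, so the following is only a sketch of one common approach: building a standard http://user:pass@host:port URL. The function name `proxy_url_from_env` is hypothetical.

```python
import os
from urllib.parse import quote

PROXY_VARS = (
    "WEBSHARE_PROXY_USER",
    "WEBSHARE_PROXY_PASS",
    "WEBSHARE_DOMAIN_NAME",
    "WEBSHARE_PROXY_PORT",
)


def proxy_url_from_env(env=os.environ):
    """Build a proxy URL, or return None if any variable is missing or empty."""
    values = [env.get(name) for name in PROXY_VARS]
    if not all(values):
        return None
    user, password, host, port = values
    # Percent-encode credentials so characters like "@" or ":" cannot break the URL.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


example_env = {
    "WEBSHARE_PROXY_USER": "your_user",
    "WEBSHARE_PROXY_PASS": "p@ss",
    "WEBSHARE_DOMAIN_NAME": "proxy.webshare.io",
    "WEBSHARE_PROXY_PORT": "80",
}
print(proxy_url_from_env(example_env))
```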

Cloud API keys

For tools 4 and 5, set the corresponding environment variable:

export FIRECRAWL_API_KEY="fc-..."    # tool 4
export JINA_API_KEY="jina_..."       # tool 5

Each of these tools is attempted only when its API key is set. If a key is missing, the cascade skips that tool and moves on to the next one.
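
The key check can be pictured as a filter over the cascade order. This is a hedged sketch: the tool labels and variable names are taken from the text, while `usable_tools` and the data layout are hypothetical.

```python
import os

# Cascade order and key requirements as listed in this document.
CASCADE = [
    "trafilatura",
    "crawl4ai",
    "playwright",
    "firecrawl",
    "Jina Reader",
    "readability-lxml",
]
CLOUD_TOOL_KEYS = {
    "firecrawl": "FIRECRAWL_API_KEY",
    "Jina Reader": "JINA_API_KEY",
}


def usable_tools(env=os.environ):
    """Return the cascade order with keyless cloud tools left out."""
    return [
        tool
        for tool in CASCADE
        if tool not in CLOUD_TOOL_KEYS or env.get(CLOUD_TOOL_KEYS[tool])
    ]


print(usable_tools({"JINA_API_KEY": "jina_..."}))
```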

Output structure

output_dir/
  example-com.md                     # Markdown content
  docs-python-org-3.md               # Markdown content
  _url2md.jsonl                      # Progress report

Filenames are generated from URLs using slugify + pathvalidate, producing filesystem-safe names like example-com-article-id-42.md.
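
As a rough idea of the URL-to-filename mapping, here is a standard-library approximation. It does not use slugify or pathvalidate and is not the package's code; `url_to_filename`, the length limit, and the sample URL with `?id=42` are assumptions chosen to reproduce the names shown above.

```python
import re
from urllib.parse import urlsplit


def url_to_filename(url: str, max_length: int = 200) -> str:
    """Turn a URL into a lowercase, hyphen-separated .md filename."""
    parts = urlsplit(url)
    raw = f"{parts.netloc}{parts.path}"
    if parts.query:
        raw += f"-{parts.query}"
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    return (slug[:max_length].rstrip("-") or "index") + ".md"


for sample in (
    "https://example.com",
    "https://docs.python.org/3/",
    "https://example.com/article?id=42",
):
    print(url_to_filename(sample))
```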

Development

git clone https://github.com/twardoch/url22md
cd url22md
uv pip install --system -e .
playwright install chromium

# Tests (48 unit tests, no network required)
uvx hatch test

# Lint
uvx ruff check src/url22md/ tests/

# Build
uvx hatch build

# Publish
uv publish

Versioning is derived from git tags via hatch-vcs. Tag v1.2.3 produces version 1.2.3.

License

Apache 2.0
