Convert URLs to clean Markdown via a multi-tool fallback cascade

url22md

Convert HTTP(S) URLs to clean Markdown files. url22md runs nine extraction tools with directed fallback chains, scores each result for quality, and keeps the best one. It handles hundreds of URLs concurrently, tracks progress in a crash-safe JSONL report, and skips already-processed URLs on restart.

Install

pip install url22md

For full functionality (JS-rendered pages), also run:

playwright install chromium
crawl4ai-setup

Quick start

Single URL:

url22md --url "https://example.com/article"

Batch from file:

url22md --urls_path urls.txt --output_dir ./output

Pipe from stdin:

cat urls.txt | url22md --output_dir ./output

Force a specific tool:

url22md --url "https://spa-heavy-site.com" --tool 3

How it works

url22md tries extraction tools along a directed fallback chain: each tool names a specific next tool to try on failure, rather than following a simple linear sequence. After each attempt it scores the resulting Markdown on prose word count and sentence structure, with penalties for CSS/JS boilerplate. The first result scoring above the quality threshold (0.5) is accepted. If nothing meets the threshold, the best-scoring result is kept anyway.
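The scoring heuristic can be sketched roughly as follows. This is illustrative only: the patterns, weights, and caps below are assumptions, not url22md's actual assess_quality() implementation.

```python
import re

def score_markdown(md: str) -> float:
    """Rough quality heuristic: reward prose, penalise CSS/JS boilerplate.

    Illustrative sketch -- weights and regexes are assumed, not url22md's
    real scoring code.
    """
    words = re.findall(r"[A-Za-z']+", md)
    # Sentence-ending punctuation followed by whitespace or end-of-text.
    sentences = re.findall(r"[.!?](?:\s|$)", md)
    # Lines that look like leaked stylesheets/scripts: CSS selectors,
    # at-rules, or statement-closing "; }" fragments.
    boilerplate = re.findall(r"(?m)^\s*[.@{]|;\s*}", md)

    score = 0.0
    score += min(len(words) / 200, 0.5)         # prose volume, capped
    score += min(len(sentences) / 10, 0.3)      # sentence structure
    score -= min(len(boilerplate) * 0.05, 0.5)  # CSS/JS penalty
    return max(0.0, min(1.0, score))
```

A page of real prose scores well above leaked stylesheet fragments under any reasonable weighting like this one.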

#  Tool                                JS   Speed  Fallback  Needs
1  trafilatura                         No   Fast   → 5       nothing
2  trafilatura (strict)                No   Fast   none      nothing
3  readability-lxml + markdownify      No   Fast   → 5       nothing
4  readability + markdownify (strict)  No   Fast   none      nothing
5  playwright + markdownify            Yes  Good   → 6       playwright install chromium
6  firecrawl                           Yes  Good   → 7       FIRECRAWL_API_KEY
7  Jina Reader                         Yes  Fast   → 2       JINA_API_KEY
8  crawl4ai                            Yes  Good   → 6       crawl4ai-setup
9  crawl4ai (fit)                      Yes  Good   → 5       crawl4ai-setup

Tools 1-5 and 8-9 run locally. Tools 6 and 7 call cloud APIs and require API keys. Tool 8 uses crawl4ai with stealth mode and anti-bot features (magic, user simulation, navigator override). Tool 9 adds a PruningContentFilter for article-only fit_markdown output. Starting from tool 1, the default cascade runs 1 → 5 → 6 → 7 → 2, then stops.
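The directed chain amounts to a lookup table of "next tool on failure" plus a loop. A minimal sketch of that control flow (the chain edges mirror the table above; the function names and threshold handling are stand-ins, not url22md's internals):

```python
# Next tool to try when the current one fails or scores too low;
# a missing key means the chain stops. Edges taken from the table above.
FALLBACK = {1: 5, 3: 5, 5: 6, 6: 7, 7: 2, 8: 6, 9: 5}
QUALITY_THRESHOLD = 0.5

def run_cascade(url, extractors, score, start=1):
    """Walk the fallback chain; return (tool, markdown, quality).

    `extractors` maps tool number -> callable returning Markdown or None.
    The first result at or above the threshold wins; otherwise the
    best-scoring result seen along the chain is returned.
    """
    best = None
    tool = start
    while tool is not None:
        md = extractors[tool](url)            # None signals failure
        if md is not None:
            q = score(md)
            if q >= QUALITY_THRESHOLD:
                return tool, md, q            # good enough: stop here
            if best is None or q > best[2]:
                best = (tool, md, q)          # remember the best so far
        tool = FALLBACK.get(tool)             # follow the directed edge
    return best                               # nothing met the threshold
```

With start=1 this visits 1 → 5 → 6 → 7 → 2 and stops, matching the default cascade described above.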

CLI reference

url22md [flags]

Input (at least one required):

Flag              Description
--url URL         Single URL to convert
--urls_path FILE  Text file with one URL per line
(stdin)           Pipe URLs, one per line

Output:

Flag              Default            Description
--format FMT      md                 Output format (see below)
--output_dir DIR  .                  Directory for output files
--jsonl PATH      DIR/_url2md.jsonl  JSONL progress report path

Output formats:

Format  Behaviour
md      One .md file per URL + JSONL report (default)
all     All results in a single combined.md (HR + H1 URL separators) + JSONL report
json    JSONL report with markdown content included (no .md files)
-       JSONL with markdown to stdout (no files written)
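For --format all, the combined file separates each result with a horizontal rule and an H1 carrying the source URL. A minimal sketch of that layout (assumed from the description above; the exact separators url22md emits may differ):

```python
def combine(results):
    """Join per-URL Markdown into one document, each section opened by
    an HR and an H1 with the source URL (layout assumed, not verified
    against url22md's actual combined.md output)."""
    parts = []
    for url, markdown in results:
        parts.append(f"---\n\n# {url}\n\n{markdown.strip()}\n")
    return "\n".join(parts)
```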

Extraction control:

Flag         Default      Description
--tool N     1 (cascade)  Start from tool N (1-9), then follow its fallback chain
--proxy      off          Route requests through a Webshare proxy
--Jobs N     5            Max parallel URL conversions
--Timeout N  30           Per-tool timeout in seconds

Housekeeping:

Flag         Description
--Force      Re-process URLs even if already in the JSONL report
--minify     Article-only extraction: readability for tools 1-4, pruning for crawl4ai
--clean      Delete the existing JSONL report before starting
--Clean_all  Delete the report and all output files listed in it
--verbose    Debug-level logging to stderr

JSONL report

Each record contains:

{"url": "https://example.com", "filename": "example-com.md", "tool": "trafilatura", "success": true, "quality": 0.7, "error": null, "timestamp": "2026-03-26T22:15:00+00:00"}

With --format json or --format -, a "markdown" key with the full content is included. On the next run, URLs already in the report are skipped. Use --clean to start fresh.

Python API

run_conversion() returns a list of result records. Without a format argument, no files are written:

from url22md import run_conversion

records = run_conversion(
    urls=["https://example.com", "https://docs.python.org/3/"],
    concurrency=10,
)
for rec in records:
    print(rec["url"], rec["tool"], rec["quality"])
    print(rec["markdown"][:200])

Each record dict has: url, filename, tool, success, quality, error, markdown, timestamp.

To also write files, pass format:

records = run_conversion(
    urls=["https://example.com"],
    output_dir="./output",
    format="md",       # writes individual .md files + JSONL report
)

Lower-level async access:

import asyncio
from url22md import convert_single_url

async def main():
    result = await convert_single_url("https://example.com", proxy_url=None)
    print(result.tool_name, result.quality_score)
    print(result.markdown)

asyncio.run(main())

Individual tool functions:

import asyncio
from url22md.tools import extract_with_trafilatura, extract_with_playwright

async def main():
    result = await extract_with_trafilatura("https://example.com")
    if not result.success:
        result = await extract_with_playwright("https://example.com")
    print(result.markdown)

asyncio.run(main())

Quality scoring:

from url22md import assess_quality

score = assess_quality("# Title\n\nA paragraph.\n\nAnother paragraph.")
print(score)  # 0.6

Proxy support

url22md supports Webshare proxies. Set these environment variables:

export WEBSHARE_PROXY_USER="your_user"
export WEBSHARE_PROXY_PASS="your_pass"
export WEBSHARE_DOMAIN_NAME="proxy.webshare.io"
export WEBSHARE_PROXY_PORT="80"

Then pass --proxy:

url22md --url "https://geo-restricted-site.com" --proxy

Cloud API keys

For tools 6 and 7, set the corresponding environment variable:

export FIRECRAWL_API_KEY="fc-..."    # tool 6
export JINA_API_KEY="jina_..."       # tool 7

These tools are only attempted when their API key is present. Without them, the cascade skips to the next tool.

Output structure

output_dir/
  example-com.md                     # Markdown content
  docs-python-org-3.md               # Markdown content
  _url2md.jsonl                      # Progress report

Filenames are generated from URLs using slugify + pathvalidate, producing filesystem-safe names like example-com-article-id-42.md.
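A stdlib-only approximation of the same idea, for readers who want to predict output names (illustrative; url22md's actual slugify + pathvalidate rules may differ in edge cases):

```python
import re
from urllib.parse import urlsplit

def url_to_filename(url: str) -> str:
    """Turn a URL into a filesystem-safe .md name, e.g.
    https://example.com/article?id=42 -> example-com-article-id-42.md
    (stdlib sketch approximating slugify + pathvalidate)."""
    parts = urlsplit(url)
    raw = f"{parts.netloc}{parts.path}-{parts.query}"
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
    return f"{slug[:100]}.md"
```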

Development

git clone https://github.com/twardoch/url22md
cd url22md
uv pip install --system -e .
playwright install chromium

# Tests (48 unit tests, no network required)
uvx hatch test

# Lint
uvx ruff check src/url22md/ tests/

# Build
uvx hatch build

# Publish
uv publish

Versioning is derived from git tags via hatch-vcs. Tag v1.2.3 produces version 1.2.3.

License

Apache 2.0

