HTML→Markdown extractor optimized for LLM training corpora — zero noise artifacts, integrated quality scoring, low-value stub detection.

RDTextract

HTML→Markdown extractor built for AI / LLM training corpora — every byte should carry signal, not boilerplate.

Language scope (current): filters, paywall markers, and the is_low_value_stub() heuristic are tuned for French content (the corpus this library was extracted from is FR-only). Core extraction is language-agnostic; only the stub detector is FR-specific. Multi-language support is on the roadmap.

Zero noise artifacts — no double-bullets, no orphan punctuation, no link dumps, no widget repetition.
Integrated quality scoring — is_low_value_stub() filters paywalls, login walls, skip-link stubs, empty pages.
Meta fallback — degrades gracefully on SPA / React pages where the body is empty (extracts <title> + meta description instead of returning nothing).
One dependency: beautifulsoup4.
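
To make the meta fallback idea concrete, here is a minimal sketch using only the standard library. It is not RDTextract's implementation: `MetaFallbackParser`, `meta_fallback`, and the `# title` plus description output layout are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class MetaFallbackParser(HTMLParser):
    """Collect the <title> text and the meta description from a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []
        self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def meta_fallback(html: str) -> str:
    """Build a small Markdown stub from head metadata (hypothetical layout)."""
    parser = MetaFallbackParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    parts = []
    if title:
        parts.append(f"# {title}")
    if parser.description:
        parts.append(parser.description)
    # Empty string when the page has neither a title nor a description.
    return "\n\n".join(parts)
```

A caller would presumably try normal body extraction first and fall back to something like this only when the body yields no text.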

Install

pip install RDTextract

Quick start

import rdtextract

# Read the page explicitly as UTF-8 and close the file afterwards
with open("page.html", encoding="utf-8") as f:
    html = f.read()

# One-shot
markdown = rdtextract.extract(html)

# Or two-step (re-use cleaned HTML for caching, debugging, etc.)
cleaned = rdtextract.clean_html(html)
markdown = rdtextract.to_markdown(cleaned)

# Filter low-value pages before writing to your corpus
if not rdtextract.is_low_value_stub(markdown):
    save(markdown)  # save() stands in for your own corpus writer

Note: the PyPI distribution is RDTextract; the Python module is rdtextract (PEP 8 lowercase).

Why another HTML→Markdown lib?

Existing tools target human readability (newsletters, archives). RDTextract targets LLM training data: every byte should carry signal.

Benchmark on Tranco top-1000 .fr homepages (672 pages successfully fetched and processed), measured against a corpus quality scorer (lower artifact counts = cleaner output):

Extractor	Quality score (corrected)	Double bullets	Orphan punctuation	HTTP link dumps	Paragraph duplicates
RDTextract	98.3	0	0	0	0
html2text	96.2	1	148	70	2,373
trafilatura	96.1	18	780	10	309
markdownify	85.5	0	246	107	5,042

RDTextract has the highest quality score and zero artifacts on all four counters; the only tie is with markdownify on double-bullets. Reproducible: see benchmark/. The domain list is committed (Tranco, deterministic); the HTML cache is .gitignored.

Trade-off: RDTextract is ~3× slower than html2text (filtering + walking + dedup) and filters more aggressively, so a handful of content-light landing pages return empty — by design.

API

`clean_html(html: str) -> str`

Strips nav, footer, script, ad, and hidden elements. Also drops responsive duplicates (mobile/desktop variants), icon-font ligatures, and role-based junk (role=navigation, role=banner, …).
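
The general technique of dropping whole subtrees by tag name, role attribute, or the hidden attribute can be sketched as follows. This is an illustration only, not the library's code: RDTextract depends on beautifulsoup4, whereas this sketch uses the standard library's `html.parser` to stay dependency-free. `JunkStripper`, `strip_junk`, and the `DROP_TAGS` / `DROP_ROLES` sets are hypothetical, it assumes reasonably well-formed HTML, and it does not cover ads, responsive duplicates, or icon-font ligatures.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"nav", "footer", "script", "style", "noscript"}  # assumed list
DROP_ROLES = {"navigation", "banner"}                         # assumed list
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class JunkStripper(HTMLParser):
    """Re-emit HTML while skipping every subtree rooted at a junk element."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip_depth = 0  # > 0 while inside a dropped subtree

    @staticmethod
    def _is_junk(tag, attrs):
        attr = dict(attrs)
        return (
            tag in DROP_TAGS
            or (attr.get("role") or "").lower() in DROP_ROLES
            or "hidden" in attr
        )

    def handle_starttag(self, tag, attrs):
        if self._skip_depth or self._is_junk(tag, attrs):
            # Void elements have no end tag, so they never add depth.
            if tag not in VOID_TAGS:
                self._skip_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never open a subtree, so depth is unchanged.
        if not self._skip_depth and not self._is_junk(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            # Character references were decoded by the parser; re-escape them.
            self.out.append(escape(data, quote=False))


def strip_junk(html: str) -> str:
    parser = JunkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Comments and the doctype are silently discarded by this sketch, which is usually acceptable when the output only feeds a Markdown converter.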

`to_markdown(cleaned_html: str) -> str`

Walk the cleaned tree and emit Markdown. Includes:

Heading levels, lists (with sublist flattening), tables (colspan, nested), code blocks, blockquotes, definition lists.
Post-processing: collapse whitespace, restore breadcrumb separators, dedup consecutive blocks, dedup global blocks (UI widgets repeated ≥3×).
Meta fallback for SPA pages.
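
The two dedup passes listed above can be sketched in a few lines. This is a guess at the behaviour, not the library's code: `dedup_blocks` is a hypothetical name, blocks are assumed to be separated by blank lines, and keeping the first copy of a globally repeated block (rather than dropping every copy) is an assumption.

```python
from collections import Counter


def dedup_blocks(markdown: str, global_threshold: int = 3) -> str:
    """Collapse consecutive duplicates, then dedup blocks repeated page-wide."""
    blocks = [b.strip() for b in markdown.split("\n\n") if b.strip()]

    # Pass 1: collapse runs of identical consecutive blocks.
    collapsed = []
    for block in blocks:
        if not collapsed or collapsed[-1] != block:
            collapsed.append(block)

    # Pass 2: for blocks seen `global_threshold` times or more (typically
    # UI widgets), keep only the first occurrence.
    counts = Counter(collapsed)
    seen = set()
    kept = []
    for block in collapsed:
        if counts[block] >= global_threshold:
            if block in seen:
                continue
            seen.add(block)
        kept.append(block)
    return "\n\n".join(kept)
```

Blocks repeated fewer than `global_threshold` times are left alone, so a sentence that legitimately appears twice in an article survives.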

`extract(html: str) -> str`

Convenience wrapper, equivalent to to_markdown(clean_html(html)).

`is_low_value_stub(markdown: str) -> bool`

Returns True if the Markdown is a paywall notice (French markers), a login stub, a lone MediaWiki skip-link, or empty. The check combines length with marker matching, which avoids false positives on real articles that happen to mention these phrases.
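
A combined length and marker check might look like the sketch below. Everything specific in it is hypothetical: the marker phrases, the 400-character threshold, and the name `looks_like_stub` are illustrative assumptions, not values taken from RDTextract.

```python
# Hypothetical French paywall / login phrases, for illustration only.
PAYWALL_MARKERS = (
    "réservé aux abonnés",
    "abonnez-vous pour lire",
    "connectez-vous pour",
)
MAX_STUB_CHARS = 400  # assumed cut-off between a stub and a real article


def looks_like_stub(markdown: str) -> bool:
    """Flag empty pages and short pages that contain a paywall marker."""
    text = markdown.strip()
    if not text:
        return True
    if len(text) > MAX_STUB_CHARS:
        # Long enough to be a real article, even if it quotes a marker.
        return False
    lowered = text.lower()
    return any(marker in lowered for marker in PAYWALL_MARKERS)
```

Checking length before markers is what keeps a long article that merely mentions a subscription offer from being discarded.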

License

MIT

Release history

0.1.1 (this version): Apr 19, 2026
0.1.0 (yanked): Apr 19, 2026. Reason this release was yanked: token visible.

Download files

Source Distribution

rdtextract-0.1.1.tar.gz (11.9 kB)

Uploaded Apr 19, 2026 Source

Built Distribution

rdtextract-0.1.1-py3-none-any.whl (12.1 kB)

Uploaded Apr 19, 2026 Python 3

File details

Details for the file rdtextract-0.1.1.tar.gz.

File metadata

Download URL: rdtextract-0.1.1.tar.gz
Upload date: Apr 19, 2026
Size: 11.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for rdtextract-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`e6aeb16efc7ed06cb4b3b5c27fcfc0041851cd588f9317854ad8acd2f58b92db`
MD5	`86b6131db78656d9a51a1095979f51af`
BLAKE2b-256	`ccf1a6cfb1b1ea5048cbefc401661bf7260414223ca7becf057ceaaba5560af2`

File details

Details for the file rdtextract-0.1.1-py3-none-any.whl.

File metadata

Download URL: rdtextract-0.1.1-py3-none-any.whl
Upload date: Apr 19, 2026
Size: 12.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for rdtextract-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c0ac3d87dcdc800996d758dbcd7a0de7a3898c1177ad8d8068b97ca650223c41`
MD5	`c838b51ba60a0ba40fbbaa670bbf6fa4`
BLAKE2b-256	`9c4c47c7d09f20df5ed2afa12f6cb5fea0d3a53ed29d4454adbfb5359cc54888`
