HTML→Markdown extractor optimized for LLM training corpora — zero noise artifacts, integrated quality scoring, low-value stub detection.
RDTextract
HTML→Markdown extractor built for AI / LLM training corpora — every byte should carry signal, not boilerplate.
Language scope (current): filters, paywall markers, and the is_low_value_stub() heuristic are tuned for French content (the corpus this lib was extracted from is FR-only). Core extraction is language-agnostic and works on any language; only the stub detector is FR-specific. Multi-language support is on the roadmap.
- Zero noise artifacts: no double-bullets, no orphan punctuation, no link dumps, no widget repetition.
- Integrated quality scoring: is_low_value_stub() filters paywalls, login walls, skip-link stubs, and empty pages.
- Meta fallback: degrades gracefully on SPA / React pages where the body is empty (extracts <title> + meta description instead of returning nothing).
- One dependency: beautifulsoup4.
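The meta fallback idea can be illustrated with a minimal sketch. This is not RDTextract's implementation: it uses only the standard library rather than beautifulsoup4, and the names `MetaFallbackParser` and `meta_fallback`, as well as the choice to emit the title as a level-1 heading, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MetaFallbackParser(HTMLParser):
    """Collect the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def meta_fallback(html: str) -> str:
    """Hypothetical helper: build a tiny Markdown document from page metadata."""
    parser = MetaFallbackParser()
    parser.feed(html)
    title = "".join(parser.title_parts).strip()
    blocks = []
    if title:
        blocks.append(f"# {title}")  # assumed output shape, not the library's
    if parser.description:
        blocks.append(parser.description)
    return "\n\n".join(blocks)
```

For a SPA shell whose body holds only an empty root element, a helper like this still returns the title and description, which is the behaviour the feature list describes.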
Install
pip install RDTextract
Quick start
import rdtextract
html = open("page.html").read()
# One-shot
markdown = rdtextract.extract(html)
# Or two-step (re-use cleaned HTML for caching, debugging, etc.)
cleaned = rdtextract.clean_html(html)
markdown = rdtextract.to_markdown(cleaned)
# Filter low-value pages before writing to your corpus
if not rdtextract.is_low_value_stub(markdown):
    save(markdown)  # save() stands for your own persistence function
Note: the PyPI distribution is RDTextract; the Python module is rdtextract (PEP 8 lowercase).
Why another HTML→Markdown lib?
Existing tools target human readability (newsletters, archive). RDTextract targets LLM training data: every byte should carry signal.
Benchmark on Tranco top-1000 .fr homepages (672 pages successfully fetched and processed), measured against a corpus quality scorer (lower artifact counts = cleaner output):
| Extractor | Quality (corrected) | double-bullet | orphan punct | http dump | para dup |
|---|---|---|---|---|---|
| RDTextract | 98.3 | 0 | 0 | 0 | 0 |
| html2text | 96.2 | 1 | 148 | 70 | 2 373 |
| trafilatura | 96.1 | 18 | 780 | 10 | 309 |
| markdownify | 85.5 | 0 | 246 | 107 | 5 042 |
RDTextract scores highest on quality and records zero artifacts on all 4 counters (markdownify ties it at zero on double-bullets only). Reproducible: see benchmark/. The domain list is committed (Tranco, deterministic); the HTML cache is .gitignored.
Trade-off: RDTextract is ~3× slower than html2text (filtering + walking + dedup) and filters more aggressively, so a handful of content-light landing pages return empty — by design.
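To make the artifact columns more concrete, here is a rough sketch of how such counters could be computed. The README does not define the metrics, so the regular expressions and the `count_artifacts` name below are assumptions, not the scorer used in benchmark/. The http dump counter is left out because its definition is not stated.

```python
import re
from collections import Counter

# Assumed definitions (not taken from the benchmark scorer):
# a bullet marker immediately followed by a second bullet marker, e.g. "- - item"
DOUBLE_BULLET = re.compile(r"^\s*[-*+]\s+[-*+•]\s")
# a line holding nothing but one or two sentence punctuation characters
ORPHAN_PUNCT = re.compile(r"^\s*[.,;:!?]{1,2}\s*$")


def count_artifacts(markdown: str) -> dict:
    """Count a few noise patterns in extractor output."""
    lines = markdown.splitlines()
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    repeats = Counter(paragraphs)
    return {
        "double_bullet": sum(1 for line in lines if DOUBLE_BULLET.match(line)),
        "orphan_punct": sum(1 for line in lines if ORPHAN_PUNCT.match(line)),
        # every copy beyond the first counts as one duplicate
        "para_dup": sum(n - 1 for n in repeats.values() if n > 1),
    }
```

Summing these counts over the output for every page would give per-extractor totals comparable in spirit to the table columns.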
API
clean_html(html: str) -> str
Strip nav/footer/scripts/ads/hidden elements. Drops responsive duplicates (mobile/desktop variants), icon font ligatures, role-based junk (role=navigation, role=banner, …).
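As an illustration of the subtree-stripping idea, the sketch below drops elements by tag name, by role attribute, and by the hidden attribute. It is a simplified stand-in built on the standard library, not the code behind clean_html(): the `DROP_TAGS` and `DROP_ROLES` sets and the `strip_junk` name are assumptions, and it does not attempt ad detection, responsive-duplicate removal, or icon-font handling.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"nav", "footer", "script", "style", "noscript"}  # assumed list
DROP_ROLES = {"navigation", "banner"}                         # assumed list
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SubtreeStripper(HTMLParser):
    """Re-emit HTML while skipping every junk element and its descendants."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_stack = []  # tags opened since entering a dropped subtree

    @staticmethod
    def _is_junk(tag, attrs):
        attr = dict(attrs)
        role = (attr.get("role") or "").lower()
        return tag in DROP_TAGS or role in DROP_ROLES or "hidden" in attr

    @staticmethod
    def _render(attrs):
        return "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )

    def handle_starttag(self, tag, attrs):
        if self.skip_stack or self._is_junk(tag, attrs):
            if tag not in VOID_TAGS:  # void tags never get an end tag
                self.skip_stack.append(tag)
            return
        self.out.append(f"<{tag}{self._render(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if self.skip_stack or self._is_junk(tag, attrs):
            return
        self.out.append(f"<{tag}{self._render(attrs)}/>")

    def handle_endtag(self, tag):
        if self.skip_stack:
            if tag in self.skip_stack:
                # unwind to the matching open tag (tolerates unclosed children)
                while self.skip_stack.pop() != tag:
                    pass
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_stack:
            self.out.append(escape(data, quote=False))


def strip_junk(html: str) -> str:
    """Hypothetical helper: return the HTML without the dropped subtrees."""
    parser = SubtreeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Tracking a stack of open tags, rather than a simple depth counter, keeps the skip state consistent when the dropped markup contains unclosed elements such as bare li tags.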
to_markdown(cleaned_html: str) -> str
Walk the cleaned tree and emit Markdown. Includes:
- Heading levels, lists (with sublist flattening), tables (colspan, nested), code blocks, blockquotes, definition lists.
- Post-processing: collapse whitespace, restore breadcrumb separators, dedup consecutive blocks, dedup global blocks (UI widgets repeated ≥3×).
- Meta fallback for SPA pages.
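The two dedup passes in the post-processing step can be sketched as follows. This is an assumption-laden illustration rather than the library's code: blocks are taken to be blank-line-separated paragraphs, `dedup_blocks` is a hypothetical name, and a block that reaches the global threshold is dropped entirely (the README does not say whether one copy is kept).

```python
from collections import Counter


def dedup_blocks(markdown: str, global_threshold: int = 3) -> str:
    """Drop widely repeated blocks, then collapse consecutive duplicates."""
    blocks = [b.strip() for b in markdown.split("\n\n") if b.strip()]
    counts = Counter(blocks)
    kept = []
    for block in blocks:
        if counts[block] >= global_threshold:
            continue  # assumed: a block seen 3+ times is a UI widget
        if kept and kept[-1] == block:
            continue  # consecutive duplicate of the last kept block
        kept.append(block)
    return "\n\n".join(kept)
```

With this sketch, a "Newsletter" block repeated three times across a page disappears, while a paragraph that merely appears twice in a row is kept once.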
extract(html: str) -> str
Convenience: to_markdown(clean_html(html)).
is_low_value_stub(markdown: str) -> bool
True if the markdown is a paywall (FR markers), login stub, MediaWiki skip-link alone, or empty. Combined length + marker check (avoids false positives on real articles that happen to mention these phrases).
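The combined length + marker check can be pictured with a small sketch. The marker phrases, the 400-character limit, and the `looks_like_stub` name are all hypothetical; the real is_low_value_stub() may use different markers and thresholds, and this sketch ignores the MediaWiki skip-link case.

```python
# Hypothetical French markers and length limit, for illustration only.
STUB_MARKERS = (
    "réservé aux abonnés",
    "abonnez-vous pour lire",
    "connectez-vous pour",
)
MAX_STUB_CHARS = 400


def looks_like_stub(markdown: str) -> bool:
    """Flag empty pages and short pages that contain a paywall/login marker."""
    text = markdown.strip()
    if not text:
        return True  # empty page
    if len(text) > MAX_STUB_CHARS:
        return False  # long enough to be a real article, whatever it mentions
    lowered = text.lower()
    return any(marker in lowered for marker in STUB_MARKERS)
```

Checking the length before the markers is what prevents a full article that quotes a paywall phrase from being discarded.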
License
MIT
File details
Details for the file rdtextract-0.1.1.tar.gz.
File metadata
- Download URL: rdtextract-0.1.1.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e6aeb16efc7ed06cb4b3b5c27fcfc0041851cd588f9317854ad8acd2f58b92db |
| MD5 | 86b6131db78656d9a51a1095979f51af |
| BLAKE2b-256 | ccf1a6cfb1b1ea5048cbefc401661bf7260414223ca7becf057ceaaba5560af2 |
File details
Details for the file rdtextract-0.1.1-py3-none-any.whl.
File metadata
- Download URL: rdtextract-0.1.1-py3-none-any.whl
- Upload date:
- Size: 12.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c0ac3d87dcdc800996d758dbcd7a0de7a3898c1177ad8d8068b97ca650223c41 |
| MD5 | c838b51ba60a0ba40fbbaa670bbf6fa4 |
| BLAKE2b-256 | 9c4c47c7d09f20df5ed2afa12f6cb5fea0d3a53ed29d4454adbfb5359cc54888 |