HTML→Markdown extractor optimized for LLM training corpora — zero noise artifacts, integrated quality scoring, low-value stub detection.

Project description

RDTextract

HTML→Markdown extractor built for AI / LLM training corpora — every byte should carry signal, not boilerplate.

Language scope (current): the filters, paywall markers, and is_low_value_stub() heuristic are tuned for French content, because the corpus this library was extracted from is French-only. Core extraction is language-agnostic and works on any language; only the stub detection is French-specific. Multi-language support is on the roadmap.

  • Zero noise artifacts — no double-bullets, no orphan punctuation, no link dumps, no widget repetition.
  • Integrated quality scoring: is_low_value_stub() filters paywalls, login walls, skip-link stubs, and empty pages.
  • Meta fallback — degrades gracefully on SPA / React pages where the body is empty (extracts <title> + meta description instead of returning nothing).
  • One dependency: beautifulsoup4.
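
The meta fallback mentioned above can be illustrated with a minimal, standard-library-only sketch. It is not RDTextract's implementation (which builds on beautifulsoup4): the names `_MetaCollector` and `meta_fallback` are hypothetical, and rendering the title as a level-1 heading is an assumption.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collects the <title> text, the meta description, and visible body text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_text = []
        self._stack = []  # open tags we track: title, body, script, style

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "description":
            self.description = (attr_map.get("content") or "").strip()
        elif tag in ("title", "body", "script", "style"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop tracked tags up to and including the one being closed.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self._stack:
            return
        top = self._stack[-1]
        if top == "title":
            self.title += data
        elif top == "body" and data.strip():
            self.body_text.append(data.strip())  # script/style text is ignored


def meta_fallback(markup: str) -> str:
    """Return title + description as Markdown, but only when the body is empty."""
    parser = _MetaCollector()
    parser.feed(markup)
    parser.close()
    if parser.body_text:
        return ""  # the body has content, so normal extraction would apply
    parts = []
    title = parser.title.strip()
    if title:
        parts.append(f"# {title}")  # assumption: title rendered as a heading
    if parser.description:
        parts.append(parser.description)
    return "\n\n".join(parts)


spa_page = (
    '<html><head><title>My App</title>'
    '<meta name="description" content="A demo SPA."></head>'
    '<body><div id="root"></div><script>render()</script></body></html>'
)
print(meta_fallback(spa_page))
```

The point of the design is that a client-rendered page still contributes a short, meaningful record to the corpus rather than an empty string.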

Install

pip install RDTextract

Quick start

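from pathlib import Path

import rdtextract

# Read the page as UTF-8 text (adjust the encoding to match your source)
html = Path("page.html").read_text(encoding="utf-8")

# One-shot
markdown = rdtextract.extract(html)

# Or two-step (re-use cleaned HTML for caching, debugging, etc.)
cleaned = rdtextract.clean_html(html)
markdown = rdtextract.to_markdown(cleaned)

# Filter low-value pages before writing to your corpus
if not rdtextract.is_low_value_stub(markdown):
    # "page.md" is an example output path
    Path("page.md").write_text(markdown, encoding="utf-8")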

Note: the PyPI distribution is named RDTextract; the Python module is rdtextract (lowercase, per PEP 8).

Why another HTML→Markdown lib?

Existing tools target human readability (newsletters, archives). RDTextract targets LLM training data: every byte should carry signal.

Benchmark on Tranco top-1000 .fr homepages (672 pages successfully fetched and processed), measured against a corpus quality scorer (lower artifact counts = cleaner output):

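| Extractor   | Quality (corrected) | double-bullet | orphan punct | http dump | para dup |
|-------------|---------------------|---------------|--------------|-----------|----------|
| RDTextract  | 98.3                | 0             | 0            | 0         | 0        |
| html2text   | 96.2                | 1             | 148          | 70        | 2 373    |
| trafilatura | 96.1                | 18            | 780          | 10        | 309      |
| markdownify | 85.5                | 0             | 246          | 107       | 5 042    |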

RDTextract wins on quality and on all 4 artifact metrics — zero artifacts on every counter. Reproducible: see benchmark/ — domain list is committed (Tranco, deterministic), HTML cache is .gitignored.
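
The exact definitions used by the benchmark scorer live in benchmark/ and are not reproduced here. As a rough illustration of what an artifact counter can look like, the sketch below counts two of the four artifact types with regular expressions. Both patterns and the name `count_artifacts` are assumptions made for this example, not the scorer's actual rules.

```python
import re

# Assumed definition: a list marker immediately followed by another list marker.
DOUBLE_BULLET = re.compile(r"^[ \t]*[-*+•][ \t]+[-*+•][ \t]+\S", re.MULTILINE)
# Assumed definition: a line made up only of punctuation characters.
ORPHAN_PUNCT = re.compile(r"^[ \t]*[.,;:!?|·]+[ \t]*$", re.MULTILINE)


def count_artifacts(markdown: str) -> dict:
    """Count two kinds of extraction artifacts in a Markdown string."""
    return {
        "double_bullet": len(DOUBLE_BULLET.findall(markdown)),
        "orphan_punct": len(ORPHAN_PUNCT.findall(markdown)),
    }


sample = "- - item\n- ok\n,\ntext.\n* • nested"
print(count_artifacts(sample))
```

Note that this naive double-bullet pattern would also match a spaced thematic break such as `- - -`; a real scorer would need to exclude that case.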

Trade-off: RDTextract is ~3× slower than html2text (filtering + walking + dedup) and filters more aggressively, so a handful of content-light landing pages return empty — by design.

API

clean_html(html: str) -> str

Strip nav/footer/scripts/ads/hidden elements. Drops responsive duplicates (mobile/desktop variants), icon font ligatures, role-based junk (role=navigation, role=banner, …).
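
To make the idea concrete, here is a minimal sketch of tag- and role-based stripping using only the standard library. It is not the library's code (RDTextract relies on beautifulsoup4 and handles many more cases); `strip_junk`, `DROP_TAGS`, and `DROP_ROLES` are hypothetical names, and the sketch assumes well-formed HTML in which every non-void element is explicitly closed. Comments and doctypes are dropped as a side effect.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"nav", "footer", "script", "style", "noscript"}
DROP_ROLES = {"navigation", "banner", "contentinfo"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _JunkStripper(HTMLParser):
    """Re-emits the markup while skipping unwanted subtrees."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip_depth = 0  # > 0 while inside a dropped subtree

    def _is_junk(self, tag, attrs):
        attr_map = dict(attrs)
        role = (attr_map.get("role") or "").lower()
        return tag in DROP_TAGS or role in DROP_ROLES or "hidden" in attr_map

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.handle_startendtag(tag, attrs)  # void tags never nest
        elif self._skip_depth or self._is_junk(tag, attrs):
            self._skip_depth += 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self._skip_depth and not self._is_junk(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.out.append(escape(data, quote=False))


def strip_junk(markup: str) -> str:
    parser = _JunkStripper()
    parser.feed(markup)
    parser.close()  # flushes any buffered text
    return "".join(parser.out)


page = ('<body><nav><a href="/">Home</a><br></nav>'
        '<main><p>Hello &amp; welcome</p><div role="banner">Ad</div></main>'
        '<script>x()</script><footer>(c) 2024</footer></body>')
print(strip_junk(page))
```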

to_markdown(cleaned_html: str) -> str

Walk the cleaned tree and emit Markdown. Includes:

  • Heading levels, lists (with sublist flattening), tables (colspan, nested), code blocks, blockquotes, definition lists.
  • Post-processing: collapse whitespace, restore breadcrumb separators, dedup consecutive blocks, dedup global blocks (UI widgets repeated ≥3×).
  • Meta fallback for SPA pages.
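
The post-processing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `postprocess` and `max_repeats` are hypothetical names, blocks are assumed to be separated by blank lines, and a globally repeated block is assumed to be dropped entirely rather than kept once. Breadcrumb restoration is left out.

```python
import re
from collections import Counter


def postprocess(markdown: str, max_repeats: int = 3) -> str:
    # Collapse runs of spaces/tabs inside each line. Leading indentation is
    # preserved because the lookbehind requires a preceding non-space character.
    lines = [re.sub(r"(?<=\S)[ \t]+", " ", line).rstrip()
             for line in markdown.splitlines()]
    text = "\n".join(lines)

    # Split into blocks on blank lines.
    blocks = [b.strip("\n") for b in re.split(r"\n\s*\n", text) if b.strip()]

    # Drop a block that is identical to the one right before it.
    deduped = []
    for block in blocks:
        if not deduped or deduped[-1] != block:
            deduped.append(block)

    # Drop blocks that still occur max_repeats times or more (repeated widgets).
    counts = Counter(deduped)
    kept = [b for b in deduped if counts[b] < max_repeats]
    return "\n\n".join(kept)


raw = ("Title\n\nShare   this\n\nPara one\n\nPara one\n\n"
       "Share this\n\nPara two\n\nShare this")
print(postprocess(raw))
```

A production version would need to leave fenced code blocks untouched, since whitespace inside them is significant.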

extract(html: str) -> str

Convenience: to_markdown(clean_html(html)).

is_low_value_stub(markdown: str) -> bool

True if the markdown is a paywall (FR markers), login stub, MediaWiki skip-link alone, or empty. Combined length + marker check (avoids false positives on real articles that happen to mention these phrases).
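
The combined length + marker check can be illustrated with the minimal sketch below. The marker phrases, the 600-character threshold, and the name `looks_like_stub` are all assumptions chosen for the example; the library's actual marker list and thresholds are not documented here.

```python
# Hypothetical French paywall/login phrases, for illustration only.
STUB_MARKERS = (
    "réservé aux abonnés",
    "abonnez-vous pour lire",
    "connectez-vous pour",
)


def looks_like_stub(markdown: str, max_stub_chars: int = 600) -> bool:
    """Flag empty pages and short pages that contain a paywall/login marker."""
    text = markdown.strip()
    if not text:
        return True  # empty output is always low value
    if len(text) > max_stub_chars:
        # Long enough to be a real article, even if it mentions a marker.
        return False
    lowered = text.lower()
    return any(marker in lowered for marker in STUB_MARKERS)


short_paywall = "Cet article est réservé aux abonnés."
long_article = "Un article qui mentionne 'réservé aux abonnés'. " + "x" * 1000
print(looks_like_stub(short_paywall), looks_like_stub(long_article))
```

Checking length before markers is what prevents a full-length article that merely quotes one of these phrases from being discarded.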

License

MIT

Download files

Source Distribution

rdtextract-0.1.1.tar.gz (11.9 kB)

Uploaded Source

Built Distribution

rdtextract-0.1.1-py3-none-any.whl (12.1 kB)

Uploaded Python 3

File details

Details for the file rdtextract-0.1.1.tar.gz.

File metadata

  • Download URL: rdtextract-0.1.1.tar.gz
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for rdtextract-0.1.1.tar.gz
Algorithm Hash digest
SHA256 e6aeb16efc7ed06cb4b3b5c27fcfc0041851cd588f9317854ad8acd2f58b92db
MD5 86b6131db78656d9a51a1095979f51af
BLAKE2b-256 ccf1a6cfb1b1ea5048cbefc401661bf7260414223ca7becf057ceaaba5560af2

File details

Details for the file rdtextract-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: rdtextract-0.1.1-py3-none-any.whl
  • Size: 12.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for rdtextract-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c0ac3d87dcdc800996d758dbcd7a0de7a3898c1177ad8d8068b97ca650223c41
MD5 c838b51ba60a0ba40fbbaa670bbf6fa4
BLAKE2b-256 9c4c47c7d09f20df5ed2afa12f6cb5fea0d3a53ed29d4454adbfb5359cc54888
