HTML→Markdown extractor optimized for LLM training corpora — zero noise artifacts, integrated quality scoring, low-value stub detection.

Reason this release was yanked: Token visible

Project description

RDTextract

HTML→Markdown extractor built for AI / LLM training corpora — every byte should carry signal, not boilerplate.

Language scope (current): the filters, paywall markers, and is_low_value_stub() heuristic are tuned for French content (the corpus this library was extracted from is FR-only). Core extraction is language-agnostic and works on any language; only stub detection is FR-specific. Multi-language support is on the roadmap.

  • Zero noise artifacts — no double-bullets, no orphan punctuation, no link dumps, no widget repetition.
  • Integrated quality scoring: is_low_value_stub() filters paywalls, login walls, skip-link stubs, and empty pages.
  • Meta fallback — degrades gracefully on SPA / React pages where the body is empty (extracts <title> + meta description instead of returning nothing).
  • One dependency: beautifulsoup4.
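The meta fallback idea can be illustrated with a minimal sketch. This is not the library's implementation (RDTextract depends on beautifulsoup4; this sketch uses only the standard library), and the helper name meta_fallback and its output layout are assumptions made for illustration.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collects the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def meta_fallback(html: str) -> str:
    """Hypothetical helper: build a tiny Markdown document from page metadata."""
    collector = _MetaCollector()
    collector.feed(html)
    collector.close()
    parts = []
    if collector.title.strip():
        parts.append("# " + collector.title.strip())
    if collector.description:
        parts.append(collector.description)
    # Title as a heading, description as a paragraph; empty string if neither exists.
    return "\n\n".join(parts)
```

A caller would only reach for something like this when the normal extraction returns nothing, which is the typical situation on client-rendered pages whose body is an empty root element.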

Install

pip install RDTextract

Quick start

import rdtextract

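# Read the page with an explicit encoding and close the file afterwards
with open("page.html", encoding="utf-8") as f:
    html = f.read()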

# One-shot
markdown = rdtextract.extract(html)

# Or two-step (re-use cleaned HTML for caching, debugging, etc.)
cleaned = rdtextract.clean_html(html)
markdown = rdtextract.to_markdown(cleaned)

# Filter low-value pages before writing to your corpus
# (save() stands in for your own persistence function)
if not rdtextract.is_low_value_stub(markdown):
    save(markdown)

Note: the PyPI distribution is named RDTextract; the Python module is rdtextract (lowercase, per PEP 8).
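For a whole corpus, the quick-start steps can be wrapped in a small generator. The sketch below is an assumption about how one might structure it, not part of the library: build_corpus is a hypothetical name, and the extractor and stub filter are passed in as callables so the function can be exercised without RDTextract installed.

```python
from typing import Callable, Iterable, Iterator


def build_corpus(
    pages: Iterable[str],
    extract: Callable[[str], str],
    is_stub: Callable[[str], bool],
) -> Iterator[str]:
    """Yield Markdown documents worth keeping, skipping empty and stub pages."""
    for html in pages:
        markdown = extract(html)
        if not markdown.strip():
            # Aggressive filtering can legitimately return nothing for a page.
            continue
        if is_stub(markdown):
            continue
        yield markdown


# Intended wiring with the real library (not executed here):
#   import rdtextract
#   docs = build_corpus(pages, rdtextract.extract, rdtextract.is_low_value_stub)
```

Passing the two functions as arguments also makes it easy to swap in a different stub detector for non-French corpora.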

Why another HTML→Markdown lib?

Existing tools target human readability (newsletters, archives). RDTextract targets LLM training data: every byte should carry signal.

Benchmark on Tranco top-1000 .fr homepages (672 pages successfully fetched and processed), measured against a corpus quality scorer (lower artifact counts = cleaner output):

| Extractor | Quality (corrected) | double-bullet | orphan punct | http dump | para dup |
|---|---|---|---|---|---|
| RDTextract | 98.3 | 0 | 0 | 0 | 0 |
| html2text | 96.2 | 1 | 148 | 70 | 2,373 |
| trafilatura | 96.1 | 18 | 780 | 10 | 309 |
| markdownify | 85.5 | 0 | 246 | 107 | 5,042 |

RDTextract has the highest quality score and is the only extractor with zero artifacts on all four counters. Reproducible: see benchmark/. The domain list is committed (Tranco, deterministic); the HTML cache is .gitignored.
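To make the artifact columns concrete, here is a rough sketch of how three of them could be counted. The definitions below are assumptions chosen for illustration; the actual scorer lives in benchmark/ and may define each counter differently. The http dump counter is left out because its definition is not described here.

```python
import re
from collections import Counter


def count_artifacts(markdown: str) -> dict:
    """Count three artifact types using assumed, simplified definitions."""
    lines = markdown.splitlines()

    # Double bullet: a list marker immediately followed by another list marker.
    double_bullet = sum(
        1 for ln in lines if re.match(r"^\s*[-*+•]\s+[-*+•]\s", ln)
    )

    # Orphan punctuation: a line that contains nothing but punctuation.
    orphan_punct = sum(
        1 for ln in lines if re.fullmatch(r"\s*[.,;:!?|·»«)(]+\s*", ln)
    )

    # Duplicate paragraphs: every repeat of an identical block after the first.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    para_dup = sum(n - 1 for n in Counter(paragraphs).values() if n > 1)

    return {
        "double_bullet": double_bullet,
        "orphan_punct": orphan_punct,
        "para_dup": para_dup,
    }
```

One known limitation of this simplified version: a Markdown thematic break written as "- - -" would be counted as a double bullet.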

Trade-off: RDTextract is ~3× slower than html2text (filtering + walking + dedup) and filters more aggressively, so a handful of content-light landing pages return empty — by design.

API

clean_html(html: str) -> str

Strip nav/footer/scripts/ads/hidden elements. Drops responsive duplicates (mobile/desktop variants), icon font ligatures, role-based junk (role=navigation, role=banner, …).
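The general technique of dropping boilerplate subtrees by tag name, role, or the hidden attribute can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the library's code: it returns visible text chunks rather than cleaned HTML, it assumes reasonably well-formed markup, and the names visible_text, SKIP_TAGS, and SKIP_ROLES are hypothetical.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "footer", "script", "style", "noscript"}  # assumed list
SKIP_ROLES = {"navigation", "banner"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _VisibleText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag, so they do not nest
        attr = dict(attrs)
        if self._skip_depth:
            self._skip_depth += 1
        elif (tag in SKIP_TAGS
              or (attr.get("role") or "").lower() in SKIP_ROLES
              or "hidden" in attr):
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> list:
    """Return text chunks that sit outside nav/footer/script/role-based junk."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

The depth counter is what lets a single dropped element take its whole subtree with it; unclosed tags inside a dropped region would confuse it, which is one reason a real implementation works on a parsed tree instead.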

to_markdown(cleaned_html: str) -> str

Walk the cleaned tree and emit Markdown. Includes:

  • Heading levels, lists (with sublist flattening), tables (colspan, nested), code blocks, blockquotes, definition lists.
  • Post-processing: collapse whitespace, restore breadcrumb separators, dedup consecutive blocks, dedup global blocks (UI widgets repeated ≥3×).
  • Meta fallback for SPA pages.
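The two dedup passes in the post-processing step can be sketched as follows. This is a minimal illustration, not the library's code: dedup_blocks is a hypothetical name, blocks are assumed to be separated by blank lines, and keeping the first occurrence of a widget block (rather than dropping every copy) is an assumption.

```python
from collections import Counter


def dedup_blocks(markdown: str, global_threshold: int = 3) -> str:
    """Drop consecutive duplicate blocks, then collapse blocks repeated >= threshold times."""
    blocks = [b.strip() for b in markdown.split("\n\n") if b.strip()]

    # Pass 1: drop a block that is identical to the one just before it.
    deduped = []
    for block in blocks:
        if not deduped or deduped[-1] != block:
            deduped.append(block)

    # Pass 2: a block seen at least `global_threshold` times anywhere in the
    # document is treated as a repeated UI widget; only its first copy is kept.
    counts = Counter(deduped)
    seen = set()
    result = []
    for block in deduped:
        if counts[block] >= global_threshold:
            if block in seen:
                continue
            seen.add(block)
        result.append(block)

    return "\n\n".join(result)
```

The threshold keeps legitimate repetition intact: a block that appears twice (a quoted sentence, a repeated subheading) survives, while a share button rendered after every section does not.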

extract(html: str) -> str

Convenience: to_markdown(clean_html(html)).

is_low_value_stub(markdown: str) -> bool

Returns True if the Markdown is a paywall page (FR markers), a login stub, a lone MediaWiki skip-link, or empty. The check combines length and markers, which avoids false positives on real articles that happen to mention these phrases.
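A combined length and marker check might look like the sketch below. The marker phrases, the length threshold, and the name looks_like_stub are all hypothetical; the library's actual markers and limits are not reproduced here.

```python
# Hypothetical French markers and threshold, chosen only for illustration.
STUB_MARKERS = ("réservé aux abonnés", "connectez-vous", "aller au contenu")
MAX_STUB_LENGTH = 400


def looks_like_stub(markdown: str, max_length: int = MAX_STUB_LENGTH) -> bool:
    """Flag empty pages, and short pages that contain a stub marker."""
    text = markdown.strip().lower()
    if not text:
        return True
    if len(text) > max_length:
        # Long documents are kept even if they mention a marker phrase.
        return False
    return any(marker in text for marker in STUB_MARKERS)
```

Checking length before markers is what protects a full article about paywalls from being discarded just because it quotes a paywall phrase.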

License

MIT

Download files

Source Distribution

rdtextract-0.1.0.tar.gz (11.8 kB)

Uploaded Source

Built Distribution

rdtextract-0.1.0-py3-none-any.whl (12.1 kB)

Uploaded Python 3

File details

Details for the file rdtextract-0.1.0.tar.gz.

File metadata

  • Download URL: rdtextract-0.1.0.tar.gz
  • Size: 11.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for rdtextract-0.1.0.tar.gz
Algorithm Hash digest
SHA256 4a2268a8ef430076670e49b37372a907aa939a27fdba1d7257109dd9d90856c2
MD5 9cc40aeea8c5f9b7a59bd1be6271574f
BLAKE2b-256 2e37e7f8965128124f2cfdea996978ed851f3ed261bd24eb5b43f2285a4735bb

File details

Details for the file rdtextract-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: rdtextract-0.1.0-py3-none-any.whl
  • Size: 12.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for rdtextract-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 202752bcc8bcf578fba6bc20c9d6093cd7da5d3a87975f82e987866e43a50573
MD5 e1a88d27846c1cf8ab0eb7a9dc8ff8ae
BLAKE2b-256 51379c9faa15a67420e70e518a35fc4f78fbe538611f1ebc7967fbaba38746ae
