Python reimplementation of the Web2Text pipeline for labeling HTML DOM nodes as content or boilerplate

web2textpy

Python reimplementation of the Web2Text pipeline for labeling HTML DOM nodes as content or boilerplate using paired (raw_html, clean_text) data.

Installation

uv add web2textpy

Quick Start

from datasets import load_dataset
from web2text import run_pipeline

ds = load_dataset("williambrach/html-boilerplate-labeled", split="test")
row = ds[0]

tree, extracted_text, metrics = run_pipeline(row["html"], row["text"])

print(extracted_text[:200])
print(metrics)

Step-by-Step API

Each stage of the pipeline is exposed as a standalone function:

from web2text import build_cdom, extract_leaves, align, label_nodes, extract_text, evaluate

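# Inputs: raw HTML and its ground-truth clean text (here, the dataset row from Quick Start)
html_string, clean_text = row["html"], row["text"]

# 1. Parse HTML into a collapsed DOM tree
tree = build_cdom(html_string)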

# 2. Extract ordered text-bearing leaf nodes
leaves = extract_leaves(tree)  # [(element, "normalized text"), ...]

# 3. Align leaf texts against ground-truth clean text
scores = align(leaves, clean_text)  # {leaf_id: 0.0-1.0 match score}

# 4. Label each node as "content" or "boilerplate"
tree = label_nodes(tree, scores, threshold=0.667)

# 5. Extract text from content-labeled nodes
result = extract_text(tree)

# 6. Evaluate against ground truth
metrics = evaluate(result, clean_text)
# => {'token_f1': 0.99, 'precision': 0.99, 'recall': 0.99, 'rouge1_f': 0.99, 'bleu': 98.5, 'chrf': 98.8}
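
To make the precision, recall, and token_f1 entries concrete, here is a minimal sketch of a bag-of-tokens comparison. It is an illustration only: the function name token_prf and the whitespace tokenization are assumptions, and the package's own evaluate may tokenize or normalize differently.

```python
from collections import Counter


def token_prf(predicted: str, reference: str) -> dict[str, float]:
    """Bag-of-tokens precision, recall and F1 over whitespace tokens.

    Hypothetical helper for illustration; not part of web2textpy.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    # Multiset intersection: a shared token counts min(pred count, ref count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "token_f1": f1}


scores = token_prf("the quick brown fox", "the quick red fox jumps")
print(scores)
```

In this toy case three of the four predicted tokens appear in the five-token reference, so precision is 0.75 and recall is 0.6.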

How the Matching Algorithm Works

Given raw HTML and its known clean text, the algorithm determines which DOM nodes are content versus boilerplate in six steps:

1. Simplify the DOM: strip non-content tags (<script>, <style>, etc.) and collapse single-child chains into a Collapsed DOM (CDOM) representation.
2. Collect leaf text: walk the CDOM and concatenate the text of every leaf node into one source string, tracking each leaf's character offsets.
3. Find anchors: identify 10-character substrings that appear exactly once in both the source and the clean text, splitting the problem into independent segments.
4. DP alignment: for each segment between anchors, run character-level dynamic programming with affine gap penalties to map source characters to clean-text characters.
5. Score leaves: map the alignment results back to leaf boundaries via the stored offsets, giving each leaf a score of matched_chars / total_chars.
6. Label nodes: leaves scoring above 0.667 are labeled "content" and the rest "boilerplate", with labels propagating upward to parents.
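
The anchor step can be sketched as below. This is a hedged illustration, not the package's code: find_anchors is a hypothetical name, and the greedy filter that drops out-of-order anchors is an assumption, since the description does not say how crossing anchors are handled.

```python
from collections import Counter


def find_anchors(source: str, clean: str, k: int = 10) -> list[tuple[int, int]]:
    """Return (source_offset, clean_offset) pairs for k-grams unique in both strings.

    Hypothetical helper for illustration; not part of web2textpy.
    """

    def unique_positions(text: str) -> dict[str, int]:
        grams = [text[i:i + k] for i in range(len(text) - k + 1)]
        counts = Counter(grams)
        # Keep only k-grams that occur exactly once, with their start offset.
        return {g: i for i, g in enumerate(grams) if counts[g] == 1}

    src_unique = unique_positions(source)
    cln_unique = unique_positions(clean)
    shared = src_unique.keys() & cln_unique.keys()
    pairs = sorted((src_unique[g], cln_unique[g]) for g in shared)

    # Assumption: keep anchors greedily so clean offsets increase with source
    # offsets, which keeps the segments between anchors from crossing.
    anchors: list[tuple[int, int]] = []
    last_clean = -1
    for src_off, cln_off in pairs:
        if cln_off > last_clean:
            anchors.append((src_off, cln_off))
            last_clean = cln_off
    return anchors


source = "MENU HOME | The quick brown fox jumps. | FOOTER"
clean = "The quick brown fox jumps."
anchors = find_anchors(source, clean)
print(anchors[0], len(anchors))
```

Every anchor in this toy input has the same offset difference, because the clean text appears verbatim inside the source after a 12-character navigation prefix. On real pages the anchors instead bound the segments that are handed to the DP alignment.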

Alignment pipeline: extract leaf texts → anchor matching → DP alignment → per-leaf scores
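
The scoring and labeling steps reduce to simple arithmetic once the alignment is known. The sketch below assumes the alignment is available as one boolean per source character and that each leaf is a (start, end) offset pair; both representations and the function names are assumptions made for illustration. Upward propagation of labels to parent nodes is not covered.

```python
def score_leaves(leaf_spans: list[tuple[int, int]], matched: list[bool]) -> list[float]:
    """Score each leaf as matched_chars / total_chars.

    leaf_spans: (start, end) offsets of each leaf in the source string.
    matched: one flag per source character, True if aligned to the clean text.
    Hypothetical helper for illustration; not part of web2textpy.
    """
    scores = []
    for start, end in leaf_spans:
        total = end - start
        scores.append(sum(matched[start:end]) / total if total else 0.0)
    return scores


def label_leaves(scores: list[float], threshold: float = 0.667) -> list[str]:
    """Label a leaf "content" only when its score is strictly above the threshold."""
    return ["content" if s > threshold else "boilerplate" for s in scores]


# Source string "HomeHello worldabc": a nav leaf, a body leaf, a partly matched leaf.
leaf_spans = [(0, 4), (4, 15), (15, 18)]
matched = [False] * 4 + [True] * 11 + [True, True, False]
leaf_scores = score_leaves(leaf_spans, matched)
leaf_labels = label_leaves(leaf_scores)
print(list(zip(leaf_scores, leaf_labels)))
```

The third leaf shows why the threshold is written as 0.667 rather than 2/3: a leaf with exactly two thirds of its characters matched scores about 0.6667, which falls just below the cutoff and is labeled boilerplate.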

Dataset

Dataset: williambrach/html-boilerplate-labeled, 3,985 pages drawn from CleanEval, Dragnet, CETD, Readability, and other sources.

Source	Train (ROUGE-1 F)	Test (ROUGE-1 F)
readability	0.993 (92)	0.997 (23)
scrapinghub	0.991 (145)	0.996 (36)
cetd	0.993 (560)	0.987 (140)
google-trends-2017	0.986 (144)	0.995 (36)
cleanportaleval	0.985 (57)	0.971 (14)
cleaneval	0.985 (590)	0.991 (148)
dragnet	0.983 (1,103)	0.983 (276)
l3s-gn1	0.920 (497)	0.927 (124)
Overall	0.976 (3,188)	0.978 (797)

Sample counts in parentheses.
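
As a sanity check on the table, the Overall row is consistent with a sample-weighted mean of the per-source scores. The sketch below assumes that is how it was computed (the table does not say) and works from the rounded per-source values, so it can only confirm agreement to three decimals.

```python
# (score, sample count) for train and test, copied from the table above.
rows = {
    "readability": ((0.993, 92), (0.997, 23)),
    "scrapinghub": ((0.991, 145), (0.996, 36)),
    "cetd": ((0.993, 560), (0.987, 140)),
    "google-trends-2017": ((0.986, 144), (0.995, 36)),
    "cleanportaleval": ((0.985, 57), (0.971, 14)),
    "cleaneval": ((0.985, 590), (0.991, 148)),
    "dragnet": ((0.983, 1103), (0.983, 276)),
    "l3s-gn1": ((0.920, 497), (0.927, 124)),
}


def weighted_mean(pairs: list[tuple[float, int]]) -> float:
    """Mean of the scores, weighted by sample count."""
    total = sum(n for _, n in pairs)
    return sum(score * n for score, n in pairs) / total


train_pairs = [train for train, _ in rows.values()]
test_pairs = [test for _, test in rows.values()]
train_overall = weighted_mean(train_pairs)
test_overall = weighted_mean(test_pairs)
print(round(train_overall, 3), round(test_overall, 3))
```

The sample counts sum to 3,188 train and 797 test pages, which together give the 3,985 pages in the dataset.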

Original Work

Paper: Vogels et al., "Web2Text: Deep Structured Boilerplate Removal" (ECIR 2018) — arxiv.org/abs/1801.02607
Original implementation (Scala): github.com/dalab/web2text


Version: 0.1.0 (released Apr 2, 2026)


