web2textpy
Python reimplementation of the Web2Text pipeline for labeling HTML DOM nodes as content or boilerplate using paired (raw_html, clean_text) data.
Installation
uv add web2textpy
Quick Start
from datasets import load_dataset
from web2text import run_pipeline
ds = load_dataset("williambrach/html-boilerplate-labeled", split="test")
row = ds[0]
tree, extracted_text, metrics = run_pipeline(row["html"], row["text"])
print(extracted_text[:200])
print(metrics)
Step-by-Step API
Each stage of the pipeline is exposed as a standalone function:
from web2text import build_cdom, extract_leaves, align, label_nodes, extract_text, evaluate
# 1. Parse HTML into a collapsed DOM tree
tree = build_cdom(html_string)
# 2. Extract ordered text-bearing leaf nodes
leaves = extract_leaves(tree) # [(element, "normalized text"), ...]
# 3. Align leaf texts against ground-truth clean text
scores = align(leaves, clean_text) # {leaf_id: 0.0-1.0 match score}
# 4. Label each node as "content" or "boilerplate"
tree = label_nodes(tree, scores, threshold=0.667)
# 5. Extract text from content-labeled nodes
result = extract_text(tree)
# 6. Evaluate against ground truth
metrics = evaluate(result, clean_text)
# => {'token_f1': 0.99, 'precision': 0.99, 'recall': 0.99, 'rouge1_f': 0.99, 'bleu': 98.5, 'chrf': 98.8}
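The README does not say how `evaluate` defines its token-level metrics. As a rough illustration only, the sketch below computes precision, recall, and F1 over whitespace-separated tokens with bag-of-tokens overlap. The function name `token_prf` and the whitespace tokenization are assumptions made for this example, not the library's confirmed behavior.

```python
from collections import Counter


def token_prf(predicted: str, reference: str) -> dict:
    """Bag-of-tokens precision/recall/F1 (hypothetical helper, not part of web2text)."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()

    # Multiset intersection: each token counts at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "token_f1": f1}


print(token_prf("the cat sat", "the cat sat down"))
```

Here the prediction misses one reference token, so precision stays at 1.0 while recall drops to 0.75.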
How the Matching Algorithm Works
Given raw HTML and its known clean text, the algorithm determines which DOM nodes are content versus boilerplate in six steps:
- Simplify the DOM: strip non-content tags (<script>, <style>, etc.) and collapse single-child chains into a Collapsed DOM (CDOM) representation
- Collect leaf text: walk the CDOM, concatenate text from every leaf node into one source string with tracked character offsets
- Find anchors: identify 10-character substrings that appear exactly once in both the source and clean text, splitting the problem into independent segments
- DP alignment: for each segment between anchors, run character-level dynamic programming with affine gap penalties to map source characters to clean-text characters
- Score leaves: map alignment results back to leaf boundaries via stored offsets, giving each leaf a score: matched_chars / total_chars
- Label nodes: leaves scoring above 0.667 are labeled "content", the rest "boilerplate", with labels propagating upward to parents
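The offset tracking, anchor search, scoring, and labeling steps above can be sketched on plain strings. This is a simplified illustration, not the package's implementation: all function names are hypothetical, leaves are bare strings rather than DOM elements, the standard library's `difflib.SequenceMatcher` stands in for the affine-gap dynamic programming, the anchors are computed but not used to segment the alignment, and upward label propagation is omitted.

```python
from collections import Counter
from difflib import SequenceMatcher


def join_leaves(leaf_texts, sep=" "):
    """Concatenate leaf texts and record each leaf's (start, end) offsets."""
    spans, pos = [], 0
    for text in leaf_texts:
        spans.append((pos, pos + len(text)))
        pos += len(text) + len(sep)
    return sep.join(leaf_texts), spans


def find_anchors(source, clean, k=10):
    """Return (source_pos, clean_pos) for k-grams occurring exactly once in both."""
    def unique_kgrams(s):
        counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
        return {gram for gram, n in counts.items() if n == 1}

    shared = unique_kgrams(source) & unique_kgrams(clean)
    return sorted((source.index(g), clean.index(g)) for g in shared)


def score_leaves(leaf_texts, clean):
    """Score each leaf as matched_chars / total_chars."""
    source, spans = join_leaves(leaf_texts)
    matched = [False] * len(source)
    # Stand-in for the affine-gap DP: mark source characters that fall
    # inside any matching block reported by difflib.
    matcher = SequenceMatcher(None, source, clean, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            matched[i] = True
    return [sum(matched[s:e]) / (e - s) if e > s else 0.0 for s, e in spans]


def label_leaves(scores, threshold=0.667):
    """Leaves scoring above the threshold are content, the rest boilerplate."""
    return ["content" if s > threshold else "boilerplate" for s in scores]


# Toy data invented for this example.
leaves = [
    "Home | About | Contact",
    "The quick brown fox jumps over the lazy dog.",
    "Copyright Example Corp",
]
clean = "The quick brown fox jumps over the lazy dog."

source, spans = join_leaves(leaves)
print(find_anchors(source, clean)[:3])
print(label_leaves(score_leaves(leaves, clean)))
```

Because every anchor must be unique in both strings, repeated navigation text or repeated phrases in the article never become anchors, which is what makes them safe split points.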
Dataset
Dataset: williambrach/html-boilerplate-labeled, 3,985 pages drawn from CleanEval, Dragnet, CETD, Readability, and other sources.
| Source | Train (ROUGE-1 F) | Test (ROUGE-1 F) |
|---|---|---|
| readability | 0.993 (92) | 0.997 (23) |
| scrapinghub | 0.991 (145) | 0.996 (36) |
| cetd | 0.993 (560) | 0.987 (140) |
| google-trends-2017 | 0.986 (144) | 0.995 (36) |
| cleanportaleval | 0.985 (57) | 0.971 (14) |
| cleaneval | 0.985 (590) | 0.991 (148) |
| dragnet | 0.983 (1,103) | 0.983 (276) |
| l3s-gn1 | 0.920 (497) | 0.927 (124) |
| Overall | 0.976 (3,188) | 0.978 (797) |
Sample counts in parentheses.
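The README does not state how the Overall row is computed. The check below, using only the rounded values from the table, suggests it is consistent with a sample-weighted mean of the per-source scores. Treat this as an observation about the published numbers, not a confirmed description of the evaluation code.

```python
# (score, sample_count) pairs copied from the table: (train, test) per source.
rows = {
    "readability": ((0.993, 92), (0.997, 23)),
    "scrapinghub": ((0.991, 145), (0.996, 36)),
    "cetd": ((0.993, 560), (0.987, 140)),
    "google-trends-2017": ((0.986, 144), (0.995, 36)),
    "cleanportaleval": ((0.985, 57), (0.971, 14)),
    "cleaneval": ((0.985, 590), (0.991, 148)),
    "dragnet": ((0.983, 1103), (0.983, 276)),
    "l3s-gn1": ((0.920, 497), (0.927, 124)),
}


def weighted_mean(pairs):
    """Mean of scores weighted by their sample counts."""
    total = sum(n for _, n in pairs)
    return sum(score * n for score, n in pairs) / total


train = weighted_mean([tr for tr, _ in rows.values()])
test = weighted_mean([te for _, te in rows.values()])
print(round(train, 3), round(test, 3))
```

The per-source counts also sum to 3,188 train and 797 test pages, matching the Overall row and the 3,985-page total.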
Original Work
- Paper: Vogels et al., "Web2Text: Deep Structured Boilerplate Removal" (ECIR 2018) — arxiv.org/abs/1801.02607
- Original implementation (Scala): github.com/dalab/web2text