A library to chunk HTML web pages into plain text passages.
HtmlChunker is a library to split web page content into text passages. It uses Beautiful Soup with html5lib to parse HTML into a DOM tree, and then combines text from nodes of the DOM tree into passages.
Each passage contains either the text of a single HTML node, or the combined text of a node together with its siblings and descendants when their total word count does not exceed a configurable maximum. The algorithm starts at the leaf nodes and aggregates node texts upward until the maximum number of words is reached.
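The bottom-up aggregation described above can be sketched with the standard library alone. This is an illustrative approximation, not HtmlChunker's actual implementation: it parses with html.parser instead of Beautiful Soup and html5lib, assumes every tag is explicitly closed, and counts words by whitespace splitting. The names Node, TreeBuilder, chunk, and chunk_html are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # Node or str items


class TreeBuilder(HTMLParser):
    """Builds a minimal tree. Assumption: every tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:  # skip whitespace-only text between tags
            self.stack[-1].children.append(text)


def chunk(node, max_words, greedy, exclude=frozenset()):
    """Returns (passages, whole); whole means the node collapsed into one passage."""
    if isinstance(node, str):
        return [node], True  # a single text node may exceed max_words
    if node.tag in exclude:
        return [], True
    results = [chunk(c, max_words, greedy, exclude) for c in node.children]
    texts = [p for passages, _ in results for p in passages]
    total = sum(len(t.split()) for t in texts)
    if all(whole for _, whole in results) and total <= max_words:
        return ([" ".join(texts)] if texts else []), True

    out, run, run_words = [], [], 0

    def flush():
        nonlocal run, run_words
        if run:
            out.append(" ".join(run))
        run, run_words = [], 0

    for passages, whole in results:
        words = sum(len(p.split()) for p in passages)
        if greedy and whole and run_words + words <= max_words:
            run.extend(passages)  # extend the current run of siblings
            run_words += words
        else:
            flush()
            if greedy and whole:
                run, run_words = list(passages), words  # start a new run
            else:
                out.extend(passages)  # emit the child's passages unchanged
    flush()
    return out, False


def chunk_html(markup, max_words, greedy, exclude=frozenset()):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return chunk(builder.root, max_words, greedy, exclude)[0]


sample = "<div><h1>Heading</h1><p>Text before <a>link</a> and after.</p></div>"
print(chunk_html(sample, 4, True))  # ['Heading', 'Text before link', 'and after.']
```

For the sample input used under Example Outputs, this sketch yields the same passages as the five documented examples. Behavior on other inputs, especially malformed HTML that html5lib would repair, may differ from the library.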
Usage
from google_labs_html_chunker.html_chunker import HtmlChunker
html = "<p>Paragraph 1.</p>"
chunker = HtmlChunker(
    max_words_per_aggregate_passage=200,
    greedily_aggregate_sibling_nodes=True,
)
passages = chunker.chunk(html)
Configurations
max_words_per_aggregate_passage: Maximum number of words in a passage composed of multiple HTML nodes. A passage with text from only a single HTML node may exceed this maximum.
greedily_aggregate_sibling_nodes: If True, sibling HTML nodes are greedily aggregated into passages of at most max_words_per_aggregate_passage words. If False, each sibling node is output as a separate passage whenever all siblings cannot be combined into a single passage of at most max_words_per_aggregate_passage words.
html_tags_to_exclude: Text within any of the tags in this set will not be included in the output passages. Defaults to {"noscript", "script", "style"}.
If you find your passages are too disjointed (insufficient context in a single passage for your application), consider increasing max_words_per_aggregate_passage and/or setting greedily_aggregate_sibling_nodes to True.
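When tuning these settings, it can help to look at how long the resulting passages are. The helper below is a small sketch that works on any list of passage strings. The name passage_report is hypothetical, and counting words by whitespace splitting is an assumption that may not match how the library counts words.

```python
def passage_report(passages, max_words):
    """Summarizes passage lengths, counting words by whitespace splitting."""
    counts = [len(p.split()) for p in passages]
    return {
        "num_passages": len(counts),
        "mean_words": sum(counts) / len(counts) if counts else 0.0,
        # Passages from a single HTML node may legitimately exceed the limit.
        "over_limit": [p for p, n in zip(passages, counts) if n > max_words],
    }


report = passage_report(["Heading", "Text before link", "and after."], max_words=4)
print(report)
```

A low mean word count suggests the passages are fragmented, in which case raising max_words_per_aggregate_passage or enabling greedily_aggregate_sibling_nodes may help.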
Example Outputs
For all examples, we will use the following input:
html = """
<div>
  <h1>Heading</h1>
  <p>Text before <a>link</a> and after.</p>
</div>
"""
Parsed DOM tree:
div
├── h1
│   └── "Heading"
└── p
    ├── "Text before"
    ├── a
    │   └── "link"
    └── "and after."
Example 1
chunker = HtmlChunker(
    max_words_per_aggregate_passage=4,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
All HTML nodes are output separately: the descendants of the <p> node contain 5 words, so they cannot all be combined into a passage of <= 4 words:
passages: ["Heading", "Text before", "link", "and after."]
Example 2
chunker = HtmlChunker(
    max_words_per_aggregate_passage=5,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
The children of the <p> node can now be combined in <= 5 words:
passages: ["Heading", "Text before link and after."]
Example 3
chunker = HtmlChunker(
    max_words_per_aggregate_passage=6,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
Text at the next higher level of the tree can now be included since the total number of words is <= 6:
passages: ["Heading Text before link and after."]
Example 4
chunker = HtmlChunker(
    max_words_per_aggregate_passage=4,
    greedily_aggregate_sibling_nodes=True,
)
passages = chunker.chunk(html)
The sibling children of the <p> node are greedily aggregated while the total is <= 4 words:
passages: ["Heading", "Text before link", "and after."]
Example 5
chunker = HtmlChunker(
    max_words_per_aggregate_passage=4,
    greedily_aggregate_sibling_nodes=False,
    html_tags_to_exclude={"p"},
)
passages = chunker.chunk(html)
All text within the <p> tag is excluded from the output:
passages: ["Heading"]
Hashes for google_labs_html_chunker-0.0.5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 6068202b4be91d96e1602a6715f219b344cc44bdfc3a085cc753368b28fa89ef
MD5 | 13592c16da43626de7182f139218e28f
BLAKE2b-256 | 9db463249cafc70d2d291c1a9563d1ccf8c91f58e6b21775a860a73bf97d74db
Hashes for google_labs_html_chunker-0.0.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 3e8bbe800d3793154ac187e7f94351a46f10f9a4ce2559b00dc1ce2edf8b1b05
MD5 | b3725332693ab8c49e653b9e72104544
BLAKE2b-256 | f94f8497fea42988d36e3b45d69ecd03e08fc7edcea60848c3b9f2975da8999c