A library to chunk HTML web pages into plain text passages.
HtmlChunker is a library to split web page content into text passages. It uses Beautiful Soup with html5lib to parse HTML into a DOM tree, and then combines text from nodes of the DOM tree into passages.
Each passage contains either the text of a single HTML node, or the combined text of a node, its siblings, and its descendants when the total number of words does not exceed a configurable maximum. The algorithm starts at the leaf nodes and aggregates node texts upward for as long as the combined text stays within that maximum.
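The bottom-up aggregation described above can be illustrated with a minimal sketch. This is not the library's implementation: the `Node` class and `chunk` function are hypothetical stand-ins that operate on a hand-built tree instead of a Beautiful Soup DOM, and the sketch covers only the case where sibling nodes are not greedily aggregated.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical DOM stand-in: a leaf carries text, an inner node carries children."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def chunk(node: Node, max_words: int) -> Tuple[List[str], bool]:
    """Return (passages, whole); `whole` is True when the entire subtree
    ended up in at most one passage."""
    if not node.children:
        # A single node is always one passage, even if it exceeds max_words.
        return ([node.text] if node.text else []), True

    results = [chunk(child, max_words) for child in node.children]
    passages = [p for child_passages, _ in results for p in child_passages]
    if not passages:
        return [], True

    all_whole = all(whole for _, whole in results)
    combined = " ".join(passages)
    # Merge only if every child was fully aggregated and the total fits.
    if all_whole and len(combined.split()) <= max_words:
        return [combined], True
    return passages, False


# Hand-built tree for <div><h1>Heading</h1><p>Text before <a>link</a> and after.</p></div>
tree = Node(children=[
    Node(children=[Node("Heading")]),
    Node(children=[Node("Text before"), Node(children=[Node("link")]), Node("and after.")]),
])

for limit in (4, 5, 6):
    print(limit, chunk(tree, limit)[0])
# 4 ['Heading', 'Text before', 'link', 'and after.']
# 5 ['Heading', 'Text before link and after.']
# 6 ['Heading Text before link and after.']
```

A subtree is merged into its parent's passage only when it was itself fully aggregated, which is why the word limit has to admit a whole level of the tree before text from that level is combined.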
Usage
from google_labs_html_chunker.html_chunker import HtmlChunker
html = "<p>Paragraph 1.</p>"
chunker = HtmlChunker(
    max_words_per_aggregate_passage=200,
    greedily_aggregate_sibling_nodes=True,
)
passages = chunker.chunk(html)
Configurations
max_words_per_aggregate_passage: Maximum number of words in a passage composed of multiple HTML nodes. A passage with text from only a single HTML node may exceed this maximum.
greedily_aggregate_sibling_nodes: If True, sibling HTML nodes are greedily aggregated into passages of at most max_words_per_aggregate_passage words. If False, each sibling node is output as a separate passage whenever all siblings cannot be combined into a single passage of at most max_words_per_aggregate_passage words.
If you find your passages are too disjointed (insufficient context in a single passage for your application), consider increasing max_words_per_aggregate_passage and/or setting greedily_aggregate_sibling_nodes to True.
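To make the difference between the two settings concrete, here is a small sketch that applies each policy to a flat list of sibling texts. The function name `combine_siblings` and its parameters are hypothetical, and the sketch ignores nesting, so the library's handling of real DOM trees may differ in detail.

```python
from typing import List


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def combine_siblings(texts: List[str], max_words: int, greedy: bool) -> List[str]:
    """Combine sibling texts into passages under the given word limit."""
    if not texts:
        return []
    # If everything fits, both policies produce a single passage.
    if sum(word_count(t) for t in texts) <= max_words:
        return [" ".join(texts)]
    # All-or-nothing policy: siblings that do not all fit stay separate.
    if not greedy:
        return list(texts)
    # Greedy policy: keep appending siblings until the next one would overflow.
    passages: List[str] = []
    current: List[str] = []
    current_words = 0
    for text in texts:
        n = word_count(text)
        if current and current_words + n > max_words:
            passages.append(" ".join(current))
            current, current_words = [], 0
        current.append(text)
        current_words += n
    passages.append(" ".join(current))
    return passages


siblings = ["Text before", "link", "and after."]
print(combine_siblings(siblings, max_words=4, greedy=False))
# ['Text before', 'link', 'and after.']
print(combine_siblings(siblings, max_words=4, greedy=True))
# ['Text before link', 'and after.']
```

With the greedy policy, a limit that is slightly too small still yields larger passages, which is why enabling it tends to help when passages feel disjointed.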
Example Outputs
For all examples, we will use the following input:
html = """
<div>
<h1>Heading</h1>
<p>Text before <a>link</a> and after.</p>
</div>
"""
Parsed DOM tree:
div
├── h1
│   └── "Heading"
└── p
    ├── "Text before"
    ├── a
    │   └── "link"
    └── "and after."
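For readers who want to reproduce a tree like this without installing anything, the sketch below builds a comparable structure with Python's standard `html.parser` module. This is a swapped-in technique for illustration only: HtmlChunker itself uses Beautiful Soup with html5lib, and the `TreeBuilder` class and `render` helper here are hypothetical. The sketch assumes well-formed input without void elements such as `<br>`.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Collects tags and non-blank text into nested [label, children] lists."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes every start tag has a matching end tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            self.stack[-1][1].append([f'"{text}"', []])


def render(node, depth=0):
    """Return the tree as a list of indented lines."""
    lines = ["  " * depth + node[0]]
    for child in node[1]:
        lines.extend(render(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed("<div><h1>Heading</h1><p>Text before <a>link</a> and after.</p></div>")
builder.close()
tree_lines = render(builder.root[1][0])
print("\n".join(tree_lines))
```

The printed outline has the same shape as the tree above: the `<p>` node has three children, two text nodes and the `<a>` element between them.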
Example 1
chunker = HtmlChunker(
    max_words_per_aggregate_passage=4,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
All HTML nodes are output separately because the descendants of the <p> node contain 5 words, so they cannot all be combined in <= 4 words:
passages: ["Heading", "Text before", "link", "and after."]
Example 2
chunker = HtmlChunker(
    max_words_per_aggregate_passage=5,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
The children of the <p> node can now be combined in <= 5 words:
passages: ["Heading", "Text before link and after."]
Example 3
chunker = HtmlChunker(
    max_words_per_aggregate_passage=6,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
Text at the next higher level of the tree can now be included since the total number of words is <= 6:
passages: ["Heading Text before link and after."]
Example 4
chunker = HtmlChunker(
    max_words_per_aggregate_passage=4,
    greedily_aggregate_sibling_nodes=True,
)
passages = chunker.chunk(html)
The sibling children of the <p> node are greedily aggregated while the total is <= 4 words:
passages: ["Heading", "Text before link", "and after."]
Source Distribution
Hashes for google_labs_html_chunker-0.0.3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | b0a294e9289f22735b1e5505b73993f271cf956e090fd77578818b95b2877bce
MD5 | 2168c5061d1227f824b9deb9016cf88b
BLAKE2b-256 | db774935480d3a59eec5dd13ad74f9daab56969b231e816f994064d2d305f1eb
Built Distribution
Hashes for google_labs_html_chunker-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 872fafa1fa8fcb488f35e7d0b0284ebd2a3b6db8b43bbb517698175cf8aeca0c
MD5 | b4acbbb1ad7f5fded7692e381c6ab1f9
BLAKE2b-256 | b5df2daf78ad047181b139eb4ef9a91ec43ea7963cc95398214f9f3db770da90