A library to chunk HTML web pages into plain text passages.
Project description
HtmlChunker is a library to split web page content into text passages. It uses Beautiful Soup with html5lib to parse HTML into a DOM tree, and then combines text from nodes of the DOM tree into passages.
Each passage contains either the text of a single HTML node, or the text of a node together with its siblings and descendants when their total number of words does not exceed a configurable maximum. The algorithm starts at the leaf nodes and aggregates node texts upward until the maximum number of words is reached.
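The bottom-up aggregation described above can be sketched on a toy tree using only the standard library. This is an illustrative sketch, not HtmlChunker's actual implementation: the `Node`, `Result`, `process`, and `chunk` names are hypothetical, words are counted by whitespace splitting, and details such as non-content tags are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy DOM node (hypothetical): a text leaf if `text` is set, else an element."""
    tag: str = ""
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


@dataclass
class Result:
    text: str        # all text under the node, joined with spaces
    num_words: int   # whitespace-separated word count of `text`
    passages: List[str] = field(default_factory=list)  # non-empty once the subtree was split


def process(node: Node, max_words: int, greedy: bool) -> Result:
    if node.text is not None:  # leaf: a single text node
        text = node.text.strip()
        return Result(text, len(text.split()))

    results = [process(child, max_words, greedy) for child in node.children]
    text = " ".join(r.text for r in results if r.text)
    num_words = sum(r.num_words for r in results)

    # The whole subtree fits and no descendant was split: keep aggregating upward.
    if num_words <= max_words and not any(r.passages for r in results):
        return Result(text, num_words)

    passages: List[str] = []
    pending: List[str] = []
    pending_words = 0

    def flush() -> None:
        """Emit the siblings collected so far as one passage."""
        nonlocal pending, pending_words
        if pending:
            passages.append(" ".join(pending))
        pending, pending_words = [], 0

    for r in results:
        if r.passages:  # this child already emitted its own passages
            flush()
            passages.extend(r.passages)
        elif not r.text:  # nothing to emit for an empty child
            continue
        elif greedy and pending_words + r.num_words <= max_words:
            pending.append(r.text)
            pending_words += r.num_words
        else:
            flush()
            pending, pending_words = [r.text], r.num_words
            if not greedy:  # without greedy aggregation, each sibling stands alone
                flush()
    flush()
    return Result(text, num_words, passages)


def chunk(root: Node, max_words: int, greedy: bool) -> List[str]:
    result = process(root, max_words, greedy)
    return result.passages or ([result.text] if result.text else [])


tree = Node("div", children=[
    Node("h1", children=[Node(text="Heading")]),
    Node("p", children=[
        Node(text="Text before"),
        Node("a", children=[Node(text="link")]),
        Node(text="and after."),
    ]),
])
print(chunk(tree, max_words=4, greedy=True))
# ['Heading', 'Text before link', 'and after.']
```

Note that a single text node is never split, so a passage built from one node can exceed `max_words` in this sketch as well.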
Usage
from google_labs_html_chunker.html_chunker import HtmlChunker
html = "<p>Paragraph 1.</p>"
chunker = HtmlChunker(
    max_words_per_aggregate_passage=200,
    greedily_aggregate_sibling_nodes=True,
)
passages = chunker.chunk(html)
Configurations
max_words_per_aggregate_passage: Maximum number of words in a passage comprised of multiple HTML nodes. A passage with text from only a single HTML node may exceed this max.
greedily_aggregate_sibling_nodes: If True, sibling HTML nodes are greedily aggregated into passages of at most max_words_per_aggregate_passage words. If False, each sibling node is output as a separate passage whenever the siblings cannot all be combined into a single passage of at most max_words_per_aggregate_passage words.
If you find your passages are too disjointed (insufficient context in a single passage for your application), consider increasing max_words_per_aggregate_passage and/or setting greedily_aggregate_sibling_nodes to True.
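One rough way to judge whether passages are too disjointed is to look at their lengths. The helper below is a hypothetical sketch, not part of the library: `passage_length_summary` counts whitespace-separated words in any list of passage strings, here using the passage lists from Example 1 and Example 4 later in this document.

```python
from statistics import median


def passage_length_summary(passages):
    """Summarise passage lengths in whitespace-separated words."""
    counts = [len(p.split()) for p in passages]
    if not counts:
        return {"passages": 0, "min_words": 0, "median_words": 0, "max_words": 0}
    return {
        "passages": len(counts),
        "min_words": min(counts),
        "median_words": median(counts),
        "max_words": max(counts),
    }


# Passage lists copied from Example 1 (no greedy aggregation) and Example 4 (greedy).
separate = ["Heading", "Text before", "link", "and after."]
greedy = ["Heading", "Text before link", "and after."]

print(passage_length_summary(separate))
print(passage_length_summary(greedy))
```

Many very short passages relative to the configured maximum suggest that a larger max_words_per_aggregate_passage or greedy sibling aggregation is worth trying.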
Example Outputs
For all examples, we will use the following input:
html = """
<div>
<h1>Heading</h1>
<p>Text before <a>link</a> and after.</p>
</div>
"""
Parsed DOM tree:
div
├── h1
│ └── "Heading"
└── p
├── "Text before"
├── a
│ └── "link"
└── "and after."
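To see where the word counts in the examples below come from, the tree can be outlined with a per-node word count. The sketch below is an approximation under stated assumptions: it uses the standard library's `html.parser` as a stand-in for the Beautiful Soup and html5lib parsing that HtmlChunker performs, skips whitespace-only text, and does not handle void elements such as `<br>`. The `OutlineParser` and `outline` names are hypothetical.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collects an indented outline of tags and text nodes with word counts."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            words = len(text.split())
            self.lines.append("  " * self.depth + f'"{text}" (words: {words})')


def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.lines


html = """
<div>
<h1>Heading</h1>
<p>Text before <a>link</a> and after.</p>
</div>
"""
print("\n".join(outline(html)))
```

The text under `<p>` adds up to 5 words and the whole `<div>` to 6, which are the thresholds the examples below vary around.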
Example 1
chunker = HtmlChunker(
    max_words_per_aggregate_passage=4,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
All HTML nodes are output separately because there are 5 words in the descendants of the <p> node, so they cannot all be combined in <= 4 words:
passages: ["Heading", "Text before", "link", "and after."]
Example 2
chunker = HtmlChunker(
    max_words_per_aggregate_passage=5,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
The children of the <p> node can now be combined in <= 5 words:
passages: ["Heading", "Text before link and after."]
Example 3
chunker = HtmlChunker(
    max_words_per_aggregate_passage=6,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
Text at the next higher level of the tree can now be included since the total number of words is <= 6:
passages: ["Heading Text before link and after."]
Example 4
chunker = HtmlChunker(
    max_words_per_aggregate_passage=4,
    greedily_aggregate_sibling_nodes=True,
)
passages = chunker.chunk(html)
The sibling children of the <p> node are greedily aggregated while the total is <= 4 words:
passages: ["Heading", "Text before link", "and after."]
Download files
Hashes for google_labs_html_chunker-0.0.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | ab7f1dca08ef5a328ee9d6d3cfb98f6ffad102b4bbc9ca5122af55eb4bc952bf
MD5 | 5d0d0ef8a8cc6fed93713583997a684e
BLAKE2b-256 | 272f8fa19ecf16ea1acab20f5af184acbe3303c893f4a41aff3f659e8a99fab7
Hashes for google_labs_html_chunker-0.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 94bba35015de72eb4fc778b43a720d81c2b6c0c7bc7848b5aefc79ccbd95172a
MD5 | 87334eb2138fe3fac85067a44ca1f360
BLAKE2b-256 | 5a22693e35bd284cc6dcea1c12a5e2cb68d4c2c43413cba94551caa0995965c1