A library to chunk HTML web pages into plain text passages.
Project description
HtmlChunker is a library to split web page content into text passages. It uses Beautiful Soup with html5lib to parse HTML into a DOM tree, and then combines text from nodes of the DOM tree into passages.
Each passage contains either the text of a single HTML node, or the text of a node together with its siblings and descendants when their combined word count does not exceed a configurable maximum. The algorithm starts at the leaf nodes and aggregates node texts upward until adding more text would exceed that maximum.
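The bottom-up aggregation described above can be sketched on a toy tree. This is a simplified illustration, not the library's implementation: the `Node` class, the `chunk` function, and the "whole subtree" flag are all hypothetical names introduced here, and the merging rules are assumptions chosen so that the sketch reproduces the four documented examples below.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node: a tag with children, or a text leaf."""
    tag: str = ""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def count_words(texts: List[str]) -> int:
    return sum(len(t.split()) for t in texts)


def chunk(node: Node, max_words: int, greedy: bool) -> Tuple[List[str], bool]:
    """Return (passages, whole); whole means the subtree fit into one passage."""
    if not node.children:  # text leaf: always its own passage, even if long
        return ([node.text] if node.text.strip() else []), True

    results = [chunk(child, max_words, greedy) for child in node.children]
    texts = [p for passages, _ in results for p in passages]

    # Merge the entire subtree when every child is intact and the total fits.
    if all(whole for _, whole in results) and count_words(texts) <= max_words:
        return ([" ".join(texts)] if texts else []), True

    passages: List[str] = []
    pending: List[str] = []  # texts of consecutive intact siblings being merged
    for child_passages, whole in results:
        if greedy and whole and count_words(pending + child_passages) <= max_words:
            pending += child_passages
            continue
        if pending:  # close the current greedy group
            passages.append(" ".join(pending))
            pending = []
        if greedy and whole:
            pending = list(child_passages)  # start a new greedy group
        else:
            passages.extend(child_passages)
    if pending:
        passages.append(" ".join(pending))
    return passages, False


# The example tree used in "Example Outputs", built by hand.
example = Node("div", children=[
    Node("h1", children=[Node(text="Heading")]),
    Node("p", children=[
        Node(text="Text before"),
        Node("a", children=[Node(text="link")]),
        Node(text="and after."),
    ]),
])
print(chunk(example, max_words=4, greedy=True)[0])
```

The sketch tracks whether each subtree was merged into one passage, so a parent only absorbs children that are still intact. The real library may handle tags, whitespace, and edge cases differently.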
Usage
from google_labs_html_chunker.html_chunker import HtmlChunker
html = "<p>Paragraph 1.</p>"
chunker = HtmlChunker(
    max_words_per_aggregate_passage=200,
    greedily_aggregate_sibling_nodes=True,
)
passages = chunker.chunk(html)
Configurations
max_words_per_aggregate_passage: Maximum number of words in a passage composed of multiple HTML nodes. A passage with text from only a single HTML node may exceed this maximum.
greedily_aggregate_sibling_nodes: If True, sibling HTML nodes are greedily aggregated into passages of at most max_words_per_aggregate_passage words. If False, each sibling node is output as a separate passage whenever all siblings cannot be combined into a single passage of at most max_words_per_aggregate_passage words.
If you find your passages are too disjointed (insufficient context in a single passage for your application), consider increasing max_words_per_aggregate_passage and/or setting greedily_aggregate_sibling_nodes to True.
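One rough way to judge whether passages are "too disjointed" is to look at their word counts. The helper below is a hypothetical sketch, not part of the library: `passage_stats` and its `short_threshold` parameter are names assumed here, and it works on any list of strings such as the output of `chunker.chunk(html)`.

```python
from statistics import mean


def passage_stats(passages, short_threshold=5):
    """Summarize passage lengths; short_threshold is an assumed cutoff in words."""
    counts = [len(p.split()) for p in passages]
    if not counts:
        return {"passages": 0, "mean_words": 0.0, "short_fraction": 0.0}
    short = sum(1 for c in counts if c < short_threshold)
    return {
        "passages": len(counts),
        "mean_words": mean(counts),
        "short_fraction": short / len(counts),
    }


stats = passage_stats(["Heading", "Text before", "link", "and after."], short_threshold=2)
print(stats)
```

A high share of very short passages may suggest raising max_words_per_aggregate_passage or enabling greedily_aggregate_sibling_nodes, though the right cutoff depends on the application.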
Example Outputs
For all examples, we will use the following input:
html = """
<div>
<h1>Heading</h1>
<p>Text before <a>link</a> and after.</p>
</div>
"""
Parsed DOM tree:
div
├── h1
│   └── "Heading"
└── p
    ├── "Text before"
    ├── a
    │   └── "link"
    └── "and after."
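To inspect a tree like this for your own input, a small outline printer can help. The sketch below deliberately uses Python's standard-library `html.parser` rather than Beautiful Soup with html5lib, so its result can differ from what the library sees (for example, html5lib adds `html`, `head`, and `body` elements). The `OutlineParser` class and `outline` function are hypothetical names, and the sketch assumes every opened tag is explicitly closed.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Records (depth, label) pairs for tags and non-empty text nodes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-formed input where each start tag has an end tag.
        self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace, skip blank text
        if text:
            self.outline.append((self.depth, f'"{text}"'))


def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


sample = """
<div>
<h1>Heading</h1>
<p>Text before <a>link</a> and after.</p>
</div>
"""
for depth, label in outline(sample):
    print("    " * depth + label)
```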
Example 1
chunker = HtmlChunker(
    max_words_per_aggregate_passage=4,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
All HTML nodes are output separately because the descendants of the <p> node contain 5 words, so they cannot all be combined within the 4-word limit:
passages: ["Heading", "Text before", "link", "and after."]
Example 2
chunker = HtmlChunker(
    max_words_per_aggregate_passage=5,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
The children of the <p> node can now be combined in <= 5 words:
passages: ["Heading", "Text before link and after."]
Example 3
chunker = HtmlChunker(
    max_words_per_aggregate_passage=6,
    greedily_aggregate_sibling_nodes=False,
)
passages = chunker.chunk(html)
Text at the next higher level of the tree can now be included since the total number of words is <= 6:
passages: ["Heading Text before link and after."]
Example 4
chunker = HtmlChunker(
    max_words_per_aggregate_passage=4,
    greedily_aggregate_sibling_nodes=True,
)
passages = chunker.chunk(html)
The sibling children of the <p> node are greedily aggregated while the total is <= 4 words:
passages: ["Heading", "Text before link", "and after."]
Hashes for google_labs_html_chunker-0.0.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | aa9ed47025086b4045b8bb7b8ee4d3d0505b743b86f9b7227064cf28a6f02c41
MD5 | ab10e3dc29fd9e9263c4412334ddb689
BLAKE2b-256 | 6a08039cd64da04d48c5bf5833ca448830e0934fd64d701da5b946b015cda284
Hashes for google_labs_html_chunker-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 97af9a12918961eceeaadb4f50bb81c354b747867d4d9a06da9096e1d3e84639
MD5 | 1515b66511b8b4aa6e8d11d13f840383
BLAKE2b-256 | a0300d72c4f9d9231066d1b4ef75c4be9de1813ab96df29a5a83b9037d398ccb