Skip to main content

A small toolkit for HTML cleaning and pruning for RAG systems.

Project description

🤖🔍 HtmlRAG

License: MIT

A toolkit to apply HtmlRAG in your own RAG systems.

📦 Installation

pip install -e .

📖 User Guide

🧹 HTML Cleaning

from htmlrag import clean_html

question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<title>When was the bellagio in las vegas built?</title>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""

simplified_html = clean_html(html)
print(simplified_html)

# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>

🌲 Build Block Tree

from htmlrag import build_block_tree

block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=10)
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")

# Block Content:  <title>When was the bellagio in las vegas built?</title>
# Block Path:  ['html', 'title']
# Is Leaf:  True
# 
# Block Content:  <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path:  ['html', 'div']
# Is Leaf:  True
# 
# Block Content:  <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path:  ['html', 'p']
# Is Leaf:  True

✂️ Prune HTML Blocks with Embedding Model

from htmlrag import EmbedHTMLPruner

embed_html_pruner = EmbedHTMLPruner(embed_model="bm25")
block_rankings = embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)

# [0, 2, 1]

from transformers import AutoTokenizer

chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

max_context_window = 60
pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer,
                                           max_context_window)
print(pruned_html)

# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>

✂️ Prune HTML Blocks with Generative Model

from htmlrag import GenHTMLPruner

ckpt_path = "zstanjj/HTML-Pruner-Llama-1B"
gen_embed_pruner = GenHTMLPruner(gen_model=ckpt_path, max_node_words=10)
block_rankings = gen_embed_pruner.calculate_block_rankings(question, pruned_html)
print(block_rankings)

# [1, 0]

max_context_window = 32
pruned_html = gen_embed_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, max_context_window)
print(pruned_html)

# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

htmlrag-0.0.1.tar.gz (9.4 kB view details)

Uploaded Source

Built Distribution

htmlrag-0.0.1-py3-none-any.whl (10.6 kB view details)

Uploaded Python 3

File details

Details for the file htmlrag-0.0.1.tar.gz.

File metadata

  • Download URL: htmlrag-0.0.1.tar.gz
  • Upload date:
  • Size: 9.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for htmlrag-0.0.1.tar.gz
Algorithm Hash digest
SHA256 acf8c69cd21b1c81f96b9e4565d7b16083fbc1c44f6a570b45ee5acc4b6f3d61
MD5 d2e03ea14ec00330c0ac4a216c03507b
BLAKE2b-256 b6ea6bf92a46981b9712d26a506c06a94ea5c6ef70e12a7b488feb893c96a840

See more details on using hashes here.

File details

Details for the file htmlrag-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: htmlrag-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 10.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for htmlrag-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 7bac07bca7c52bc20c3cee1c07d5aa8e0e080c5971302b46a6400cc2f7b79264
MD5 dae1b5cac0460b68b9dadf8288ed883e
BLAKE2b-256 19fef9d0116fd552320a2597aa821099dfaf39b326dc7c4effa0548d3d6c2b64

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page