Skip to main content

A small toolkit for HTML cleaning and pruning for RAG systems.

Project description

🤖🔍 HtmlRAG

License Static Badge

A toolkit to apply HtmlRAG in your own RAG systems.

📦 Installation

Install the package using pip:

pip install htmlrag

Or install the package from source:

pip install -e .

📖 User Guide

🧹 HTML Cleaning

from htmlrag import clean_html

question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<title>When was the bellagio in las vegas built?</title>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""

simplified_html = clean_html(html)
print(simplified_html)

# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>

🌲 Build Block Tree

from htmlrag import build_block_tree

block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=10)
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")

# Block Content:  <title>When was the bellagio in las vegas built?</title>
# Block Path:  ['html', 'title']
# Is Leaf:  True
# 
# Block Content:  <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path:  ['html', 'div']
# Is Leaf:  True
# 
# Block Content:  <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path:  ['html', 'p']
# Is Leaf:  True

✂️ Prune HTML Blocks with Embedding Model

from htmlrag import EmbedHTMLPruner

embed_model="/train_data_load/huggingface/tjj_hf/bge-large-en/"
query_instruction_for_retrieval = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=True, query_instruction_for_retrieval = query_instruction_for_retrieval)
# alternatively you can init a remote TEI model, refer to https://github.com/huggingface/text-embeddings-inference.
# tei_endpoint="http://YOUR_TEI_ENDPOINT"
# embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=False, query_instruction_for_retrieval = query_instruction_for_retrieval, endpoint=tei_endpoint)
block_rankings=embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)

# [0, 2, 1]

#. alternatively you can use bm25 to rank the blocks
from htmlrag import BM25HTMLPruner
bm25_html_pruner = BM25HTMLPruner()
block_rankings=bm25_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)

# [0, 2, 1]

from transformers import AutoTokenizer

chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

max_context_window = 60
pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer, max_context_window)
print(pruned_html)

# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>

✂️ Prune HTML Blocks with Generative Model

from htmlrag import GenHTMLPruner

ckpt_path = "zstanjj/HTML-Pruner-Llama-1B"
gen_embed_pruner = GenHTMLPruner(gen_model=ckpt_path, max_node_words=10)
block_rankings = gen_embed_pruner.calculate_block_rankings(question, pruned_html)
print(block_rankings)

# [1, 0]

max_context_window = 32
pruned_html = gen_embed_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, max_context_window)
print(pruned_html)

# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

htmlrag-0.0.3.tar.gz (11.6 kB view details)

Uploaded Source

Built Distribution

htmlrag-0.0.3-py3-none-any.whl (11.2 kB view details)

Uploaded Python 3

File details

Details for the file htmlrag-0.0.3.tar.gz.

File metadata

  • Download URL: htmlrag-0.0.3.tar.gz
  • Upload date:
  • Size: 11.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for htmlrag-0.0.3.tar.gz
Algorithm Hash digest
SHA256 ce8867f0336217ab453c24f8b41c4f27fc80fed406e50d4bbe4acd5b8e4cd2e7
MD5 665e28a79212e37471af0e67a71f991e
BLAKE2b-256 959fdf74ab7693673c0761b78ec3c127747e0be8ee87a44e35455e6e57220de4

See more details on using hashes here.

File details

Details for the file htmlrag-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: htmlrag-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 11.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for htmlrag-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 e7e8c6d8cbcab893d80c601a337d5ee931218f69087f352e278b4edb22cfc006
MD5 a2e408a9afded367b5e228076003c9e1
BLAKE2b-256 375da307a14b3fda2f9495cb2af91bf122149e340e8b89f6f7db921473d00927

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page