Skip to main content

A small toolkit for HTML cleaning and pruning for RAG systems.

Project description

🤖🔍 HtmlRAG

License Static Badge

A toolkit to apply HtmlRAG in your own RAG systems.

📦 Installation

Install the package using pip:

pip install htmlrag

Or install the package from source:

pip install -e .

📖 User Guide

🧹 HTML Cleaning

from htmlrag import clean_html

question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<title>When was the bellagio in las vegas built?</title>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""

simplified_html = clean_html(html)
print(simplified_html)

# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>

🌲 Build Block Tree

from htmlrag import build_block_tree

block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=10)
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")

# Block Content:  <title>When was the bellagio in las vegas built?</title>
# Block Path:  ['html', 'title']
# Is Leaf:  True
# 
# Block Content:  <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path:  ['html', 'div']
# Is Leaf:  True
# 
# Block Content:  <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path:  ['html', 'p']
# Is Leaf:  True

✂️ Prune HTML Blocks with Embedding Model

from htmlrag import EmbedHTMLPruner

embed_model="/train_data_load/huggingface/tjj_hf/bge-large-en/"
query_instruction_for_retrieval = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=True, query_instruction_for_retrieval = query_instruction_for_retrieval)
# alternatively you can init a remote TEI model, refer to https://github.com/huggingface/text-embeddings-inference.
# tei_endpoint="http://YOUR_TEI_ENDPOINT"
# embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=False, query_instruction_for_retrieval = query_instruction_for_retrieval, endpoint=tei_endpoint)
block_rankings=embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)

# [0, 2, 1]

#. alternatively you can use bm25 to rank the blocks
from htmlrag import BM25HTMLPruner
bm25_html_pruner = BM25HTMLPruner()
block_rankings=bm25_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)

# [0, 2, 1]

from transformers import AutoTokenizer

chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

max_context_window = 60
pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer, max_context_window)
print(pruned_html)

# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>

✂️ Prune HTML Blocks with Generative Model

from htmlrag import GenHTMLPruner

ckpt_path = "zstanjj/HTML-Pruner-Llama-1B"
gen_embed_pruner = GenHTMLPruner(gen_model=ckpt_path, max_node_words=10)
block_rankings = gen_embed_pruner.calculate_block_rankings(question, pruned_html)
print(block_rankings)

# [1, 0]

max_context_window = 32
pruned_html = gen_embed_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, max_context_window)
print(pruned_html)

# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

htmlrag-0.0.2.tar.gz (11.6 kB view details)

Uploaded Source

Built Distribution

htmlrag-0.0.2-py3-none-any.whl (11.2 kB view details)

Uploaded Python 3

File details

Details for the file htmlrag-0.0.2.tar.gz.

File metadata

  • Download URL: htmlrag-0.0.2.tar.gz
  • Upload date:
  • Size: 11.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for htmlrag-0.0.2.tar.gz
Algorithm Hash digest
SHA256 1a9e41a8f24812675a4cb8beaccde12c96657553358591786566e2ed5c1a5e23
MD5 4becc0dede9f09b7b85f5eff025d61b2
BLAKE2b-256 185fde2910f4cecfda913401e15495ea3abe817eb76eee9ee66b5c074f4123d2

See more details on using hashes here.

File details

Details for the file htmlrag-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: htmlrag-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 11.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for htmlrag-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 542bf1b371010d29e9298df2f072aed2bb814ec7c40e7b232c79fd8fae616334
MD5 be4248c0ebd5da1bb3bec47f7a976721
BLAKE2b-256 08f3ff1bd393a6eab4349fcfc1450f5d60f04ae3b172adeafbacea71bfe5f577

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page