A small toolkit for HTML cleaning and pruning for RAG systems.
🤖🔍 HtmlRAG
A toolkit to apply HtmlRAG in your own RAG systems.
📦 Installation
Install the package using pip:
pip install htmlrag
Or install the package from source, running the command from the root of a local copy of the repository:
pip install -e .
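To confirm from Python that the installation succeeded, a minimal check along the following lines may help. It relies only on the standard library's importlib.metadata; the helper name installed_version is illustrative and is not part of htmlrag.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if htmlrag is installed in this environment, otherwise None.
print(installed_version("htmlrag"))
```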
📖 User Guide
🧹 HTML Cleaning
from htmlrag import clean_html
question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<title>When was the bellagio in las vegas built?</title>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""
simplified_html = clean_html(html)
print(simplified_html)
# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>
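To make the effect of cleaning more concrete, here is a standard-library sketch of part of what the example above shows: dropping comments, script and style bodies, and tag attributes. It is not how clean_html is implemented, and it does not reproduce the removal of empty elements or the merging of nested wrappers seen in the output above. The names SketchCleaner and sketch_clean are hypothetical.

```python
from html import escape
from html.parser import HTMLParser
from typing import List


class SketchCleaner(HTMLParser):
    """Illustrative cleaner: drops comments, <script>/<style> bodies, and attributes."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attributes are intentionally discarded

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep non-blank text outside skipped elements, re-escaping &, < and >.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(escape(data, quote=False))

    # handle_comment is not overridden, so comments are silently dropped.


def sketch_clean(html_text: str) -> str:
    cleaner = SketchCleaner()
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.parts)


sample = '<p class="class0">It was built in 1998.</p><!-- note --><script>document.write("x");</script>'
print(sketch_clean(sample))
# <p>It was built in 1998.</p>
```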
🌲 Build Block Tree
from htmlrag import build_block_tree
block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=10)
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")
# Block Content: <title>When was the bellagio in las vegas built?</title>
# Block Path: ['html', 'title']
# Is Leaf: True
#
# Block Content: <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path: ['html', 'div']
# Is Leaf: True
#
# Block Content: <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path: ['html', 'p']
# Is Leaf: True
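When choosing max_node_words, it can help to see how many words each block holds. The sketch below assumes only the layout shown above, where each block exposes its content, path, and leaf flag at indices 0, 1, and 2. The helper summarize_blocks and the hand-written example_tree are hypothetical stand-ins and are not part of htmlrag.

```python
import re
from typing import Any, List, NamedTuple, Sequence


class BlockSummary(NamedTuple):
    path: str
    is_leaf: bool
    word_count: int


def summarize_blocks(block_tree: Sequence[Sequence[Any]]) -> List[BlockSummary]:
    """Summarize each block as (joined path, leaf flag, number of words in its text)."""
    summaries = []
    for block in block_tree:
        content, path, is_leaf = block[0], block[1], block[2]
        # Assumption: str(content) yields the block's HTML, as the printed output suggests.
        text = re.sub(r"<[^>]+>", " ", str(content))
        summaries.append(BlockSummary("/".join(path), bool(is_leaf), len(text.split())))
    return summaries


# Hand-written stand-in that mirrors the block layout printed above.
example_tree = [
    ("<title>When was the bellagio in las vegas built?</title>", ["html", "title"], True),
    ("<div><p>Some other text</p><p>Some other text</p></div>", ["html", "div"], True),
]
for summary in summarize_blocks(example_tree):
    print(summary)
```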
✂️ Prune HTML Blocks with Embedding Model
from htmlrag import EmbedHTMLPruner
# Local path to the embedding model; replace it with the location of your own copy.
embed_model = "/train_data_load/huggingface/tjj_hf/bge-large-en/"
query_instruction_for_retrieval = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_html_pruner = EmbedHTMLPruner(
    embed_model=embed_model,
    local_inference=True,
    query_instruction_for_retrieval=query_instruction_for_retrieval,
)
# Alternatively, initialize the pruner against a remote TEI (text-embeddings-inference) endpoint;
# see https://github.com/huggingface/text-embeddings-inference.
# tei_endpoint = "http://YOUR_TEI_ENDPOINT"
# embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=False, query_instruction_for_retrieval=query_instruction_for_retrieval, endpoint=tei_endpoint)
block_rankings = embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [0, 2, 1]
# Alternatively, you can use BM25 to rank the blocks.
from htmlrag import BM25HTMLPruner
bm25_html_pruner = BM25HTMLPruner()
block_rankings = bm25_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [0, 2, 1]
from transformers import AutoTokenizer
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
max_context_window = 60
pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer, max_context_window)
print(pruned_html)
# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>
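The rank-then-prune flow above can be approximated without any model. The sketch below is not the code behind BM25HTMLPruner or prune_HTML. It makes three assumptions: block_rankings lists block indices from most to least relevant, pruning drops the lowest-ranked blocks until the rest fits the budget, and a simple word count stands in for the chat tokenizer. The function names are hypothetical, and the blocks are given as plain text rather than HTML.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def bm25_rank(query: str, blocks: List[str], k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return block indices ordered from highest to lowest Okapi BM25 score."""
    docs = [tokenize(block) for block in blocks]
    avg_len = sum(len(doc) for doc in docs) / max(len(docs), 1)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)


def prune_to_budget(blocks: List[str], rankings: List[int], max_tokens: int) -> List[str]:
    """Drop the lowest-ranked blocks until the rest fits, keeping document order."""
    kept = set(range(len(blocks)))
    total = sum(len(tokenize(block)) for block in blocks)
    for index in reversed(rankings):
        if total <= max_tokens:
            break
        kept.discard(index)
        total -= len(tokenize(blocks[index]))
    return [blocks[i] for i in sorted(kept)]


question = "When was the bellagio in las vegas built?"
blocks = [
    "When was the bellagio in las vegas built?",
    "Some other text Some other text",
    "The Bellagio is a luxury hotel and casino located on the Las Vegas Strip "
    "in Paradise, Nevada. It was built in 1998.",
]
rankings = bm25_rank(question, blocks)
print(rankings)
# [0, 2, 1]
print(prune_to_budget(blocks, rankings, max_tokens=30))
```

With a budget of 30 words, only the unrelated "Some other text" block is dropped, which mirrors the pruned output shown above.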
✂️ Prune HTML Blocks with Generative Model
from htmlrag import GenHTMLPruner
ckpt_path = "zstanjj/HTML-Pruner-Llama-1B"
gen_embed_pruner = GenHTMLPruner(gen_model=ckpt_path, max_node_words=10)
block_rankings = gen_embed_pruner.calculate_block_rankings(question, pruned_html)
print(block_rankings)
# [1, 0]
max_context_window = 32
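# Rebuild the block tree from the already-pruned HTML so that it matches the two-block rankings above.
block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=10)
pruned_html = gen_embed_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, max_context_window)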
print(pruned_html)
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
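The two pruning stages in this guide can be wrapped in one function. The sketch below only chains the calls already shown (clean_html, build_block_tree, calculate_block_rankings, prune_HTML) and takes them as arguments, so it can be tried with stand-ins. The function name two_stage_prune and its parameter names are hypothetical. Rebuilding the block tree between the two stages is an assumption, not something this guide confirms.

```python
from typing import Any, Callable, Tuple


def two_stage_prune(
    question: str,
    html: str,
    *,
    clean: Callable[[str], str],
    build_tree: Callable[..., Tuple[Any, str]],
    coarse_pruner: Any,
    fine_pruner: Any,
    tokenizer: Any,
    coarse_window: int,
    fine_window: int,
    max_node_words: int = 10,
) -> str:
    """Clean the HTML, prune it coarsely, then prune the result with a second pruner."""
    simplified = clean(html)
    tree, simplified = build_tree(simplified, max_node_words=max_node_words)
    rankings = coarse_pruner.calculate_block_rankings(question, simplified, tree)
    coarse_html = coarse_pruner.prune_HTML(simplified, tree, rankings, tokenizer, coarse_window)

    # Assumption: the tree is rebuilt so that it describes the coarsely pruned HTML.
    tree, coarse_html = build_tree(coarse_html, max_node_words=max_node_words)
    rankings = fine_pruner.calculate_block_rankings(question, coarse_html)
    return fine_pruner.prune_HTML(coarse_html, tree, rankings, tokenizer, fine_window)


# With the objects created in this guide, a call might look like:
# two_stage_prune(question, html, clean=clean_html, build_tree=build_block_tree,
#                 coarse_pruner=embed_html_pruner, fine_pruner=gen_embed_pruner,
#                 tokenizer=chat_tokenizer, coarse_window=60, fine_window=32)
```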