A small toolkit for HTML cleaning and pruning for RAG systems.
🤖🔍 HtmlRAG
A toolkit to apply HtmlRAG in your own RAG systems.
📦 Installation
Install the package using pip:
pip install htmlrag
Or install the package from source:
pip install -e .
📖 User Guide
🧹 HTML Cleaning
from htmlrag import clean_html
question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<title>When was the bellagio in las vegas built?</title>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""
simplified_html = clean_html(html)
print(simplified_html)
# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>
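To make the cleaning step concrete, here is a minimal standard-library sketch of the kind of simplification shown above: it drops `<script>`/`<style>` content, comments, attributes, and elements that end up empty. This is not how `clean_html` is implemented; `SketchCleaner` and `sketch_clean_html` are hypothetical names, and the sketch assumes well-formed input (unclosed tags are simply lost).

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}                          # content removed entirely
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}    # carry no text, dropped


class SketchCleaner(HTMLParser):
    """Illustrative cleaner only; NOT htmlrag's implementation."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Each stack entry is (tag name, child fragments); the first is a root sentinel.
        self.stack = [("", [])]
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.stack.append((tag, []))  # attributes are discarded here

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and len(self.stack) > 1 and self.stack[-1][0] == tag:
            _, children = self.stack.pop()
            inner = "".join(children).strip()
            if inner:  # elements with no remaining content are dropped
                self.stack[-1][1].append(f"<{tag}>{inner}</{tag}>")

    def handle_data(self, data):
        # Whitespace-only text is ignored; comments are ignored by HTMLParser's default.
        if self.skip_depth == 0 and data.strip():
            self.stack[-1][1].append(data)


def sketch_clean_html(html: str) -> str:
    parser = SketchCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.stack[0][1])


print(sketch_clean_html('<p class="class0">Built in 1998.</p><p class="class1"></p><!-- note -->'))
```

For real pages, prefer `clean_html`; the sketch only illustrates why the empty `<p class="class1">`, the comment, and the script disappear from the output above.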
🌲 Build Block Tree
from htmlrag import build_block_tree
block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=10)
for block in block_tree:
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")
# Block Content: <title>When was the bellagio in las vegas built?</title>
# Block Path: ['html', 'title']
# Is Leaf: True
#
# Block Content: <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path: ['html', 'div']
# Is Leaf: True
#
# Block Content: <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path: ['html', 'p']
# Is Leaf: True
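Judging from the loop above, each block can be read as a `(content, path, is_leaf)` triple. The sketch below works on sample data copied from the printed output rather than on a live `build_block_tree` call, so it runs without the package installed; `summarize_blocks` is a hypothetical helper, not part of htmlrag.

```python
# Sample data copied from the printed output above. The tuple layout
# (content, path, is_leaf) is inferred from how the loop indexes each block.
sample_blocks = [
    ("<title>When was the bellagio in las vegas built?</title>", ["html", "title"], True),
    ("<div>\n<p>Some other text</p>\n<p>Some other text</p>\n</div>", ["html", "div"], True),
    (
        "<p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip "
        "in Paradise, Nevada. It was built in 1998.</p>",
        ["html", "p"],
        True,
    ),
]


def summarize_blocks(blocks):
    """Return one short line per block: index, slash-joined path, leaf marker."""
    lines = []
    for index, (_content, path, is_leaf) in enumerate(blocks):
        kind = "leaf" if is_leaf else "inner"
        lines.append(f"{index}: {'/'.join(path)} ({kind})")
    return lines


for line in summarize_blocks(sample_blocks):
    print(line)
```

The block indices printed here are the same positions that the rankings in the following sections refer to.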
✂️ Prune HTML Blocks with Embedding Model
from htmlrag import EmbedHTMLPruner
# Path to a local copy of the embedding model; replace it with your own path.
embed_model = "/train_data_load/huggingface/tjj_hf/bge-large-en/"
query_instruction_for_retrieval = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=True, query_instruction_for_retrieval=query_instruction_for_retrieval)
# Alternatively, you can initialize a remote TEI model; see https://github.com/huggingface/text-embeddings-inference.
# tei_endpoint="http://YOUR_TEI_ENDPOINT"
# embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=False, query_instruction_for_retrieval=query_instruction_for_retrieval, endpoint=tei_endpoint)
block_rankings=embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [0, 2, 1]
# Alternatively, you can use BM25 to rank the blocks.
from htmlrag import BM25HTMLPruner
bm25_html_pruner = BM25HTMLPruner()
block_rankings=bm25_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [0, 2, 1]
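For intuition about what a BM25 ranking of blocks looks like, here is a self-contained sketch of standard BM25 scoring over the plain text of the three blocks. It is not `BM25HTMLPruner`'s implementation: the tokenizer, the `k1`/`b` defaults, and the convention of returning block indices from most to least relevant are all assumptions made for illustration, and `bm25_rank` is a hypothetical name.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on word characters (an assumed, simplistic tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return document indices sorted from highest to lowest BM25 score."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


block_texts = [
    "When was the bellagio in las vegas built?",
    "Some other text Some other text",
    "The Bellagio is a luxury hotel and casino located on the Las Vegas Strip "
    "in Paradise, Nevada. It was built in 1998.",
]
print(bm25_rank("When was the bellagio in las vegas built?", block_texts))
```

On this input the title block shares every query term, the Bellagio paragraph shares most of them, and the "Some other text" block shares none, so the sketch orders the blocks title, paragraph, div.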
from transformers import AutoTokenizer
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
max_context_window = 60
pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer, max_context_window)
print(pruned_html)
# <html>
# <title>When was the bellagio in las vegas built?</title>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>
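The effect of `max_context_window` can be pictured as a budgeted selection: blocks are taken in ranking order until the budget is used up, and the survivors are emitted in document order. The sketch below is a simplified stand-in, not the algorithm behind `prune_HTML`. It counts whitespace-separated words instead of using `chat_tokenizer`, and `prune_to_budget` is a hypothetical name.

```python
def prune_to_budget(blocks, ranking, max_tokens, count_tokens=None):
    """Keep the best-ranked blocks that fit in max_tokens, in document order.

    `ranking` lists block indices from most to least relevant (an assumption).
    `count_tokens` defaults to a whitespace word count as a stand-in for a
    real tokenizer.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    kept = []
    used = 0
    for index in ranking:
        cost = count_tokens(blocks[index])
        if used + cost > max_tokens:
            break  # stop at the first block that no longer fits
        kept.append(index)
        used += cost
    # Restore the original document order for the surviving blocks.
    return [blocks[i] for i in sorted(kept)]


blocks = [
    "When was the bellagio built?",
    "Some other text Some other text",
    "The Bellagio was built in 1998.",
]
print(prune_to_budget(blocks, ranking=[0, 2, 1], max_tokens=12))
```

With a budget of 12 words the two relevant blocks fit (5 + 6 words) and the filler block is dropped, which mirrors the pruned output above.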
✂️ Prune HTML Blocks with Generative Model
from htmlrag import GenHTMLPruner
ckpt_path = "zstanjj/HTML-Pruner-Llama-1B"
gen_embed_pruner = GenHTMLPruner(gen_model=ckpt_path, max_node_words=10)
block_rankings = gen_embed_pruner.calculate_block_rankings(question, pruned_html)
print(block_rankings)
# [1, 0]
max_context_window = 32
pruned_html = gen_embed_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, max_context_window)
print(pruned_html)
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>