Purify HTML by filtering tags and classes
PureHTML
Install
pip install --upgrade purehtml
Usage
from purehtml import purify_html_files
from pathlib import Path

# Collect every .html file in the "samples" folder next to this script
html_root = Path(__file__).parent / "samples"
html_paths = list(html_root.glob("*.html"))

# Each result item holds the source path and its purified content
html_path_and_purified_content_list = purify_html_files(html_paths)
for item in html_path_and_purified_content_list:
    html_path = item["html_path"]
    purified_content = item["purified_content"]
    print(html_path)
    print(purified_content)
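To try the usage snippet without an existing "samples" folder, a couple of throwaway HTML files can be generated first. The sketch below is only an illustration: `write_sample_html`, `make_temp_samples`, and `purify_samples` are hypothetical helper names, not part of purehtml, and the sample markup is invented for the demo. The only library call it relies on is `purify_html_files(html_paths)`, as shown above.

```python
import tempfile
from pathlib import Path


def write_sample_html(root: Path) -> list[Path]:
    """Create two tiny HTML files under `root` and return their sorted paths."""
    # Invented demo pages: one with a class attribute, one with a link.
    pages = {
        "a.html": "<html><body><div class='ad'><p>Hello <b>world</b></p></div></body></html>",
        "b.html": "<html><body><p>See <a href='https://example.com'>this link</a></p></body></html>",
    }
    root.mkdir(parents=True, exist_ok=True)
    for name, html in pages.items():
        (root / name).write_text(html, encoding="utf-8")
    return sorted(root.glob("*.html"))


def make_temp_samples() -> list[Path]:
    """Write the sample pages into a fresh temporary directory."""
    return write_sample_html(Path(tempfile.mkdtemp()) / "samples")


def purify_samples(html_paths: list[Path]):
    """Run purehtml on the given paths (requires `pip install purehtml`)."""
    # Imported lazily so the helpers above work without purehtml installed.
    from purehtml import purify_html_files

    return purify_html_files(html_paths)
```

Calling `purify_samples(make_temp_samples())` should then yield items with the "html_path" and "purified_content" keys used in the loop above.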
What params should I choose in different scenarios?
Functions
purify_html_str(html_str: str)
purify_html_file(html_path: Union[Path, str])
purify_html_files(html_paths: list[Union[Path, str]])
Params
Here are the params of purify_html_files():
- verbose: bool (default False)
  - True: Output to console
  - False: No output to console
- output_format: str (default "html")
  - "html": Output HTML format (.html.pure)
  - "markdown": Output markdown format (.md)
- keep_href: bool (default False)
  - True: Keep href in <a> tags, and keep src in <img> tags
    - This is useful for detailed information retrieval
  - False: Do not keep href and src
- keep_format_tags: bool (default True)
  - True: Keep format tags
    - such as: <sub>, <sup>, <b>, <strong>, <em>, <a>, <i>, <u>, <mark>, <del>, <cite>, <blockquote>
    - This is useful for rendering HTML
  - False: Remove format tags
- keep_group_tags: bool (default True)
  - True: Keep group tags: <div>, <section>, <details>
    - This is useful for hierarchical processing, such as grouping texts in RAG
  - False: Remove group tags
- math_style: str (default "latex")
  - "latex": Convert math tags to LaTeX strings
    - This is useful for LLM and RAG
  - "latex_in_tag": Wrap the above LaTeX string in a tag: <div> for block math, <span> for inline math
    - This is useful for hierarchical processing
  - "html": Keep math formulas in MathML format
    - This is useful for rendering HTML
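The parameter list above can be mirrored in a small options object that rejects misspelled values before they reach the library. This is a sketch under stated assumptions: `PurifyOptions` and `OUTPUT_SUFFIXES` are hypothetical names that do not exist in purehtml, the defaults and allowed values are copied from the list above, and the suffix mapping simply restates the `.html.pure` / `.md` note there.

```python
from dataclasses import asdict, dataclass

# Allowed values and suffixes as listed in the params section (assumed complete).
OUTPUT_SUFFIXES = {"html": ".html.pure", "markdown": ".md"}
MATH_STYLES = ("latex", "latex_in_tag", "html")


@dataclass(frozen=True)
class PurifyOptions:
    """Hypothetical container for the keyword arguments of purify_html_files()."""

    verbose: bool = False
    output_format: str = "html"
    keep_href: bool = False
    keep_format_tags: bool = True
    keep_group_tags: bool = True
    math_style: str = "latex"

    def __post_init__(self) -> None:
        # Fail early on values the params list does not mention.
        if self.output_format not in OUTPUT_SUFFIXES:
            raise ValueError(f"unknown output_format: {self.output_format!r}")
        if self.math_style not in MATH_STYLES:
            raise ValueError(f"unknown math_style: {self.math_style!r}")

    def as_kwargs(self) -> dict:
        """Return the options as a plain dict suitable for ** unpacking."""
        return asdict(self)

    def output_suffix(self) -> str:
        """Return the file suffix documented for the chosen output format."""
        return OUTPUT_SUFFIXES[self.output_format]
```

With purehtml installed, the options could then be passed as `purify_html_files(html_paths, **PurifyOptions(keep_href=True).as_kwargs())`.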
For: LLM, RAG, text chunking and embedding
Hierarchical:
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=False,
    keep_format_tags=False,
    keep_group_tags=True,
    math_style="latex_in_tag",
)
Flat:
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=False,
    keep_format_tags=False,
    keep_group_tags=False,  # <--
    math_style="latex",  # <--
)
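To illustrate why the hierarchical setting keeps group tags, the sketch below splits an HTML string into one text chunk per outermost <div>, using only the standard library. It is a conceptual illustration, not PureHTML's implementation: `DivChunker` and `chunk_by_div` are hypothetical names, and it assumes purified output with keep_group_tags=True still contains <div> wrappers as described in the params list.

```python
from html.parser import HTMLParser


class DivChunker(HTMLParser):
    """Collect the text inside each outermost <div> as one chunk."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0      # current <div> nesting depth
        self.chunks = []    # finished chunks
        self._buf = []      # text pieces of the chunk being built

    def _flush(self) -> None:
        if self._buf:
            self.chunks.append(" ".join(self._buf))
            self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                # Text seen outside any <div> becomes its own chunk.
                self._flush()
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self._flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._buf.append(text)

    def close(self):
        super().close()
        self._flush()  # keep trailing text that follows the last <div>


def chunk_by_div(html: str) -> list[str]:
    """Return one text chunk per outermost <div> in `html`."""
    parser = DivChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

With the flat setting there are no <div> boundaries left to split on, so chunking would have to fall back to length or sentence boundaries instead.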
For: HTML rendering
With links:
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=True,  # <--
    keep_format_tags=True,  # <--
    keep_group_tags=True,  # <--
    math_style="html",  # <--
)
Without links: (This is the default config in dev)
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=False,  # <--
    keep_format_tags=True,
    keep_group_tags=True,
    math_style="html",
)
Even without any format:
results = purify_html_files(
    html_paths,
    verbose=False,
    output_format="html",
    keep_href=False,
    keep_format_tags=False,  # <--
    keep_group_tags=True,
    math_style="html",
)
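The five configurations above differ in only a few arguments, so they can be collected into named presets. The following is a minimal sketch: `SCENARIO_PRESETS`, `preset`, and the preset names are hypothetical and not part of purehtml; the values are copied from the scenarios above.

```python
# Shared starting point: the "without links" rendering configuration.
_BASE = {
    "verbose": False,
    "output_format": "html",
    "keep_href": False,
    "keep_format_tags": True,
    "keep_group_tags": True,
    "math_style": "html",
}

# Hypothetical preset names, one per scenario listed above.
SCENARIO_PRESETS = {
    "llm_hierarchical": {**_BASE, "keep_format_tags": False, "math_style": "latex_in_tag"},
    "llm_flat": {
        **_BASE,
        "keep_format_tags": False,
        "keep_group_tags": False,
        "math_style": "latex",
    },
    "render_with_links": {**_BASE, "keep_href": True},
    "render_without_links": dict(_BASE),
    "render_no_format": {**_BASE, "keep_format_tags": False},
}


def preset(name: str, **overrides) -> dict:
    """Return a copy of the named preset with optional overrides applied."""
    if name not in SCENARIO_PRESETS:
        raise KeyError(f"unknown preset {name!r}; choose from {sorted(SCENARIO_PRESETS)}")
    unknown = set(overrides) - set(_BASE)
    if unknown:
        raise TypeError(f"unknown parameter(s): {sorted(unknown)}")
    # Build a new dict so the stored preset is never mutated.
    return {**SCENARIO_PRESETS[name], **overrides}
```

A call would then look like `purify_html_files(html_paths, **preset("llm_flat"))`, with any single argument adjusted through an override such as `preset("llm_flat", verbose=True)`.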