Haruka Parser
A simple HTML Parser
Install
pip install haruka-parser
Usage
from haruka_parser.extract import extract_text
html = r"""<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<msup>
<mi>e</mi>
<mrow>
<mi>i</mi>
<mi>π</mi>
</mrow>
</msup>
<mo>=</mo>
<mn>-1</mn>
</math>
<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>
</body>
</html>"""
text, info = extract_text(html)
print(text)
print(info)
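The sample page above carries the same identity in three encodings: a MathJax `math/tex` script, a MathML `<math>` element, and a `math/asciimath` script. As a rough illustration of how those three sources can be told apart, here is a minimal standard-library sketch. It is not how haruka-parser works internally; `MathSourceFinder` and `find_math_sources` are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class MathSourceFinder(HTMLParser):
    """Record which kinds of math markup appear in an HTML document."""

    def __init__(self):
        super().__init__()
        self.found = []  # labels in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            # The type may carry parameters, e.g. "math/tex; mode=display".
            mime = (attrs.get("type") or "").split(";")[0].strip().lower()
            if mime == "math/tex":
                self.found.append("tex")
            elif mime == "math/asciimath":
                self.found.append("asciimath")
        elif tag == "math":
            self.found.append("mathml")


def find_math_sources(html_text):
    """Return the math encodings found in html_text, in document order."""
    finder = MathSourceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found
```

Calling `find_math_sources(html)` on the sample string from the usage example would list the three encodings in the order they appear in the page.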
Configurations
from haruka_parser.extract import DEFAULT_CONFIG
# The default values, reproduced here for reference:
DEFAULT_CONFIG = {
"readability": False,
"skip_large_links": False,
"extract_latex": True,
"extract_cnki_latex": False,
"escape_dollars": True,
"remove_buttons": True,
"remove_edit_buttons": True,
"remove_image_figures": True,
"markdown_code": True,
"markdown_headings": True,
"remove_chinese": False,
"boilerplate_config": {
"enable": False,
"ratio_threshold": 0.18,
"absolute_threshold": 10,
"end_threshold": 15,
},
}
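The listing above only shows the default values; it does not show how a modified configuration is handed to the parser, so that part is left out here. What can be sketched safely is building a customized copy without mutating the shared defaults, including the nested `boilerplate_config` dictionary. `BASE_CONFIG` and `with_overrides` are hypothetical names for this illustration, and `BASE_CONFIG` is a trimmed copy of the defaults rather than the library's own object.

```python
import copy

# A trimmed copy of the defaults listed above (assumption: values unchanged).
BASE_CONFIG = {
    "readability": False,
    "extract_latex": True,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def with_overrides(base, overrides):
    """Return a deep copy of base with overrides merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if key not in merged:
            # Reject typos instead of silently adding unknown keys.
            raise KeyError(f"unknown config key: {key!r}")
        if isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


custom = with_overrides(
    BASE_CONFIG,
    {"readability": True, "boilerplate_config": {"enable": True}},
)
```

Merging recursively means a partial `boilerplate_config` override keeps the remaining thresholds at their defaults instead of dropping them.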
Parsing speed
10k pages:
| Method | haruka-parser 0.5.2 | haruka-parser 0.4.9 | html2text | inscriptis | trafilatura |
|---|---|---|---|---|---|
| Speed | 379.4s | 391.6s | 272.8s | 114.7s | 343.9s |
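To make the totals easier to compare, the sketch below converts them to an average cost per page and shows a simple way to time any extractor over a list of pages. It assumes that "10k" means exactly 10,000 pages and that each figure is total wall-clock time for the whole run; the source states neither explicitly. `ms_per_page` and `time_extractor` are hypothetical helper names.

```python
import time

# Totals copied from the table above, in seconds.
TOTAL_SECONDS = {
    "haruka-parser 0.5.2": 379.4,
    "haruka-parser 0.4.9": 391.6,
    "html2text": 272.8,
    "inscriptis": 114.7,
    "trafilatura": 343.9,
}
PAGES = 10_000  # assumption: "10k" is exactly 10,000 pages


def ms_per_page(total_seconds, pages=PAGES):
    """Average milliseconds spent per page."""
    return total_seconds / pages * 1000.0


def time_extractor(extract, pages):
    """Return the wall-clock seconds taken to run extract over every page."""
    start = time.perf_counter()
    for page in pages:
        extract(page)
    return time.perf_counter() - start


for name, seconds in TOTAL_SECONDS.items():
    print(f"{name}: {ms_per_page(seconds):.2f} ms/page")
```

Under those assumptions, haruka-parser 0.5.2 averages roughly 38 ms per page and inscriptis roughly 11 ms per page.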
Download files
Source Distribution
haruka_parser-1.1.1.tar.gz (37.7 MB)
Built Distribution
haruka_parser-1.1.1-py3-none-any.whl (37.8 MB)
File details
Details for the file haruka_parser-1.1.1.tar.gz.
File metadata
- Download URL: haruka_parser-1.1.1.tar.gz
- Upload date:
- Size: 37.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc22c3e06ffad1b84a0065e70c992e137ccfe9411bfe1e854d42ba01a5c5761a |
| MD5 | cdef9d1ea33947f238dc4079821d40b9 |
| BLAKE2b-256 | 763d45d1d727e220c78bf4f278a6640c766374343b36e48aaf4895167e52f551 |
File details
Details for the file haruka_parser-1.1.1-py3-none-any.whl.
File metadata
- Download URL: haruka_parser-1.1.1-py3-none-any.whl
- Upload date:
- Size: 37.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 41c1f225ffcd4b56e29a8a278b768c50d43702bcca51ffdc98823befd44f4d85 |
| MD5 | 1d2038aec219ceff7ba9c393a058385e |
| BLAKE2b-256 | c01a337a43cc1eac6bcea68857a3a90a04fc544d66e5576d0349e78246eb70d8 |
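The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only `hashlib`; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the file is read in chunks because the archives are tens of megabytes.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """True if the file's SHA256 equals the published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("haruka_parser-1.1.1.tar.gz", "dc22c3e0...")` with the full SHA256 value listed for the source distribution should return `True` for an intact download.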