Haruka Parser

A simple HTML Parser

Install

pip install haruka-parser

Usage

from haruka_parser.extract import extract_text

html = """<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

text, info = extract_text(html)
print(text)
print(info)

Configurations

from haruka_parser.extract import DEFAULT_CONFIG

# Default values, listed here for reference:
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
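
The README does not show how a modified configuration is handed to the parser, so the following is only a sketch of one way to build an overridden copy of the defaults without mutating them. The helper name `merge_config` is hypothetical and not part of haruka-parser, and the dictionary here is a hand-copied subset of the defaults listed above so that the snippet runs on its own.

```python
import copy

# Subset of the defaults listed above, copied by hand for illustration.
# In real use this would come from `haruka_parser.extract.DEFAULT_CONFIG`.
DEFAULT_CONFIG = {
    "readability": False,
    "extract_latex": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def merge_config(defaults, overrides):
    """Return a new dict with `overrides` applied on top of `defaults`.

    Hypothetical helper: nested dicts are merged key by key, and unknown
    keys raise KeyError so that typos do not pass silently.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if key not in merged:
            raise KeyError(f"unknown config key: {key!r}")
        if isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


custom = merge_config(
    DEFAULT_CONFIG,
    {"readability": True, "boilerplate_config": {"enable": True}},
)
print(custom)
```

Merging into a deep copy leaves `DEFAULT_CONFIG` untouched, and the nested `boilerplate_config` keeps the thresholds that were not overridden.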

Parsing speed

10k pages:

method  haruka-parser 0.5.2  haruka-parser 0.4.9  html2text  inscriptis  trafilatura
Speed   379.4s               391.6s               272.8s     114.7s      343.9s
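
The README does not say how these timings were measured. As a rough sketch of how a comparable number could be collected, the snippet below times an extractor over a list of pages with `time.perf_counter`. The names `time_extractor` and `stdlib_extract` are hypothetical; `stdlib_extract` is a stand-in built on the standard library's `html.parser` and is not one of the tools in the table.

```python
import time
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of a document, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def stdlib_extract(html):
    """Stand-in extractor (hypothetical baseline for this sketch only)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


def time_extractor(extract, pages):
    """Return the total wall-clock seconds spent running `extract` on every page."""
    start = time.perf_counter()
    for page in pages:
        extract(page)
    return time.perf_counter() - start


# Synthetic input; a real benchmark would load the saved HTML pages instead.
pages = ["<html><body><p>Hello</p></body></html>"] * 1000
elapsed = time_extractor(stdlib_extract, pages)
print(f"{len(pages)} pages in {elapsed:.3f}s")
```

To time haruka-parser the same way, the callable passed in would be something like `lambda page: extract_text(page)`, using the import shown in the Usage section.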
