Haruka Parser

A simple HTML Parser

Install

pip install haruka-parser

Usage

from haruka_parser.extract import extract_text

html = r"""<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

text, info = extract_text(html)
print(text)
print(info)

Configurations

from haruka_parser.extract import DEFAULT_CONFIG

# Default values, listed here for reference:
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
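
The defaults above include a nested "boilerplate_config" dictionary, so overriding a single nested key with a plain dict update would discard its sibling keys. The following is a minimal sketch of a recursive merge that avoids this. The helper name merge_config and the local DEFAULTS copy are assumptions made for illustration and are not part of haruka-parser. How a custom configuration is passed to extract_text is not shown here, so the sketch stops at building the dictionary.

```python
import copy

# Local copy of a few defaults from the listing above, so the sketch runs
# without the package installed. With the package installed, DEFAULT_CONFIG
# could be imported instead.
DEFAULTS = {
    "extract_latex": True,
    "markdown_headings": True,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def merge_config(defaults, overrides):
    """Hypothetical helper: return defaults with overrides applied.

    Nested dictionaries are merged key by key; the inputs are not modified.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Turn off Markdown headings and enable boilerplate removal, keeping the
# default thresholds.
config = merge_config(
    DEFAULTS,
    {"markdown_headings": False, "boilerplate_config": {"enable": True}},
)
print(config)
```

Working on a deep copy keeps the shared defaults untouched, which matters if several configurations are built in the same process.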

Parsing speed

Time to parse 10k pages (lower is faster):

Method               Time
haruka-parser 0.5.2  379.4s
haruka-parser 0.4.9  391.6s
html2text            272.8s
inscriptis           114.7s
trafilatura          343.9s
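
The method behind the timings above is not described. As a rough sketch of how a comparable wall-clock measurement could be taken, the harness below times any extractor callable over a list of pages. The names time_extractor and stdlib_extract are hypothetical, and the standard-library html.parser stand-in is used only so the example runs without third-party packages; it is not one of the benchmarked tools.

```python
import time
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def stdlib_extract(html):
    """Stand-in extractor (assumption): joins all text nodes of the page."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


def time_extractor(extract, pages):
    """Return the wall-clock seconds spent calling extract on every page."""
    start = time.perf_counter()
    for page in pages:
        extract(page)
    return time.perf_counter() - start


# Synthetic pages; a real comparison would load the same saved HTML files
# for every tool being measured.
pages = [f"<html><body><p>page {i}</p></body></html>" for i in range(1000)]
elapsed = time_extractor(stdlib_extract, pages)
print(f"{len(pages)} pages in {elapsed:.3f}s")
```

To time another tool, pass its extraction function as the first argument, for example a small wrapper around extract_text that discards the returned info.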

Download files

Source Distribution

haruka_parser-1.1.1.tar.gz (37.7 MB)

Built Distribution

haruka_parser-1.1.1-py3-none-any.whl (37.8 MB)

File details

Details for the file haruka_parser-1.1.1.tar.gz.

File metadata

  • Download URL: haruka_parser-1.1.1.tar.gz
  • Upload date:
  • Size: 37.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for haruka_parser-1.1.1.tar.gz
Algorithm    Hash digest
SHA256       dc22c3e06ffad1b84a0065e70c992e137ccfe9411bfe1e854d42ba01a5c5761a
MD5          cdef9d1ea33947f238dc4079821d40b9
BLAKE2b-256  763d45d1d727e220c78bf4f278a6640c766374343b36e48aaf4895167e52f551
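
A downloaded file can be checked against the published SHA256 digest before it is installed. The sketch below uses only the standard-library hashlib module. The helper names sha256_of_file and matches_published_hash are illustrative, and the local file path in the usage note is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest published above for haruka_parser-1.1.1.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "dc22c3e06ffad1b84a0065e70c992e137ccfe9411bfe1e854d42ba01a5c5761a"
)


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

With the source distribution saved in the current directory, matches_published_hash("haruka_parser-1.1.1.tar.gz", EXPECTED_SDIST_SHA256) should return True for an intact download.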

File details

Details for the file haruka_parser-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: haruka_parser-1.1.1-py3-none-any.whl
  • Upload date:
  • Size: 37.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for haruka_parser-1.1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       41c1f225ffcd4b56e29a8a278b768c50d43702bcca51ffdc98823befd44f4d85
MD5          1d2038aec219ceff7ba9c393a058385e
BLAKE2b-256  c01a337a43cc1eac6bcea68857a3a90a04fc544d66e5576d0349e78246eb70d8
