A simple HTML Parser

Haruka Parser

Install

pip install haruka-parser

Usage

from haruka_parser.extract import extract_text

# Raw string so that backslashes in the TeX source (e.g. \pi) are kept verbatim.
html = r"""<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

text, info = extract_text(html)
print(text)
print(info)

Configurations

from haruka_parser.extract import DEFAULT_CONFIG

# The default values, reproduced here for reference:
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
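
To change a few options while keeping the remaining defaults, one approach is to copy the default dictionary and merge overrides into it, taking care with the nested "boilerplate_config" entry. The sketch below uses a trimmed local copy of the defaults so that it runs without the package installed. merge_config is a hypothetical helper, not part of haruka-parser, and how the resulting dictionary is passed to extract_text is not shown above, so check that against the library itself.

```python
import copy

# Trimmed local copy of the defaults listed above, so the sketch is self-contained.
DEFAULTS = {
    "readability": False,
    "extract_latex": True,
    "markdown_headings": True,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def merge_config(defaults, overrides):
    """Return a new dict: defaults updated with overrides, merging nested dicts.

    Hypothetical helper for illustration; not part of haruka-parser.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


config = merge_config(
    DEFAULTS,
    {"readability": True, "boilerplate_config": {"enable": True}},
)
print(config["readability"])
print(config["boilerplate_config"])
```

Merging recursively keeps the untouched thresholds in "boilerplate_config", which a plain dict.update() would discard, and the deep copy leaves the original defaults unmodified.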

Parsing speed

Time to parse 10k pages:

Method                 Time
haruka-parser 0.5.2    379.4s
haruka-parser 0.4.9    391.6s
html2text              272.8s
inscriptis             114.7s
trafilatura            343.9s
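
The source does not describe how these timings were measured. As a rough sketch of one way to time an extractor over a batch of pages, the snippet below measures wall-clock time with time.perf_counter. placeholder_extract is a hypothetical stand-in built on the standard library's html.parser, used only so the example runs on its own. It is not haruka-parser and not the benchmark used for the figures above.

```python
import time
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes; a stand-in extractor, not haruka-parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def placeholder_extract(html):
    """Hypothetical stand-in for the extractor being measured."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


def time_extractor(extract, pages):
    """Return the wall-clock seconds taken to run `extract` over all pages."""
    start = time.perf_counter()
    for page in pages:
        extract(page)
    return time.perf_counter() - start


# Synthetic pages; a real comparison would use the same saved pages for every tool.
pages = ["<html><body><p>Page %d</p></body></html>" % i for i in range(1000)]
elapsed = time_extractor(placeholder_extract, pages)
print(f"{len(pages)} pages in {elapsed:.3f}s")
```

To compare libraries, pass each one's extraction function as `extract` and reuse the same list of pages so the measurements are comparable.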

Download files

Source Distribution

haruka_parser-0.6.0.tar.gz (5.8 MB)

Uploaded Source

Built Distribution

haruka_parser-0.6.0-py3-none-any.whl (5.8 MB)

Uploaded Python 3

File details

Details for the file haruka_parser-0.6.0.tar.gz.

File metadata

  • Download URL: haruka_parser-0.6.0.tar.gz
  • Upload date:
  • Size: 5.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for haruka_parser-0.6.0.tar.gz
Algorithm Hash digest
SHA256 d0d75e9815d5ca6e2b73f7aca1619eda3128cf09df575353a57552a95a9f3075
MD5 7dedca42fd41af94c84822b6e56f276d
BLAKE2b-256 b1ce1be6e0d71ebbd0546762feeb23b709d9c8d0f2eabc2feb3f83fadf2ceb42
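
A downloaded archive can be checked against the published SHA256 digest with the standard library's hashlib. The sketch below is illustrative: sha256_of_file and matches_published_hash are hypothetical helper names, and the demonstration runs on a temporary file because the real archive is not available here.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published value, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demonstration on a temporary file, since the real archive is not bundled here.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    demo_path = tmp.name

demo_expected = hashlib.sha256(b"example payload").hexdigest()
demo_ok = matches_published_hash(demo_path, demo_expected)
os.remove(demo_path)
print(demo_ok)
```

For the real file, pass the path of the downloaded haruka_parser-0.6.0.tar.gz and the SHA256 value listed above as `expected_hex`.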

File details

Details for the file haruka_parser-0.6.0-py3-none-any.whl.

File hashes

Hashes for haruka_parser-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ed91b18f71f57c97ec717ea1b754225e0c1a718ae884f8caa74ee5ab3f6b1d35
MD5 c7094e10ab56ff6d38dcb76e977f1810
BLAKE2b-256 b66157078c606daa6cdc2ec3631d743fc07b10dca2af0b7ad1d34799bfe356ef
