Haruka Parser

A simple HTML Parser

Install

pip install haruka-parser

Usage

from haruka_parser.extract import extract_text

html = r"""<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

text, info = extract_text(html)
print(text)
print(info)
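The usage above passes an inline string; in practice the HTML often comes from a file on disk. The following is a minimal sketch of that case. `extract_file` is a hypothetical helper, not part of haruka-parser, and it takes the extractor as an argument so it relies only on the call shape shown above (`extract_text(html)` returning a `(text, info)` pair).

```python
from pathlib import Path


def extract_file(path, extractor):
    """Read an HTML file and run `extractor` on its contents.

    `extractor` is any callable that takes an HTML string and returns
    a (text, info) pair, like extract_text in the usage above.
    """
    # Undecodable bytes are replaced rather than raising, since scraped
    # pages often carry a wrong or missing charset declaration.
    html = Path(path).read_text(encoding="utf-8", errors="replace")
    return extractor(html)
```

With the package installed, this would be called as `text, info = extract_file("page.html", extract_text)`, where `page.html` is a placeholder path.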

Configurations

from haruka_parser.extract import DEFAULT_CONFIG

# The default values, reproduced here for reference:
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
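To change a few options without mutating the shared defaults, one approach is to deep-copy the dictionary and merge overrides into the copy. The sketch below is an assumption about how you might do that, not the library's own mechanism: `make_config` is a hypothetical helper, the dictionary is a local copy of the defaults listed above so the example runs on its own, and how a custom config is handed to `extract_text` is not shown in this description.

```python
import copy

# Local copy of the defaults listed above (with the package installed,
# import DEFAULT_CONFIG from haruka_parser.extract instead).
DEFAULTS = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def make_config(overrides, defaults=DEFAULTS):
    """Return a new config: a deep copy of `defaults` with `overrides` applied.

    Nested dicts (such as "boilerplate_config") are merged key by key.
    Unknown keys raise KeyError so typos are caught early.
    """
    config = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if key not in config:
            raise KeyError(f"unknown config key: {key!r}")
        if isinstance(config[key], dict) and isinstance(value, dict):
            unknown = set(value) - set(config[key])
            if unknown:
                raise KeyError(f"unknown keys in {key!r}: {sorted(unknown)}")
            config[key].update(value)
        else:
            config[key] = value
    return config


custom = make_config({"readability": True, "boilerplate_config": {"enable": True}})
```

The deep copy matters because `boilerplate_config` is a nested dictionary: a shallow copy would share it with the defaults, and editing the copy would silently change both.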

Download files

Source Distribution

haruka_parser-0.4.5.tar.gz (65.2 kB)

Uploaded Source

Built Distribution

haruka_parser-0.4.5-py3-none-any.whl (70.2 kB)

Uploaded Python 3

File details

Details for the file haruka_parser-0.4.5.tar.gz.

File metadata

  • Download URL: haruka_parser-0.4.5.tar.gz
  • Size: 65.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for haruka_parser-0.4.5.tar.gz

  • SHA256: 2620bc79f1799bd34d170cd03e716788a2464475f51464fc4f978d78d0d05096
  • MD5: 0d53e08d370a0e597771a71cac33989f
  • BLAKE2b-256: d9a5c1ed9437c67ead03d6e9a35824a7a7fa050b1ecb973d4c93b61472e7e335
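
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch: the function names are illustrative, and the path passed in is assumed to be wherever you saved the archive.

```python
import hashlib

# SHA256 digest listed above for haruka_parser-0.4.5.tar.gz
EXPECTED_SHA256 = "2620bc79f1799bd34d170cd03e716788a2464475f51464fc4f978d78d0d05096"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at end of file
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 equals `expected` (case-insensitive hex)."""
    return sha256_of(path) == expected.strip().lower()
```

For example, `matches_published_hash("haruka_parser-0.4.5.tar.gz")` should return `True` for an intact download of the source distribution; to check the wheel, pass its own SHA256 digest as `expected`.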

File details

Details for the file haruka_parser-0.4.5-py3-none-any.whl.

File hashes

Hashes for haruka_parser-0.4.5-py3-none-any.whl

  • SHA256: fd92290714829b5c1fa71455e37483a6f5cd5dbef6ddec8811667b9bf9b614a9
  • MD5: a3106288ebbb7e3237719fd8e111bf52
  • BLAKE2b-256: 67f2efa6a54f8532ac69065d15731fe69b63d6468443a0376373fdf6e4aee9e8
