Haruka Parser

A simple HTML Parser

Install

pip install haruka-parser

Usage

from haruka_parser.extract import extract_text

html = """<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

text, info = extract_text(html)
print(text)
print(info)
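
The call above handles one document. When several pages are processed, it can help to keep one failing page from stopping the run. The following is a minimal sketch, not part of haruka_parser: `extract_all` and `fake_extract` are hypothetical names introduced here, and the only thing assumed about the real function is what the usage above shows, namely that it takes an HTML string and returns a `(text, info)` pair.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def extract_all(
    pages: Iterable[str],
    extract: Callable[[str], Tuple[str, Any]],
) -> List[Dict[str, Any]]:
    """Apply an extract_text-style callable to each page, recording failures."""
    results = []
    for html in pages:
        try:
            text, info = extract(html)
        except Exception as exc:  # keep going; report the error for this page
            results.append({"ok": False, "error": repr(exc)})
        else:
            results.append({"ok": True, "text": text, "info": info})
    return results


# Hypothetical stand-in so the sketch runs without the package installed.
# With the package available, pass haruka_parser.extract.extract_text instead.
def fake_extract(html: str) -> Tuple[str, Dict[str, int]]:
    if not html.strip():
        raise ValueError("empty document")
    return html.upper(), {"length": len(html)}


results = extract_all(["<p>hi</p>", "   "], fake_extract)
for item in results:
    print(item)
```

Passing the extractor as an argument keeps the helper independent of the library, so it can be exercised with a stub as shown.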

Configurations

from haruka_parser.extract import DEFAULT_CONFIG

# The default values, reproduced here for reference:
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
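
Because the defaults include a nested `boilerplate_config` dictionary, changing a setting in place or through a shallow copy would also alter the shared defaults. The sketch below shows one way to derive a modified configuration safely. It is an illustration under stated assumptions: `make_config` is a hypothetical helper, not part of haruka_parser, and the dictionary is a local copy of the values listed above so the example runs without the package. How the resulting dictionary is handed to `extract_text` is not shown in this description, so check the function's signature before relying on it.

```python
import copy

# Local copy of the defaults listed above (assumption: kept in sync by hand).
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def make_config(overrides, base=DEFAULT_CONFIG):
    """Return a deep copy of `base` with `overrides` merged in.

    Nested dictionaries are merged key by key; unknown keys raise KeyError
    so that typos do not pass silently.
    """
    config = copy.deepcopy(base)
    for key, value in overrides.items():
        if key not in config:
            raise KeyError(f"unknown config key: {key!r}")
        if isinstance(config[key], dict) and isinstance(value, dict):
            unknown = set(value) - set(config[key])
            if unknown:
                raise KeyError(f"unknown keys in {key!r}: {sorted(unknown)}")
            config[key].update(value)
        else:
            config[key] = value
    return config


config = make_config({"readability": True, "boilerplate_config": {"enable": True}})
print(config["boilerplate_config"])
```

The deep copy leaves `DEFAULT_CONFIG` untouched, so later calls still start from the original defaults.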
