Haruka Parser

A simple HTML Parser

Install

pip install haruka-parser

Usage

from haruka_parser.extract import extract_text

html = r"""<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

text, info = extract_text(html)
print(text)
print(info)
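
To make the sample input above more concrete, here is a minimal standard-library sketch of what collecting LaTeX from MathJax `<script type="math/tex">` elements can look like. It is an illustration only and not haruka-parser's implementation; the names `MathScriptCollector` and `collect_tex` are hypothetical, and MathML and AsciiMath are deliberately ignored.

```python
from html.parser import HTMLParser


class MathScriptCollector(HTMLParser):
    """Collect the bodies of <script type="math/tex..."> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_math = False
        self.formulas = []

    def handle_starttag(self, tag, attrs):
        # MathJax 2 stores TeX source in script tags whose type starts with "math/tex".
        script_type = dict(attrs).get("type") or ""
        if tag == "script" and script_type.startswith("math/tex"):
            self._in_math = True
            self.formulas.append("")

    def handle_data(self, data):
        # Script bodies are passed through verbatim, so the TeX is not unescaped.
        if self._in_math:
            self.formulas[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_math = False


def collect_tex(html):
    """Return the TeX source of every math/tex script element in html."""
    parser = MathScriptCollector()
    parser.feed(html)
    parser.close()
    return [formula.strip() for formula in parser.formulas]


sample = (
    r'<p>Using MathJax:</p>'
    r'<script type="math/tex; mode=display">{e}^{i\pi }=-1</script>'
    r'<script type="math/asciimath">e^(i*pi) = -1</script>'
)
print(collect_tex(sample))
```

Only the `math/tex` script is collected here; the AsciiMath script is skipped because its type does not match.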

Configurations

from haruka_parser.extract import DEFAULT_CONFIG

# The default values, shown here for reference:
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
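
One way to derive a custom configuration from these defaults without mutating them is sketched below. The dictionary is reproduced from the listing above so that the snippet runs without the package installed, and `merged_config` is a hypothetical helper, not part of haruka-parser. How a configuration is actually passed to `extract_text` is not stated here, so check the library's source before relying on any particular keyword.

```python
import copy

# Defaults as listed above, reproduced so the sketch is self-contained.
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def merged_config(overrides, base=DEFAULT_CONFIG):
    """Return a deep copy of base with overrides applied (hypothetical helper).

    Nested dictionaries such as "boilerplate_config" are merged key by key,
    so overriding one threshold keeps the others at their defaults.
    """
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = {**result[key], **value}
        else:
            result[key] = value
    return result


config = merged_config({"readability": True, "boilerplate_config": {"enable": True}})
print(config["readability"], config["boilerplate_config"])
```

Copying first keeps the module-level defaults unchanged, which matters if several extraction runs with different settings share one process.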

Parsing speed

10k pages:

| Method | haruka-parser 0.5.2 | haruka-parser 0.4.9 | html2text | inscriptis | trafilatura |
| --- | --- | --- | --- | --- | --- |
| Time | 379.4s | 391.6s | 272.8s | 114.7s | 343.9s |
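
The benchmark setup is not described here. As a rough sketch of how such a comparison could be timed, the snippet below runs one extractor function over a list of pages and reports wall-clock time. `time_extractor` and `strip_tags` are hypothetical names, the pages are synthetic, and `strip_tags` is only a standard-library stand-in for a real extractor such as `extract_text`.

```python
import time
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text content of a document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Stand-in extractor: return the concatenated text nodes of html."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


def time_extractor(extract, pages):
    """Return the seconds spent calling extract on every page."""
    start = time.perf_counter()
    for page in pages:
        extract(page)
    return time.perf_counter() - start


# Synthetic input; a real benchmark would load saved HTML pages instead.
pages = ["<html><body><p>page %d</p></body></html>" % i for i in range(1000)]
elapsed = time_extractor(strip_tags, pages)
print(f"{len(pages)} pages in {elapsed:.3f}s")
```

To compare libraries, pass each library's extraction function as `extract` and feed all of them the same list of pages.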
