Haruka Parser

A simple HTML Parser

Install

pip install haruka-parser

Usage

from haruka_parser.extract import extract_text

html = r"""<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

text, info = extract_text(html)
print(text)
print(info)

Configurations

from haruka_parser.extract import DEFAULT_CONFIG

# Default values, shown here for reference:
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
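
The listing above only shows the default values; the source does not show how a modified configuration is handed to extract_text, so check the library for that step. As a minimal sketch of preparing overrides without mutating the defaults, the helper below merges a partial dict into a copy of them, including the nested "boilerplate_config" block. The names DEFAULTS, make_config and custom are hypothetical and exist only for this illustration; the dict is copied from the listing so the example runs without the package installed.

```python
import copy

# Copied from the Configurations listing; in real code you would start from
# haruka_parser.extract.DEFAULT_CONFIG instead of repeating the values.
DEFAULTS = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def make_config(overrides, defaults=DEFAULTS):
    """Hypothetical helper: return a new dict with overrides applied.

    Nested dicts are merged key by key; unknown keys raise KeyError so that
    a misspelled option is caught early. The defaults are never modified.
    """
    config = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if key not in config:
            raise KeyError(f"unknown config key: {key!r}")
        if isinstance(config[key], dict) and isinstance(value, dict):
            config[key] = make_config(value, config[key])
        else:
            config[key] = value
    return config


# Turn on readability and boilerplate removal, keep every other default.
custom = make_config({"readability": True, "boilerplate_config": {"enable": True}})
print(custom["boilerplate_config"])
```

Copying before merging keeps the shared defaults intact, which matters if several extraction jobs in one process use different settings.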

Parsing speed

10k pages:

method                 time
haruka-parser 0.5.2    379.4s
haruka-parser 0.4.9    391.6s
html2text              272.8s
inscriptis             114.7s
trafilatura            343.9s
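
To make the figures easier to compare, the short script below converts them to a per-page cost and a ratio against the fastest tool. It assumes that "10k" means 10,000 pages and that each figure is the total wall-clock time for the whole run; the source does not state the hardware or the measurement method. The names PAGES, TOTAL_SECONDS and per_page_ms are illustrative only.

```python
# Totals taken from the table above (seconds for the whole run).
PAGES = 10_000  # assumption: "10k" means 10,000 pages
TOTAL_SECONDS = {
    "haruka-parser 0.5.2": 379.4,
    "haruka-parser 0.4.9": 391.6,
    "html2text": 272.8,
    "inscriptis": 114.7,
    "trafilatura": 343.9,
}


def per_page_ms(total_seconds, pages=PAGES):
    """Average time per page in milliseconds."""
    return total_seconds / pages * 1000.0


fastest = min(TOTAL_SECONDS.values())

# Print the tools from fastest to slowest.
for name, seconds in sorted(TOTAL_SECONDS.items(), key=lambda item: item[1]):
    ratio = seconds / fastest
    print(f"{name:<22}{per_page_ms(seconds):7.2f} ms/page  {ratio:5.2f}x")
```

Under these assumptions haruka-parser 0.5.2 averages roughly 38 ms per page.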
