Haruka Parser

A simple HTML Parser

Install

pip install haruka-parser

Usage

from haruka_parser.extract import extract_text

html = r"""<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

text, info = extract_text(html)
print(text)
print(info)

Configurations

from haruka_parser.extract import DEFAULT_CONFIG

# The default values, shown here for reference:
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
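
A common need with a nested options dictionary like this one is to change a few values without mutating the shared defaults. The sketch below shows one way to do that. The helper name `make_config` is hypothetical and not part of haruka_parser, and the defaults are copied inline only so the snippet runs on its own. How a custom configuration is handed to `extract_text` is not shown above, so check the function's signature before relying on any particular keyword.

```python
import copy

# Defaults copied from the listing above so this sketch runs without the package.
# In real use, import DEFAULT_CONFIG from haruka_parser.extract instead.
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def make_config(overrides, base=DEFAULT_CONFIG):
    """Hypothetical helper: return a deep copy of `base` with `overrides` merged in."""
    config = copy.deepcopy(base)
    for key, value in overrides.items():
        if key not in config:
            # Reject misspelled option names instead of silently adding them.
            raise KeyError(f"unknown option: {key}")
        if isinstance(config[key], dict) and isinstance(value, dict):
            # Merge nested sections (e.g. boilerplate_config) key by key.
            config[key] = make_config(value, base=config[key])
        else:
            config[key] = value
    return config


# Enable readability mode and boilerplate removal; every other option keeps its default.
custom = make_config({"readability": True, "boilerplate_config": {"enable": True}})
```

Because the helper works on a deep copy, `DEFAULT_CONFIG` itself stays unchanged, which matters if several extraction runs in the same process need different settings.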

Download files

Source Distribution

haruka_parser-0.4.6.tar.gz (65.3 kB)

Uploaded Source

Built Distribution

haruka_parser-0.4.6-py3-none-any.whl (70.2 kB)

Uploaded Python 3

File details

Details for the file haruka_parser-0.4.6.tar.gz.

File metadata

  • Download URL: haruka_parser-0.4.6.tar.gz
  • Size: 65.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for haruka_parser-0.4.6.tar.gz
Algorithm    Hash digest
SHA256       c020391e1a119d57a8b34d4a4b051fd25e0130ffa8ddd28647475eb3494075b4
MD5          01cf897182e754cf28756e21a59ae6ca
BLAKE2b-256  113c8e009f520d0d9fc853833880f7587a7fdc8e83976d3a8de16decf981eff8

File details

Details for the file haruka_parser-0.4.6-py3-none-any.whl.

File hashes

Hashes for haruka_parser-0.4.6-py3-none-any.whl
Algorithm    Hash digest
SHA256       bc8393f434706e6526e0db113ec9caf8a41f77a22b8e4748227c8ec1c0f7246b
MD5          80e2022005c2d7de7250297882946eb9
BLAKE2b-256  1db10c4d18b5761cdcc9becf194506ebcb9d7dd633423dc29f2d886215b7b266
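
The digests listed for the two files can be used to check a manually downloaded copy. The following is a minimal sketch using only the standard library. The helper names `sha256_of` and `verify` are illustrative, and the sketch assumes the downloaded file keeps its original name.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file listings above.
EXPECTED_SHA256 = {
    "haruka_parser-0.4.6.tar.gz":
        "c020391e1a119d57a8b34d4a4b051fd25e0130ffa8ddd28647475eb3494075b4",
    "haruka_parser-0.4.6-py3-none-any.whl":
        "bc8393f434706e6526e0db113ec9caf8a41f77a22b8e4748227c8ec1c0f7246b",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's SHA256 matches the digest listed for its name."""
    path = Path(path)
    expected = EXPECTED_SHA256.get(path.name)
    if expected is None:
        raise KeyError(f"no known digest for {path.name}")
    return sha256_of(path) == expected
```

Calling `verify("haruka_parser-0.4.6.tar.gz")` on a downloaded archive returns `True` only when its contents match the published digest. pip can perform an equivalent check itself when a requirements file pins hashes and is installed with `--require-hashes`.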
