A simple HTML Parser

Project description

Haruka Parser

Install

pip install haruka-parser

Usage

from haruka_parser.extract import extract_text

html = r"""<!DOCTYPE html>
<html>
<body>
<!-- Using MathJax -->
<p>Using MathJax:</p>
<script type="math/tex; mode=display" id="MathJax-Element-1">{e}^{i\pi }=-1</script>
<!-- Using MathML -->
<p>Using MathML:</p>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>e</mi>
    <mrow>
      <mi>i</mi>
      <mi>&#x03C0;</mi>
    </mrow>
  </msup>
  <mo>=</mo>
  <mn>-1</mn>
</math>

<!-- Using AsciiMath -->
<p>Using AsciiMath:</p>
<script type="math/asciimath">
e^(i*pi) = -1
</script>

</body>
</html>"""

text, info = extract_text(html)
print(text)
print(info)
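To make the idea behind the MathJax case concrete, here is a minimal standard-library sketch that pulls the TeX source out of `<script type="math/tex">` elements. It is not haruka_parser's implementation, which the example above does not show; the names `MathScriptCollector` and `collect_tex` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class MathScriptCollector(HTMLParser):
    """Collect the TeX source stored in <script type="math/tex"> elements.

    Hypothetical helper for illustration; not part of haruka_parser.
    """

    def __init__(self):
        super().__init__()
        self.formulas = []    # list of (is_display, tex_source) tuples
        self._buffer = None   # text chunks collected while inside a math script
        self._display = False

    def handle_starttag(self, tag, attrs):
        script_type = dict(attrs).get("type") or ""
        if tag == "script" and script_type.startswith("math/tex"):
            self._buffer = []
            # MathJax marks display equations with "mode=display" in the type.
            self._display = "mode=display" in script_type

    def handle_data(self, data):
        # Script bodies are delivered as raw text, so the TeX arrives unescaped.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            tex = "".join(self._buffer).strip()
            self.formulas.append((self._display, tex))
            self._buffer = None


def collect_tex(html):
    """Return every (is_display, tex_source) pair found in the HTML string."""
    parser = MathScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.formulas
```

Calling `collect_tex(html)` on the sample document above would return the single MathJax formula; the MathML and AsciiMath blocks need their own handling, which this sketch leaves out.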

Configurations

from haruka_parser.extract import DEFAULT_CONFIG

# The default values, shown here for reference:
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}
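If only a few options need to change, one way to build a full configuration is to merge overrides into a copy of the defaults, including the nested `boilerplate_config`. The sketch below assumes nothing about how haruka_parser consumes the result, since the listing above does not show how a configuration is passed in. `merge_config` is a hypothetical helper, and the defaults are copied from the listing so that the block runs on its own.

```python
import copy

# Defaults copied from the listing above so this sketch is self-contained.
DEFAULT_CONFIG = {
    "readability": False,
    "skip_large_links": False,
    "extract_latex": True,
    "extract_cnki_latex": False,
    "escape_dollars": True,
    "remove_buttons": True,
    "remove_edit_buttons": True,
    "remove_image_figures": True,
    "markdown_code": True,
    "markdown_headings": True,
    "remove_chinese": False,
    "boilerplate_config": {
        "enable": False,
        "ratio_threshold": 0.18,
        "absolute_threshold": 10,
        "end_threshold": 15,
    },
}


def merge_config(defaults, overrides):
    """Return a new dict: defaults with overrides applied, nested dicts merged.

    Hypothetical helper for illustration; not part of haruka_parser.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if key not in merged:
            # Reject misspelled option names instead of silently ignoring them.
            raise KeyError(f"unknown config key: {key!r}")
        if isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


config = merge_config(
    DEFAULT_CONFIG,
    {"readability": True, "boilerplate_config": {"enable": True}},
)
```

The deep copy keeps `DEFAULT_CONFIG` itself unchanged, so later calls still start from the original defaults.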

Download files

Source Distribution

haruka_parser-0.5.0.tar.gz (5.8 MB)

Uploaded Source

Built Distribution

haruka_parser-0.5.0-py3-none-any.whl (5.8 MB)

Uploaded Python 3

File details

Details for the file haruka_parser-0.5.0.tar.gz.

File metadata

  • Download URL: haruka_parser-0.5.0.tar.gz
  • Size: 5.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for haruka_parser-0.5.0.tar.gz
Algorithm    Hash digest
SHA256       f0073450db2dd334a40fe681e1662b4e871e44c38b3192ad9e44de266c3a5e82
MD5          742871bdc9d6483a63d8942618956a4c
BLAKE2b-256  a51fa87bb707222e1dc23c1e7ffbcac09d1591033f4a9965262f4f15cfc5e87d
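
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. This is a minimal sketch; `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value would be the SHA256 digest listed above, for example `matches_published_hash("haruka_parser-0.5.0.tar.gz", "f0073450db2dd334a40fe681e1662b4e871e44c38b3192ad9e44de266c3a5e82")`.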

File details

Details for the file haruka_parser-0.5.0-py3-none-any.whl.

File hashes

Hashes for haruka_parser-0.5.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       c2aa042c22301e94f28a887574d9bb338d75aaa654acd431c9552e68acd86cd4
MD5          96b02d64073f11ddcd88a9e349f30c8a
BLAKE2b-256  5bb27fe35144124b75d350edbba4a6a4a9ee7ae4ba46be262931084f51905264
