HTML Cleaver 🍀🦫

Tool for parsing HTML into a chain of chunks with relevant headers.

The API entry point is in src/html_cleaver/cleaver.
The core algorithm and data structures are in src/html_cleaver/handler.

This is a "tree-capitator," if you will,
cleaving headers together while cleaving text apart.

Quickstart:

pip install html-cleaver

Optionally, if you're working with HTML that requires JavaScript rendering:
pip install selenium

Try an example from the command line:
python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/

Example usage:

Cleaving pages of varying difficulty:

from html_cleaver.cleaver import get_cleaver

# default parser is "lxml", for loose HTML
with get_cleaver() as cleaver:
    
    # handle chunk-events directly
    # (example of favorable structure yielding high-quality chunks)
    cleaver.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        print)
    
    # get collection of chunks
    # (example of moderate structure yielding medium-quality chunks)
    for c in cleaver.parse_chunk_sequence(
            ["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"]):
        print(c)
    
    # sequence of chunks from sequence of pages
    # (examples of challenging structure yielding poor-quality chunks)
    urls = [
        "https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
        "https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
    for c in cleaver.parse_chunk_sequence(urls):
        print(c)

# example of mitigating challenging structure by restricting which header levels are cleaved on
with get_cleaver("lxml", ["h4", "h5"]) as cleaver:
    cleaver.parse_events(
        ["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
        print)
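
Chunks can also be collected for downstream use (e.g. feeding an embedding or search pipeline). Below is a minimal sketch that writes each chunk to a JSON-lines file; it assumes only the parse_chunk_sequence API shown above, and stores each chunk via str() since the exact chunk fields are library-defined:

import json

from html_cleaver.cleaver import get_cleaver

# Sketch: persist chunks for downstream use.
# str(chunk) is used because the exact chunk fields are up to the library.
with get_cleaver() as cleaver, open("chunks.jsonl", "w", encoding="utf-8") as f:
    for chunk in cleaver.parse_chunk_sequence(
            ["https://plato.stanford.edu/entries/goedel/"]):
        f.write(json.dumps({"chunk": str(chunk)}) + "\n")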

Example usage with Selenium:

Using Selenium on a page that requires JavaScript to load its contents:

from html_cleaver.cleaver import get_cleaver

print("using default lxml produces very few chunks:")
with get_cleaver() as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)

print("using selenium produces many more chunks:")
with get_cleaver("selenium") as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
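
A quick way to see the difference is to count the chunks each parser yields for the same page. This sketch uses only the get_cleaver and parse_chunk_sequence calls shown above:

from html_cleaver.cleaver import get_cleaver

# Sketch: compare chunk counts from the lxml and selenium parsers.
url = "https://www.youtube.com/watch?v=rfscVS0vtbw"
for parser in ("lxml", "selenium"):
    with get_cleaver(parser) as cleaver:
        count = sum(1 for _ in cleaver.parse_chunk_sequence([url]))
    print(f"{parser}: {count} chunks")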

Development:

Testing:

Testing without Poetry:
pip install lxml
pip install selenium
python -m unittest discover -s src

Testing with Poetry:
poetry install
poetry run pytest
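
For orientation, here is a minimal sketch of what a smoke test against the public API could look like. The real test suite lives under src/, and an actual test would more likely use a local HTML fixture than a live URL:

import unittest

from html_cleaver.cleaver import get_cleaver

class TestCleaverSmoke(unittest.TestCase):
    def test_chunks_are_produced(self):
        # Hypothetical smoke test: the cleaver should yield at least one chunk.
        with get_cleaver() as cleaver:
            chunks = list(cleaver.parse_chunk_sequence(
                ["https://plato.stanford.edu/entries/goedel/"]))
        self.assertGreater(len(chunks), 0)

if __name__ == "__main__":
    unittest.main()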

Build:

Building from source:
rm dist/*
python -m build

Installing from the build:
pip install dist/*.whl

Publishing from the build:
python -m twine upload --skip-existing -u __token__ -p $TESTPYPI_TOKEN --repository testpypi dist/*
python -m twine upload --skip-existing -u __token__ -p $PYPI_TOKEN dist/*

