cleave html headers and text
HTML Cleaver 🍀🦫
Tool for parsing HTML into a chain of chunks with relevant headers.
The API entry-point is in src/html_cleaver/cleaver.
The logical algorithm and data-structures are in src/html_cleaver/handler.
This is a "tree-capitator" if you will,
cleaving headers together while cleaving text apart.
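To make the "headers together, text apart" idea concrete, here is a minimal sketch of the bookkeeping involved. It is not the code in src/html_cleaver/handler: the function name cleave_sketch and the (level, value) event format are assumptions made up for this illustration. Each text event becomes its own chunk, and each chunk carries the chain of headers that currently encloses it.

```python
def cleave_sketch(events):
    """Pair every text event with the chain of headers above it.

    `events` is a hypothetical input format for this sketch:
    (level, title) for a header of that level (1 for h1, 2 for h2, ...)
    and (None, text) for a run of body text.
    """
    chain = []   # (level, title) pairs, outermost header first
    chunks = []  # (list of header titles, text) pairs
    for level, value in events:
        if level is None:
            # Text is cleaved apart: one chunk per run of text,
            # tagged with a copy of the current header chain.
            chunks.append(([title for _, title in chain], value))
        else:
            # Headers are cleaved together: a new header replaces any
            # header at the same or a deeper level, then joins the chain.
            while chain and chain[-1][0] >= level:
                chain.pop()
            chain.append((level, value))
    return chunks


events = [
    (1, "Page"), (None, "Intro."),
    (2, "Life"), (None, "Born 1906."),
    (2, "Work"), (None, "Logic."),
]
for headers, text in cleave_sketch(events):
    print(" > ".join(headers), "|", text)
```

The chain behaves like a stack, so a new h2 drops the previous h2 (and anything nested under it) while keeping the enclosing h1.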
Quickstart:
pip install html-cleaver
Optionally, if you're working with HTML that requires JavaScript rendering:
pip install selenium
Testing an example on the command-line:
python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/
Example usage:
Cleaving pages of varying difficulty:
from html_cleaver.cleaver import get_cleaver
# default parser is "lxml" for loose html
with get_cleaver() as cleaver:
    # handle chunk-events directly
    # (example of favorable structure yielding high-quality chunks)
    cleaver.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        print)
    # get collection of chunks
    # (example of moderate structure yielding medium-quality chunks)
    for c in cleaver.parse_chunk_sequence(
            ["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"]):
        print(c)
    # sequence of chunks from sequence of pages
    # (examples of challenging structure yielding poor-quality chunks)
    urls = [
        "https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
        "https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
    for c in cleaver.parse_chunk_sequence(urls):
        print(c)
# example of mitigating/improving challenging structure by focusing on certain headers
with get_cleaver("lxml", ["h4", "h5"]) as cleaver:
    cleaver.parse_events(
        ["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
        print)
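Why focusing on certain headers can help is easier to see on a small input. The sketch below is a toy built on Python's standard html.parser, not html-cleaver's implementation. The class FocusedHeaderSketch and the function focused_chunks are hypothetical names, and the sketch assumes that tags outside the focus list are simply treated as ordinary text. How html-cleaver itself handles those tags is not described here.

```python
from html.parser import HTMLParser


class FocusedHeaderSketch(HTMLParser):
    """Toy chunker that only treats the tags in `header_tags` as headers."""

    def __init__(self, header_tags):
        super().__init__()
        self.header_tags = list(header_tags)  # ordered outermost to innermost
        self.chain = []    # (rank, title) pairs, outermost first
        self.chunks = []   # (list of titles, text) pairs
        self._rank = None  # rank of the header being read, or None
        self._buffer = []  # text pieces collected since the last boundary

    def _take_buffer(self):
        # Join the collected pieces and normalize whitespace.
        text = " ".join(" ".join(self._buffer).split())
        self._buffer = []
        return text

    def _emit(self):
        text = self._take_buffer()
        if text:
            self.chunks.append(([t for _, t in self.chain], text))

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._emit()  # a focused header closes the current chunk
            self._rank = self.header_tags.index(tag)

    def handle_endtag(self, tag):
        if tag in self.header_tags and self._rank is not None:
            title = self._take_buffer()
            # Replace headers of the same or deeper rank.
            while self.chain and self.chain[-1][0] >= self._rank:
                self.chain.pop()
            self.chain.append((self._rank, title))
            self._rank = None

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._emit()  # flush the text after the last header


def focused_chunks(html, header_tags):
    parser = FocusedHeaderSketch(header_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks


page = ("<h1>Book</h1><p>Front matter.</p>"
        "<h4>Chapter I</h4><p>First.</p>"
        "<h5>Section 1</h5><p>Second.</p>"
        "<h4>Chapter II</h4><p>Third.</p>")
for headers, text in focused_chunks(page, ["h4", "h5"]):
    print(headers, text)
```

In this toy, the h1 is not in the focus list, so its text ends up in the first, header-less chunk, while the chapter and section headers drive the chunk boundaries.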
Example usage with Selenium:
Using Selenium on a page that requires JavaScript to load its content:
from html_cleaver.cleaver import get_cleaver
print("using default lxml produces very few chunks:")
with get_cleaver() as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
print("using selenium produces many more chunks:")
with get_cleaver("selenium") as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
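Because selenium is an optional install, a script may want to fall back to the default parser when it is missing. The helper below is a minimal sketch of that check using only the standard library. The name choose_parser and its prefer_javascript parameter are hypothetical and not part of html-cleaver. The strings "lxml" and "selenium" are the parser names used in the examples above.

```python
import importlib.util


def choose_parser(prefer_javascript=True):
    """Return "selenium" if it is wanted and importable, otherwise "lxml"."""
    if prefer_javascript and importlib.util.find_spec("selenium") is not None:
        return "selenium"
    return "lxml"


parser_name = choose_parser()
print("parser to use:", parser_name)
# The result could then be passed on, for example:
#   with get_cleaver(parser_name) as cleaver: ...
```

This only checks that the selenium package can be imported. It does not verify that a browser driver is available.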
Development:
Testing:
Testing without Poetry:
pip install lxml
pip install selenium
python -m unittest discover -s src
Testing with Poetry:
poetry install
poetry run pytest
Build:
Building from source:
rm dist/*
python -m build
Installing from the build:
pip install dist/*.whl
Publishing from the build:
python -m twine upload --skip-existing -u __token__ -p $TESTPYPI_TOKEN --repository testpypi dist/*
python -m twine upload --skip-existing -u __token__ -p $PYPI_TOKEN dist/*
Hashes for html_cleaver-0.3.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | f3640fcd887796578f8b7bd4017cb81f27729017020d0dff7ff00d64eae0119a
MD5 | 98501c231d07c0269970a61ac70f8765
BLAKE2b-256 | 9cc194291576bb6d92b8bf2d2d82dc9f85c1c76ba30df8f1e27f84c04d9a013a