HTML Cleaver 🍀🦫
Tool for parsing HTML into a chain of chunks with relevant headers.
The API entry point is in `src/html_cleaver/cleaver`.
The core algorithm and data structures are in `src/html_cleaver/handler`.
This is a "tree-capitator" if you will,
cleaving headers together while cleaving text apart.
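The core idea can be illustrated without this library: walk the HTML, keep a chain of the most recent headers at each level, and attach that chain to every text block. A minimal stdlib sketch of the concept (hypothetical illustration, not this library's actual code):

```python
from html.parser import HTMLParser

class HeaderChainChunker(HTMLParser):
    """Attach the chain of most recent h1..h6 headers to each text block."""
    HEADERS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.chain = {}         # header level -> header text
        self.chunks = []        # (header chain, text) pairs
        self._in_header = None  # level of the header currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADERS:
            self._in_header = self.HEADERS[tag]
            # a new header at level n invalidates all deeper headers
            self.chain = {k: v for k, v in self.chain.items()
                          if k < self._in_header}

    def handle_endtag(self, tag):
        if tag in self.HEADERS:
            self._in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_header:
            self.chain[self._in_header] = text
        else:
            headers = [self.chain[k] for k in sorted(self.chain)]
            self.chunks.append((headers, text))

p = HeaderChainChunker()
p.feed("<h1>A</h1><h2>B</h2><p>one</p><h2>C</h2><p>two</p>")
print(p.chunks)  # [(['A', 'B'], 'one'), (['A', 'C'], 'two')]
```

Note how the second `<h2>` replaces its sibling in the chain while the `<h1>` persists: headers are cleaved together, text is cleaved apart.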
Quickstart:

```shell
pip install html-cleaver
```

Optionally, if you're working with HTML that requires JavaScript rendering:

```shell
pip install selenium
```

Testing an example on the command line:

```shell
python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/
```
Example usage:

Cleaving pages of varying difficulty:

```python
from html_cleaver.cleaver import get_cleaver

# default parser is "lxml" for loose html
with get_cleaver() as cleaver:
    # handle chunk-events directly
    # (example of favorable structure yielding high-quality chunks)
    cleaver.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        print)

    # get a collection of chunks
    # (example of moderate structure yielding medium-quality chunks)
    for c in cleaver.parse_chunk_sequence(
            ["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"]):
        print(c)

    # sequence of chunks from a sequence of pages
    # (examples of challenging structure yielding poor-quality chunks)
    l = [
        "https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
        "https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
    for c in cleaver.parse_chunk_sequence(l):
        print(c)

# example of mitigating challenging structure by focusing on specific headers
with get_cleaver("lxml", ["h4", "h5"]) as cleaver:
    cleaver.parse_events(
        ["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
        print)
```
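Because `parse_events` simply invokes the given callback once per chunk, any callable can replace `print`; for instance, `list.append` collects chunks without buffering logic of your own. A minimal sketch of that callback pattern (the `parse_events_stub` function below is a stand-in for illustration, not this library's code):

```python
# Stand-in for cleaver.parse_events: it pushes each chunk to a callback,
# so swapping print for list.append turns streaming into collecting.
def parse_events_stub(chunk_source, callback):
    for chunk in chunk_source:  # hypothetical stream of chunks
        callback(chunk)

chunks = []
parse_events_stub(["chunk-1", "chunk-2"], chunks.append)
print(chunks)  # ['chunk-1', 'chunk-2']
```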
Example usage with Selenium:

Using Selenium on a page that requires JavaScript to load its contents:

```python
from html_cleaver.cleaver import get_cleaver

print("using default lxml produces very few chunks:")
with get_cleaver() as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)

print("using selenium produces many more chunks:")
with get_cleaver("selenium") as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)
```
Development:

Testing without Poetry:

```shell
pip install lxml
pip install selenium
python -m unittest discover -s src
```

Testing with Poetry:

```shell
poetry install
poetry run pytest
```
Build:

Building from source:

```shell
rm dist/*
python -m build
```

Installing from the build:

```shell
pip install dist/*.whl
```

Publishing from the build:

```shell
python -m twine upload --skip-existing -u __token__ -p $TESTPYPI_TOKEN --repository testpypi dist/*
python -m twine upload --skip-existing -u __token__ -p $PYPI_TOKEN dist/*
```