Simple Python content extractor for HTML
Project description
peduncle
A very simple DOM-based HTML content extraction tool that strips the boilerplate dressing from a web page[1].
Easy but usable.
Works with Python 3.7+.
[1] The term comes from dragnet.
install
pip install peduncle
usage
import requests
from peduncle.peduncle import extract_text
# obtain the raw html
url = "https://blog.rust-lang.org/2023/05/29/RustConf.html"
html = requests.get(url).text
# extract
print(extract_text(html))
benchmark
data
Benchmark data comes from dragnet_data, which contains 1381 web pages.
result
alpha | similarity | 95%hit_rate | avg_length_gap(char) | length_gap_std |
---|---|---|---|---|
a=0.01 | 0.5767456743946341 | 0.22 | -4673.118 | 15343.704819895227 |
a=0.25 | 0.8451692708814662 | 0.548 | -2082.988 | 14502.183923390849 |
a=0.5 | 0.8226224698726087 | 0.47 | -368.696 | 8452.075615349402 |
a=0.99 | 0.7527591593485807 | 0.292 | 1614.306 | 7917.618208044891 |
- a: alpha, controls how strongly the content extractor tends to extract larger pieces of content
- similarity: cosine similarity between the sparse vectors of the answer text and the extracted text
- 95%hit_rate: share of pages whose similarity is larger than 95%
- avg_length_gap(char): extracted text length minus answer text length, in characters
- length_gap_std: standard deviation of the length gap
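The metrics above can be sketched roughly as follows. This is an illustration only, not the benchmark script used for the table: the whitespace tokenizer, the use of the population standard deviation, and the function names `cosine_similarity` and `summarize` are all assumptions.

```python
import math
import statistics
from collections import Counter


def cosine_similarity(answer, extracted):
    """Cosine similarity between sparse term-count vectors of two texts."""
    # Assumption: lowercase whitespace tokens; the real tokenizer is not documented.
    a = Counter(answer.lower().split())
    b = Counter(extracted.lower().split())
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def summarize(pairs, threshold=0.95):
    """Aggregate metrics over (answer_text, extracted_text) pairs."""
    pairs = list(pairs)
    sims = [cosine_similarity(ans, ext) for ans, ext in pairs]
    # Length gap: extracted text length minus answer text length, in characters.
    gaps = [len(ext) - len(ans) for ans, ext in pairs]
    return {
        "similarity": statistics.mean(sims),
        "95%hit_rate": sum(s > threshold for s in sims) / len(sims),
        "avg_length_gap(char)": statistics.mean(gaps),
        # Assumption: population standard deviation.
        "length_gap_std": statistics.pstdev(gaps),
    }


if __name__ == "__main__":
    sample = [
        ("hello world", "hello world"),
        ("main article text", "menu main article text footer"),
    ]
    print(summarize(sample))
```

A negative average length gap means the extractor returned less text than the reference answer on average, and a positive one means it kept extra material.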
Download files
Source Distribution
peduncle-0.0.2.tar.gz (5.4 kB)
Built Distribution
peduncle-0.0.2-py3-none-any.whl (4.1 kB)
File details
Details for the file peduncle-0.0.2.tar.gz.
File metadata
- Download URL: peduncle-0.0.2.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | cd0dc6a7d9bad888ebb819db7ef8903839c4e31fdf38414e8b02110e904b068b |
MD5 | 8b864faba56d8c887e279a0bc3ba8431 |
BLAKE2b-256 | 488da29dce88ccf0fad1da6b212a0cfe2a4d608ed2703c822ab9359a83a63fdf |
File details
Details for the file peduncle-0.0.2-py3-none-any.whl.
File metadata
- Download URL: peduncle-0.0.2-py3-none-any.whl
- Upload date:
- Size: 4.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 1a4ec0219ff0accfdcf1f3d6b525aebb2412c75cbe4a4937307f1863c6a284c6 |
MD5 | f2f369028eeee35bf6444bf746d42754 |
BLAKE2b-256 | a54ad359f0c57754bbfe71cd0494f2d3c6bbfdc87f9c82294e09b42180269082 |