
Simple Python content extractor for HTML


peduncle

A very simple DOM-based HTML content extraction tool that strips the boilerplate dressing from a web page[1].

Easy but usable.

Works with Python 3.7+.

[1] The term comes from dragnet.

install

pip install peduncle

usage

import requests
from peduncle.peduncle import extract_text

# obtain the raw html
url = "https://blog.rust-lang.org/2023/05/29/RustConf.html"
html = requests.get(url).text

# extract
print(extract_text(html))

benchmark

data

The benchmark data comes from dragnet_data, which contains 1381 web pages.

result

alpha    similarity           95%hit_rate   avg_length_gap(char)   length_gap_std
a=0.01   0.5767456743946341   0.22          -4673.118              15343.704819895227
a=0.25   0.8451692708814662   0.548         -2082.988              14502.183923390849
a=0.5    0.8226224698726087   0.47          -368.696               8452.075615349402
a=0.99   0.7527591593485807   0.292         1614.306               7917.618208044891
  • a: alpha, controls how strongly the content extractor tends to extract larger content pieces
  • similarity: cosine similarity between the sparse vectors of the answer text and the extracted text
  • 95%hit_rate: fraction of pages whose similarity is larger than 95%
  • avg_length_gap(char): average of (extracted text length - answer text length), in characters
  • length_gap_std: standard deviation of the length gap
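
The metrics above could be computed roughly as follows. This is a minimal sketch, not the benchmark script used for the table: the tokenizer (lowercased `\w+` word counts), the population standard deviation, and the helper names `bag_of_words`, `cosine_similarity`, and `summarize` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def bag_of_words(text):
    """Sparse term-count vector (token -> count). Tokenization is an assumption."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(answer, extracted):
    """Cosine similarity between the sparse term-count vectors of two texts."""
    va, vb = bag_of_words(answer), bag_of_words(extracted)
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, treat as no overlap
    return dot / (norm_a * norm_b)


def summarize(pairs, threshold=0.95):
    """Aggregate metrics over (answer_text, extracted_text) pairs."""
    pairs = list(pairs)
    sims = [cosine_similarity(ans, ext) for ans, ext in pairs]
    gaps = [len(ext) - len(ans) for ans, ext in pairs]
    return {
        "similarity": mean(sims),
        "95%hit_rate": sum(s > threshold for s in sims) / len(sims),
        "avg_length_gap(char)": mean(gaps),
        # Population std is an assumption; the source does not say which was used.
        "length_gap_std": pstdev(gaps),
    }
```

With this reading, a negative avg_length_gap(char) means the extractor returned less text than the reference answer on average, and a positive one means it returned more.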
