A Python package to extract the main content from HTML documents

Project description

HTML Content Extractor

This Python package provides a function to extract the "main content" from HTML documents.

Relevance is determined by a heuristic that favors the deepest parent element containing the most h1, h2, h3 and p tags.
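To make the heuristic concrete, here is a minimal sketch of one way such a scoring pass could work. It is not the package's actual implementation. It assumes that "most tags" means the number of direct h1, h2, h3 and p children, that ties go to the more deeply nested element, and that the markup is well formed with explicit closing tags. The names `ParentScorer` and `find_main_parent` are hypothetical, and only the standard library is used.

```python
from html.parser import HTMLParser

CONTENT_TAGS = {"h1", "h2", "h3", "p"}
# Void elements never get a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ParentScorer(HTMLParser):
    """Hypothetical scorer: counts direct h1/h2/h3/p children of every element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, each as [tag, attrs, score]
        self.best = None  # (score, depth, tag, attrs) of the best parent so far

    def handle_starttag(self, tag, attrs):
        # A content tag adds one point to its direct parent.
        if self.stack and tag in CONTENT_TAGS:
            self.stack[-1][2] += 1
        if tag not in VOID_TAGS:
            self.stack.append([tag, dict(attrs), 0])

    def handle_endtag(self, tag):
        if tag not in [entry[0] for entry in self.stack]:
            return  # stray closing tag, ignore it
        # Close elements until the matching open tag has been popped.
        while self.stack:
            if self._pop() == tag:
                break

    def _pop(self):
        depth = len(self.stack)  # 1 for the outermost element
        tag, attrs, score = self.stack.pop()
        # Higher score wins; on equal scores the deeper element wins.
        if score > 0 and (self.best is None or (score, depth) > self.best[:2]):
            self.best = (score, depth, tag, attrs)
        return tag

    def close(self):
        super().close()
        while self.stack:  # score anything left open at end of input
            self._pop()


def find_main_parent(html):
    """Return (score, depth, tag, attrs) for the best parent, or None."""
    scorer = ParentScorer()
    scorer.feed(html)
    scorer.close()
    return scorer.best


sample = (
    '<html><body><nav><p>Menu</p></nav>'
    '<div id="wrap"><article id="post">'
    '<h1>Title</h1><p>One</p><p>Two</p>'
    '</article></div></body></html>'
)
print(find_main_parent(sample))
```

For the sample page, the article element scores 3 (one h1 and two p children) and beats the nav element, which scores 1. Counting all descendants instead of direct children would be an equally plausible reading of the description; it would make outer wrappers tie with inner ones, which is where the depth tie-break matters most.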

Installation

You can install this package via pip:

$ pip install html-content-extractor

Usage

>>> from html_content_extractor import extract_content

>>> html = "<div><h1>An HTML Page</h1><p>This is some HTML content.</p></div>"
>>> content = extract_content(html, format='plaintext')
>>> content
'An HTML Page\n\nThis is some HTML content.'

>>> markdown = extract_content(html, format='markdown')
>>> markdown
'# An HTML Page\n\nThis is some HTML content.'
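
In practice the HTML usually comes from a file rather than a string literal. The sketch below shows one way to wrap the call; `convert_file` is a hypothetical helper, not part of the package. The extractor is passed in as an argument so the example runs without the package installed, and `_stand_in_extract` is a placeholder that only mimics the `extract_content(html, format=...)` call shape shown above.

```python
import tempfile
from pathlib import Path


def convert_file(src, dst, extract, fmt="markdown"):
    """Read HTML from src, extract its main content, and write it to dst."""
    html = Path(src).read_text(encoding="utf-8")
    text = extract(html, format=fmt)
    Path(dst).write_text(text, encoding="utf-8")
    return text


def _stand_in_extract(html, format="plaintext"):
    # Placeholder only: it does no real extraction, it just lets the sketch run.
    return f"[{format}] {len(html)} characters of HTML"


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "page.html"
    dst = Path(tmp) / "page.md"
    src.write_text(
        "<div><h1>An HTML Page</h1><p>This is some HTML content.</p></div>",
        encoding="utf-8",
    )
    written = convert_file(src, dst, _stand_in_extract)
    saved = dst.read_text(encoding="utf-8")
```

With the package installed, passing `extract_content` as the `extract` argument in place of the placeholder should write the extracted Markdown to the destination file.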

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

html_content_extractor-0.0.3.tar.gz (8.4 kB)

Uploaded Source

Built Distribution

html_content_extractor-0.0.3-py3-none-any.whl (12.1 kB)

Uploaded Python 3

File details

Details for the file html_content_extractor-0.0.3.tar.gz.

File metadata

  • Download URL: html_content_extractor-0.0.3.tar.gz
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for html_content_extractor-0.0.3.tar.gz
Algorithm    Hash digest
SHA256       1f31a04a293a0600ed06753399f3267d488ecc24cad6f06e328189901bef9821
MD5          406bf3cc5edee47d885889fa50e8ed22
BLAKE2b-256  fc29ffbe2ae1024a72a1599e0b050390cfb04518c03c61ca229221279d829d6f

File details

Details for the file html_content_extractor-0.0.3-py3-none-any.whl.

File hashes

Hashes for html_content_extractor-0.0.3-py3-none-any.whl
Algorithm    Hash digest
SHA256       9ced23f77bdf0e8d6917415a5a03c505ead700f55cc7bf1e801525fa293fdc84
MD5          cc7818d541743861c0c0d663486406e9
BLAKE2b-256  736c8039dec5861af227c12468769cd80082566d4c3f1deb99f586ce360cb654
