A Python package to extract the main content from HTML documents

Development Status
- 3 - Alpha
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Text Processing :: Markup :: HTML

Project description

HTML Content Extractor

This Python package provides a function to extract the "main content" from HTML documents.

The main content is selected by a heuristic that favors the deepest parent element containing the most h1, h2, h3 and p tags.
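To make the heuristic above more concrete, here is a minimal sketch of one way it could be read. It is not the package's actual implementation: the names `TreeScorer` and `pick_main_container` are hypothetical, and the scoring rule (count direct h1/h2/h3/p children, break ties by depth) is an assumption about what "deepest parent with the most tags" means. It uses only the standard library and expects reasonably well-formed HTML.

```python
from html.parser import HTMLParser

CONTENT_TAGS = {"h1", "h2", "h3", "p"}
# Void elements never get a closing tag, so they must not become open nodes.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1
        self.score = 0  # number of direct h1/h2/h3/p children


class TreeScorer(HTMLParser):
    """Builds a lightweight element tree and scores each parent (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.nodes = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.current.score += 1  # credit the direct parent
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.nodes.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name and close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def pick_main_container(html):
    """Return (tag, score, depth) of the best-scoring parent element."""
    parser = TreeScorer()
    parser.feed(html)
    parser.close()
    # Assumption: highest score wins, and deeper elements win ties.
    best = max(parser.nodes, key=lambda n: (n.score, n.depth))
    return best.tag, best.score, best.depth
```

In this reading, a navigation block with a single p tag loses to an article element holding a heading and several paragraphs, and an inner wrapper beats an outer one when both hold the same number of content tags.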

Installation

You can install this package via pip:

$ pip install html-content-extractor

Usage

>>> from html_content_extractor import extract_content
>>> html = "<div><h1>An HTML Page</h1><p>This is some HTML content.</p></div>"
>>> content = extract_content(html, format='plaintext')
>>> print(content)
An HTML Page

This is some HTML content.

>>> markdown = extract_content(html, format='markdown')
>>> print(markdown)
# An HTML Page

This is some HTML content.

Release history

- 0.0.3 (this version): Jul 30, 2023
- 0.0.2: Jul 28, 2023
- 0.0.1: Jul 28, 2023

Download files

Source Distribution

html_content_extractor-0.0.3.tar.gz (8.4 kB)

Uploaded Jul 30, 2023 Source

Built Distribution

html_content_extractor-0.0.3-py3-none-any.whl (12.1 kB)

Uploaded Jul 30, 2023 Python 3

File details

Details for the file html_content_extractor-0.0.3.tar.gz.

File metadata

Download URL: html_content_extractor-0.0.3.tar.gz
Upload date: Jul 30, 2023
Size: 8.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for html_content_extractor-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`1f31a04a293a0600ed06753399f3267d488ecc24cad6f06e328189901bef9821`
MD5	`406bf3cc5edee47d885889fa50e8ed22`
BLAKE2b-256	`fc29ffbe2ae1024a72a1599e0b050390cfb04518c03c61ca229221279d829d6f`
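
The digests listed above can be checked against a downloaded file. The following is a minimal sketch using only the standard library; `file_digest` and `matches` are hypothetical helper names, and treating "BLAKE2b-256" as BLAKE2b with a 32-byte digest is an assumption about how the index computes it.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks and return the hex digest."""
    if algorithm == "blake2b-256":
        # Assumption: BLAKE2b-256 means BLAKE2b truncated to a 32-byte digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)  # e.g. "sha256" or "md5"
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """Return True if the file's digest equals the published one."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Example (requires the sdist to be present in the working directory):
# matches(
#     "html_content_extractor-0.0.3.tar.gz",
#     "1f31a04a293a0600ed06753399f3267d488ecc24cad6f06e328189901bef9821",
# )
```

SHA256 is the digest to prefer for this check; MD5 is listed for completeness but is not collision-resistant.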


File details

Details for the file html_content_extractor-0.0.3-py3-none-any.whl.

File metadata

Download URL: html_content_extractor-0.0.3-py3-none-any.whl
Upload date: Jul 30, 2023
Size: 12.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for html_content_extractor-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9ced23f77bdf0e8d6917415a5a03c505ead700f55cc7bf1e801525fa293fdc84`
MD5	`cc7818d541743861c0c0d663486406e9`
BLAKE2b-256	`736c8039dec5861af227c12468769cd80082566d4c3f1deb99f586ce360cb654`

