Extract summary from HTML text
Excerpt HTML
This distribution provides a single function, excerpt_html, whose purpose is to extract leading portions of HTML text. This is useful, for example, for generating a summary of a blog post from the post body.
excerpt_html(html_text, min_words=50, cut_mark=r'(?i)\s*more\b')
The excerpt_html function expects HTML text as input and returns a shortened version of that HTML text. The truncation point is found in one of two ways:
- If an explicit cut-mark (an HTML comment whose text matches cut_mark) is found, the text will be truncated there.
- If no explicit cut-mark is found, an attempt will be made to find a suitable implicit truncation point. Only points which are not within in-line markup are considered. The text will be truncated at the first such location found which preserves at least min_words (by default, 50) words of text.
In either case, the returned excerpt will always be a syntactically valid HTML fragment.
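The first strategy (cutting at an explicit cut-mark) can be illustrated with a minimal sketch built only on the standard library. This is not the package's implementation: the class CutMarkTruncator and the function truncate_at_cut_mark are hypothetical names introduced here, the sketch assumes well-nested input, and it does not attempt the implicit (min_words) strategy or any whitespace clean-up.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class CutMarkTruncator(HTMLParser):
    """Hypothetical helper: copy HTML through until a cut-mark comment."""

    def __init__(self, cut_mark=r"(?i)\s*more\b"):
        # Keep entity references as written so the output stays valid HTML.
        super().__init__(convert_charrefs=False)
        self.cut_mark = re.compile(cut_mark)
        self.out = []        # pieces of the excerpt
        self.open_tags = []  # stack of currently open elements
        self.cut = False     # becomes True once the cut-mark is seen

    def handle_starttag(self, tag, attrs):
        if self.cut:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        if not self.cut:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.cut:
            return
        self.out.append(f"</{tag}>")
        # Assumes well-nested input: the closing tag matches the top of the stack.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if not self.cut:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.cut:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.cut:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if self.cut:
            return
        if self.cut_mark.match(data):
            # Stop copying and close whatever is still open, innermost first.
            self.cut = True
            self.out.extend(f"</{tag}>" for tag in reversed(self.open_tags))
        else:
            self.out.append(f"<!--{data}-->")


def truncate_at_cut_mark(html_text, cut_mark=r"(?i)\s*more\b"):
    """Return the text before the first cut-mark comment, or None if absent."""
    parser = CutMarkTruncator(cut_mark)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out) if parser.cut else None


excerpt = truncate_at_cut_mark("<p>Hello <b>world</b><!-- more -->rest</p><p>x</p>")
# excerpt == "<p>Hello <b>world</b></p>"
```

Closing the still-open elements after the cut is what keeps the result a well-formed fragment even when the cut-mark sits in the middle of a paragraph.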
Arguments:
- html_text: The input text, a string containing an HTML fragment.
- min_words: When finding a block-level truncation point, retain at least this many words of the original text. Pass None to disable block-level truncation.
- cut_mark: A regular expression used to find an explicit truncation point. It is matched using re.match() against the contents of HTML comments in html_text. This should be either a compiled regular expression or a string, or None to disable cut-mark recognition.
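Because cut_mark is applied with re.match(), it is anchored at the start of the comment text. The short check below applies the documented default pattern to a few sample comment bodies; the sample strings are made up for illustration.

```python
import re

# The documented default value of cut_mark.
cut_mark = re.compile(r"(?i)\s*more\b")

# Hypothetical comment contents, i.e. the text between "<!--" and "-->".
comments = [" more ", "MORE", "more: read on", "morels", "read more"]

# re.match() only succeeds if the pattern matches at the very start.
matches = {text: bool(cut_mark.match(text)) for text in comments}

for text, matched in matches.items():
    print(f"{text!r}: {matched}")
# ' more ': True
# 'MORE': True
# 'more: read on': True
# 'morels': False
# 'read more': False
```

The default therefore accepts leading whitespace and any letter case, but rejects comments where "more" is only the prefix of a longer word or does not come first.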
Returns:
If a truncation point was found, a string containing the excerpt, a syntactically valid HTML fragment, is returned.
If no suitable truncation point was found, None is returned.
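Since None is a possible return value, callers usually need a fallback. One option, sketched here with a hypothetical helper named summary_or_full, is to fall back to the full text. The excerpt function is passed in as an argument so the sketch runs without the package installed; in real use it would be excerpt_html.

```python
def summary_or_full(html_text, excerpt_func, **kwargs):
    """Return the excerpt if one was found, otherwise the full text.

    excerpt_func is expected to behave like excerpt_html: it returns
    a string, or None when no suitable truncation point exists.
    """
    summary = excerpt_func(html_text, **kwargs)
    # Compare against None explicitly rather than relying on truthiness.
    return html_text if summary is None else summary


# Intended usage (assumes the package is installed):
#   from excerpt_html import excerpt_html
#   teaser = summary_or_full(post_body, excerpt_html, min_words=10)
```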
Installation
The package is installable via pip.
pip install excerpt-html
Example
Here are two paragraphs' worth of HTML, with an explicit cut-mark in the middle of the first paragraph.
>>> from excerpt_html import excerpt_html
>>> post_body = '''
... <p>
... In a sense, the subject is interpolated into a neotextual
... narrative that includes culture as a paradox.
... <!-- more -->
... A number of deconceptualisms concerning substructural
... construction exist.
... </p>
... <p>
... However, the subject is contextualised into a postmaterial
... discourse that includes sexuality as a totality. Sontag uses
... the term ‘cultural narrative’ to denote not, in fact,
... deconstruction, but predeconstruction.
... </p>'''
By default, the text will be truncated at the cut mark:
>>> summary = excerpt_html(post_body)
>>> print(summary)
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
</p>
If we disable cut_mark recognition, there is no suitable implicit truncation point which will preserve at least 50 words (the default value of min_words):
>>> summary = excerpt_html(post_body, cut_mark=None)
>>> summary is None
True
If we pass a lower value for min_words, the break between paragraphs will be selected as a truncation point:
>>> summary = excerpt_html(post_body, min_words=10, cut_mark=None)
>>> print(summary) # doctest: +NORMALIZE_WHITESPACE
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
<!-- more -->
A number of deconceptualisms concerning substructural
construction exist.
</p>
Links
Development takes place at GitHub. Releases may be downloaded from PyPI.
Author
Jeff Dairiki <dairiki@dairiki.org>
Changelog
Release 0.2 (2022-09-28)
- Fix deprecation warnings from beautifulsoup4>=4.11.0.
- Test under python 3.10
- Drop support for python 3.6
Release 0.1 (2021-02-05)
No code changes.
Update development status classifier to "stable".
Release 0.1b2 (brown bag) (2020-05-09)
- Remove spurious lektor entry point declaration from setup.cfg.
Release 0.1b1 (2020-05-09)
Initial release.