Extract summary from HTML text
Excerpt HTML
This distribution provides a single function, excerpt_html, whose
purpose is to extract leading portions of HTML text. This is useful,
for example, for generating a summary of a blog post from the post body.
excerpt_html(html_text, min_words=50, cut_mark=r'(?i)\s*more\b')
The excerpt_html function expects HTML text as input and returns
a shortened version of that HTML text. The truncation point is found
in one of two ways:
- If an explicit cut-mark (an HTML comment whose text matches cut_mark) is found, the text will be truncated there.
- If no explicit cut-mark is found, an attempt will be made to find a suitable implicit truncation point. Only points which are not within in-line markup are considered. The text will be truncated at the first such location found which preserves at least min_words (by default, 50) words of text.
In either case, the returned excerpt will always be a syntactically valid HTML fragment.
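The implicit rule can be illustrated with a simplified sketch. This is not the package's implementation: it assumes the input has already been split into a list of block-level HTML strings, strips tags with a crude regular expression, and assumes the end of the input does not count as a truncation point. The names count_words and implicit_excerpt are hypothetical.

```python
import re


def count_words(html_block):
    """Count whitespace-separated words after crudely stripping tags and comments."""
    text = re.sub(r"<[^>]*>", " ", html_block)
    return len(text.split())


def implicit_excerpt(blocks, min_words=50):
    """Return the shortest leading run of blocks holding at least min_words words.

    Assumption: breaks are only considered *between* blocks, so the last
    block is never a cut point. Returns None if no break qualifies.
    """
    words = 0
    for index, block in enumerate(blocks[:-1]):
        words += count_words(block)
        if words >= min_words:
            return "".join(blocks[: index + 1])
    return None


blocks = ["<p>one two three</p>", "<p>four five six seven</p>", "<p>eight</p>"]
print(implicit_excerpt(blocks, min_words=5))
```

Because whole blocks are kept, the result of this sketch never ends inside in-line markup, which mirrors the constraint described above.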
Arguments:
- html_text: The input text, a string containing an HTML fragment.
- min_words: When finding a block-level truncation point, retain at least this many words of the original text. Pass None to disable block-level truncation.
- cut_mark: A regular expression used to find an explicit truncation point. It is matched using re.match() against the contents of HTML comments in html_text. This should be either a compiled regular expression or a string, or None to disable cut-mark recognition.
Returns:
If a truncation point was found, a string containing the excerpt, a syntactically valid HTML fragment, is returned.
If no suitable truncation point was found, None is returned.
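Since cut_mark is applied with re.match(), the pattern is anchored at the start of the comment text. The sketch below uses only the standard re module to show which comment texts the default pattern would accept. The helper is_cut_mark is hypothetical and stands in for whatever check the package performs internally.

```python
import re

# Default pattern, copied from the excerpt_html signature.
DEFAULT_CUT_MARK = r'(?i)\s*more\b'


def is_cut_mark(comment_text, cut_mark=DEFAULT_CUT_MARK):
    """Hypothetical helper: True if the comment text matches cut_mark."""
    if cut_mark is None:
        # None disables cut-mark recognition.
        return False
    # re.match accepts either a pattern string or a compiled pattern.
    return re.match(cut_mark, comment_text) is not None


print(is_cut_mark(" more "))      # leading whitespace is allowed
print(is_cut_mark("MORE"))        # (?i) makes the match case-insensitive
print(is_cut_mark("moreover"))    # \b requires a word boundary after "more"
print(is_cut_mark("read more"))   # re.match only matches at the start
```

Under these rules a comment such as `<!-- more -->` or `<!--More-->` would qualify, while `<!-- read more -->` would not.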
Installation
The package is installable via pip.
pip install excerpt-html
Example
Here are two paragraphs' worth of HTML, with an explicit cut-mark in the middle of the first paragraph.
>>> from excerpt_html import excerpt_html
>>> post_body = '''
... <p>
... In a sense, the subject is interpolated into a neotextual
... narrative that includes culture as a paradox.
... <!-- more -->
... A number of deconceptualisms concerning substructural
... construction exist.
... </p>
... <p>
... However, the subject is contextualised into a postmaterial
... discourse that includes sexuality as a totality. Sontag uses
... the term ‘cultural narrative’ to denote not, in fact,
... deconstruction, but predeconstruction.
... </p>'''
By default, the text will be truncated at the cut mark:
>>> summary = excerpt_html(post_body)
>>> print(summary)
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
</p>
If we disable cut_mark recognition, there is no suitable implicit truncation point which will preserve at least 50 words (the default value of min_words):
>>> summary = excerpt_html(post_body, cut_mark=None)
>>> summary is None
True
If we pass a lower value for min_words, the break between paragraphs will be selected as a truncation point:
>>> summary = excerpt_html(post_body, min_words=10, cut_mark=None)
>>> print(summary) # doctest: +NORMALIZE_WHITESPACE
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
<!-- more -->
A number of deconceptualisms concerning substructural
construction exist.
</p>
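As the second call above shows, excerpt_html returns None when no truncation point is found, so callers need to decide what to show in that case. One possible policy, sketched here as an assumption rather than a recommendation from the package, is to fall back to the full text. The wrapper summary_or_full is hypothetical; it takes the excerpt function as a parameter so the sketch runs without the package installed.

```python
def summary_or_full(html_text, excerpt_func, **kwargs):
    """Return the excerpt, or the full text when no truncation point is found.

    excerpt_func is expected to behave like excerpt_html: it returns a
    string, or None if there is no suitable truncation point.
    """
    summary = excerpt_func(html_text, **kwargs)
    return html_text if summary is None else summary
```

With the package installed, this could be called as summary_or_full(post_body, excerpt_html, min_words=10).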
Links
Development takes place at GitHub. Releases may be downloaded from PyPI.
Author
Jeff Dairiki dairiki@dairiki.org
Changelog
Release 0.2 (2022-09-28)
- Fix deprecation warnings from beautifulsoup4>=4.11.0.
- Test under python 3.10
- Drop support for python 3.6
Release 0.1 (2021-02-05)
No code changes.
Update development status classifier to "stable".
Release 0.1b2 (brown bag) (2020-05-09)
- Remove spurious lektor entry point declaration from setup.cfg.
Release 0.1b1 (2020-05-09)
Initial release.