Extract summary from HTML text
Excerpt HTML
This distribution provides a single function, excerpt_html, whose
purpose is to extract leading portions of HTML text. This is useful,
for example, to generate a summary of a blog post from the post body.
excerpt_html(html_text, min_words=50, cut_mark=r'(?i)\s*more\b')
The excerpt_html function expects HTML text as input and returns
a shortened version of that HTML text. The truncation point is found
in one of two ways:
- If an explicit cut-mark (an HTML comment whose text matches cut_mark) is found, the text will be truncated there.
- If no explicit cut-mark is found, an attempt will be made to find a suitable implicit truncation point. Only points which are not within in-line markup are considered. The text will be truncated at the first such location found which preserves at least min_words (by default, 50) words of text.
In either case, the returned excerpt will always be a syntactically valid HTML fragment.
Arguments:
- html_text: The input text, a string containing an HTML fragment.
- min_words: When finding a block-level truncation point, retain at least this many words of the original text. Pass None to disable block-level truncation.
- cut_mark: A regular expression used to find a truncation point. It is matched using re.match() against the contents of HTML comments in html_text. This should be either a compiled regular expression or a string, or None to disable cut-mark recognition.
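Because cut_mark is applied with re.match(), the pattern is anchored at the start of the comment text. The following is a minimal sketch of that matching rule only, using the standard library; it is not the package's implementation, and the names CommentCollector and matching_comments are hypothetical helpers introduced for illustration.

```python
import re
from html.parser import HTMLParser

# The default value of cut_mark documented above.
DEFAULT_CUT_MARK = r'(?i)\s*more\b'


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment in the input (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # `data` is the comment text without the <!-- and --> delimiters.
        self.comments.append(data)


def matching_comments(html_text, cut_mark=DEFAULT_CUT_MARK):
    """Return the comments whose text matches cut_mark via re.match()."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    # re.match() accepts either a pattern string or a compiled pattern.
    return [c for c in collector.comments if re.match(cut_mark, c)]


sample = (
    "<p>one <!-- more --> two <!--MORE--> three "
    "<!-- read more --> four <!-- moreover --></p>"
)
print(matching_comments(sample))  # [' more ', 'MORE']
```

With the default pattern, a comment such as `<!-- read more -->` is not treated as a match in this sketch, since re.match() only matches at the beginning of the comment text, and `<!-- moreover -->` fails the trailing `\b` word boundary.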
Returns:
If a truncation point was found, a string containing the excerpt, a syntactically valid HTML fragment, is returned.
If no suitable truncation point was found, None is returned.
Installation
The package is installable via pip.
pip install excerpt-html
Example
Here are two paragraphs' worth of HTML, with an explicit cut-mark in the middle of the first paragraph.
>>> from excerpt_html import excerpt_html
>>> post_body = '''
... <p>
... In a sense, the subject is interpolated into a neotextual
... narrative that includes culture as a paradox.
... <!-- more -->
... A number of deconceptualisms concerning substructural
... construction exist.
... </p>
... <p>
... However, the subject is contextualised into a postmaterial
... discourse that includes sexuality as a totality. Sontag uses
... the term ‘cultural narrative’ to denote not, in fact,
... deconstruction, but predeconstruction.
... </p>'''
By default, the text will be truncated at the cut mark:
>>> summary = excerpt_html(post_body)
>>> print(summary)
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
</p>
If we disable cut_mark recognition, there is no suitable implicit truncation point which will preserve at least 50 words (the default value of min_words):
>>> summary = excerpt_html(post_body, cut_mark=None)
>>> summary is None
True
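To see why no implicit truncation point qualifies here, it can help to count the words that precede each paragraph break. The sketch below is an illustration only: it assumes words are counted by splitting text on whitespace and that paragraph ends are the only candidate break points, neither of which is confirmed by the package. ParagraphWordCounter and words_before_each_paragraph_end are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphWordCounter(HTMLParser):
    """Record the running word count at each </p> (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.words_seen = 0
        self.counts_at_breaks = []

    def handle_data(self, data):
        # Assumption: a "word" is any whitespace-separated token.
        # Comment text never reaches handle_data, so it is not counted.
        self.words_seen += len(data.split())

    def handle_endtag(self, tag):
        if tag == "p":
            self.counts_at_breaks.append(self.words_seen)


def words_before_each_paragraph_end(html_text):
    counter = ParagraphWordCounter()
    counter.feed(html_text)
    counter.close()
    return counter.counts_at_breaks


post_body = '''
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
<!-- more -->
A number of deconceptualisms concerning substructural
construction exist.
</p>
<p>
However, the subject is contextualised into a postmaterial
discourse that includes sexuality as a totality. Sontag uses
the term ‘cultural narrative’ to denote not, in fact,
deconstruction, but predeconstruction.
</p>'''

print(words_before_each_paragraph_end(post_body))  # [25, 54]
```

Under these assumptions, the break after the first paragraph keeps only 25 words, which is below the default min_words of 50, and the second count is reached only at the end of the input, where there is nothing left to truncate.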
If we pass a lower value for min_words, the break between paragraphs will be selected as a truncation point:
>>> summary = excerpt_html(post_body, min_words=10, cut_mark=None)
>>> print(summary) # doctest: +NORMALIZE_WHITESPACE
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
<!-- more -->
A number of deconceptualisms concerning substructural
construction exist.
</p>
Links
Development takes place at GitHub. Releases may be downloaded from PyPI.
Author
Jeff Dairiki dairiki@dairiki.org
Changelog
Release 0.1b2 (brown bag)
Released on May 9, 2020.
- Remove spurious lektor entry point declaration from setup.cfg.
Release 0.1b1
Released on May 9, 2020.
Initial release.