Extract summary from HTML text

Excerpt HTML

This distribution provides a single function, excerpt_html, which extracts the leading portion of HTML text. This is useful, for example, for generating a summary of a blog post from the post body.

excerpt_html(html_text, min_words=50, cut_mark=r'(?i)\s*more\b')

The excerpt_html function expects, as input, HTML text, and returns a shortened version of that HTML text. The truncation point is found in one of two ways:

  • If an explicit cut-mark — an HTML comment whose text matches cut_mark — is found, the text will be truncated there.

  • If no explicit cut-mark is found, an attempt will be made to find a suitable implicit truncation point. Only points which are not within in-line markup are considered. The text will be truncated at the first such location found which preserves at least min_words (by default, 50) words of text.

In either case, the returned excerpt will always be a syntactically valid HTML fragment.
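
As a rough illustration of the first case, the sketch below locates an explicit cut-mark using only the standard library's html.parser. It is not the package's implementation; the names CutMarkFinder and find_cut_mark are hypothetical, and the sketch only reports where the first matching comment starts. It does not produce a valid truncated fragment.

```python
import re
from html.parser import HTMLParser


class CutMarkFinder(HTMLParser):
    """Record where the first comment matching cut_mark starts (hypothetical helper)."""

    def __init__(self, cut_mark=r'(?i)\s*more\b'):
        super().__init__()
        self.cut_mark = re.compile(cut_mark)
        self.position = None  # (line, column) of the first matching comment

    def handle_comment(self, data):
        # data is the comment text without the <!-- and --> delimiters.
        if self.position is None and self.cut_mark.match(data):
            self.position = self.getpos()


def find_cut_mark(html_text, cut_mark=r'(?i)\s*more\b'):
    """Return (line, column) of the first cut-mark comment, or None if there is none."""
    finder = CutMarkFinder(cut_mark)
    finder.feed(html_text)
    finder.close()
    return finder.position


print(find_cut_mark("<p>a\n<!-- more -->b</p>"))
print(find_cut_mark("<p><!-- note --></p>"))
```

Cutting the raw string at that position would leave the opening p tag unclosed, which is why the returned excerpt has to be repaired into a valid fragment rather than simply sliced.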

Arguments:

  • html_text: The input text, a string containing an HTML fragment.

  • min_words: When finding a block-level truncation point, retain at least this many words of the original text. Pass None to disable block-level truncation.

  • cut_mark: A regular expression used to find an explicit truncation point. It is matched, using re.match(), against the text of each HTML comment in html_text. Pass either a compiled regular expression or a string, or pass None to disable cut-mark recognition.
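
Because cut_mark is applied with re.match(), the pattern is anchored at the start of the comment text. The following sketch uses only the re module to show which comment texts the default pattern accepts; the helper name matches_cut_mark is hypothetical and is not part of the package.

```python
import re

# The default value of cut_mark, as given in the signature above.
DEFAULT_CUT_MARK = r'(?i)\s*more\b'


def matches_cut_mark(comment_text, cut_mark=DEFAULT_CUT_MARK):
    """Return True if comment_text would match cut_mark under re.match()."""
    # re.match() accepts either a pattern string or a compiled pattern.
    return re.match(cut_mark, comment_text) is not None


for text in [' more ', 'MORE', ' more...', 'read more', ' moreover ']:
    print(repr(text), matches_cut_mark(text))
```

The first three texts match: leading whitespace is allowed and case is ignored. The last two do not, since "read more" does not begin with the mark and "moreover" fails the trailing word boundary.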

Returns:

If a truncation point was found, a string containing the excerpt, a syntactically valid HTML fragment, is returned.

If no suitable truncation point was found, None is returned.
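
Since the result may be None, callers usually need a fallback. One possible pattern, sketched here with a hypothetical helper named summary_or_full, is to fall back to the full text. The excerpt function is passed in as a parameter so that the sketch runs without the package installed.

```python
def summary_or_full(html_text, excerpt):
    """Return excerpt(html_text), or html_text itself when no excerpt is found.

    excerpt is any callable with the same contract as excerpt_html:
    it returns a string, or None if there is no suitable truncation point.
    """
    summary = excerpt(html_text)
    return html_text if summary is None else summary


# With the package installed, usage would look like:
#     from excerpt_html import excerpt_html
#     summary = summary_or_full(post_body, excerpt_html)
```

The explicit "is None" test is a deliberate choice over "summary or html_text": it keeps the fallback tied to the documented None return value rather than to any falsy result.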

Installation

The package is installable via pip.

pip install excerpt-html

Example

Here are two paragraphs' worth of HTML, with an explicit cut-mark in the middle of the first paragraph.

>>> from excerpt_html import excerpt_html

>>> post_body = '''
... <p>
... In a sense, the subject is interpolated into a neotextual
... narrative that includes culture as a paradox.
... <!-- more -->
... A number of deconceptualisms concerning substructural
... construction exist.
... </p>
... <p>
... However, the subject is contextualised into a postmaterial
... discourse that includes sexuality as a totality. Sontag uses
... the term ‘cultural narrative’ to denote not, in fact,
... deconstruction, but predeconstruction.
... </p>'''

By default, the text will be truncated at the cut mark:

>>> summary = excerpt_html(post_body)
>>> print(summary)
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
</p>

If we disable cut_mark recognition, there is no suitable implicit truncation point which will preserve at least 50 words (the default value of min_words):

>>> summary = excerpt_html(post_body, cut_mark=None)

>>> summary is None
True

If we pass a lower value for min_words, the break between paragraphs will be selected as a truncation point:

>>> summary = excerpt_html(post_body, min_words=10, cut_mark=None)

>>> print(summary)          # doctest: +NORMALIZE_WHITESPACE
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
<!-- more -->
A number of deconceptualisms concerning substructural
construction exist.
</p>
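
To see why min_words=50 yields None while min_words=10 does not, it helps to count the words in each paragraph. The sketch below does so with the standard library's html.parser. It assumes that a word is a whitespace-separated token and that the input contains no void elements such as br; the package's own counting rules may differ, and the names TopLevelWordCounter and words_per_block are hypothetical.

```python
from html.parser import HTMLParser

POST_BODY = '''
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
<!-- more -->
A number of deconceptualisms concerning substructural
construction exist.
</p>
<p>
However, the subject is contextualised into a postmaterial
discourse that includes sexuality as a totality. Sontag uses
the term ‘cultural narrative’ to denote not, in fact,
deconstruction, but predeconstruction.
</p>'''


class TopLevelWordCounter(HTMLParser):
    """Count whitespace-separated words inside each top-level element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current element nesting depth
        self.counts = []  # one word count per top-level element
        self._words = 0

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        if self.depth == 0:
            self.counts.append(self._words)
            self._words = 0

    def handle_data(self, data):
        # Text outside any element is ignored; comments never reach here.
        if self.depth > 0:
            self._words += len(data.split())


def words_per_block(html_text):
    counter = TopLevelWordCounter()
    counter.feed(html_text)
    counter.close()
    return counter.counts


print(words_per_block(POST_BODY))  # → [25, 29]
```

Under these assumptions, the only break between blocks has 25 words before it: fewer than the default of 50, but more than 10, which is consistent with the two results shown above.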

Links

Development takes place at GitHub. Releases may be downloaded from PyPI.

Author

Jeff Dairiki <dairiki@dairiki.org>

Changelog

Release 0.2 (2022-09-28)

  • Fix deprecation warnings from beautifulsoup4>=4.11.0.
  • Test under Python 3.10.
  • Drop support for Python 3.6.

Release 0.1 (2021-02-05)

No code changes.

Update development status classifier to "stable".

Release 0.1b2 (brown bag) (2020-05-09)

  • Remove spurious lektor entry point declaration from setup.cfg.

Release 0.1b1 (2020-05-09)

Initial release.
