Extract summary from HTML text

Excerpt HTML

This distribution provides a single function, excerpt_html, whose purpose is to extract the leading portion of HTML text. This is useful, for example, for generating a summary of a blog post from the post body.

excerpt_html(html_text, min_words=50, cut_mark=r'(?i)\s*more\b')

The excerpt_html function expects, as input, HTML text, and returns a shortened version of that HTML text. The truncation point is found in one of two ways:

  • If an explicit cut-mark — an HTML comment whose text matches cut_mark — is found, the text will be truncated there.

  • If no explicit cut-mark is found, an attempt will be made to find a suitable implicit truncation point. Only points which are not within in-line markup are considered. The text will be truncated at the first such location found which preserves at least min_words (by default, 50) words of text.

In either case, the returned excerpt will always be a syntactically valid HTML fragment.

Arguments:

  • html_text: The input text, a string containing an HTML fragment.

  • min_words: When finding a block-level truncation point, retain at least this many words of the original text. Pass None to disable block-level truncation.

  • cut_mark: A regular expression used to find an explicit truncation point. It is matched, using re.match(), against the contents of each HTML comment in html_text. This should be either a compiled regular expression or a string, or None to disable cut-mark recognition.
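
Because the pattern is applied with re.match(), it is anchored at the start of the comment text. The following sketch uses only the standard library to show which comment texts the default pattern would accept. The helper name is_cut_mark is hypothetical and is not part of excerpt_html; it only illustrates the matching rule described above.

```python
import re

# The default value of cut_mark, copied from the function signature.
DEFAULT_CUT_MARK = r'(?i)\s*more\b'


def is_cut_mark(comment_text, cut_mark=DEFAULT_CUT_MARK):
    """Hypothetical helper: does this comment text match the cut-mark pattern?

    re.match() only matches at the beginning of the string, so the word
    "more" must come first (optionally preceded by whitespace).
    """
    return re.match(cut_mark, comment_text) is not None


print(is_cut_mark(" more "))          # True: leading whitespace is allowed
print(is_cut_mark("MORE"))            # True: (?i) makes the match case-insensitive
print(is_cut_mark(" more to come "))  # True: trailing text is ignored
print(is_cut_mark(" read more "))     # False: "more" is not at the start
print(is_cut_mark("moreover"))        # False: \b requires a word boundary
```

A pattern that should match anywhere in the comment would need a leading `.*`, since re.match() never searches past the start of the string.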

Returns:

If a truncation point was found, a string containing the excerpt, a semantically valid HTML fragment, is returned.

If no suitable truncation point was found, None is returned.
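
Since None is a possible return value, callers will usually want a fallback. The sketch below shows one way to handle that. The wrapper name summary_or_full_text is hypothetical, and the excerpt function is passed in as an argument so that the snippet does not depend on the package being installed; in real use one would pass excerpt_html.

```python
def summary_or_full_text(html_text, excerpt_func, **kwargs):
    """Return an excerpt of html_text, or html_text itself if there is none.

    excerpt_func is assumed to behave like excerpt_html: it returns a
    string when a truncation point is found and None otherwise.
    """
    summary = excerpt_func(html_text, **kwargs)
    if summary is None:
        # No cut-mark matched and no suitable implicit truncation point
        # was found, so fall back to the complete text.
        return html_text
    return summary
```

For example, summary_or_full_text(post_body, excerpt_html, min_words=10) would return the excerpt when one exists and the whole of post_body otherwise. The check is written as `is None` rather than a truthiness test so that an empty-string excerpt is not mistaken for a missing one.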

Installation

The package is installable via pip.

pip install excerpt-html

Example

Here are two paragraphs' worth of HTML, with an explicit cut-mark in the middle of the first paragraph.

>>> from excerpt_html import excerpt_html

>>> post_body = '''
... <p>
... In a sense, the subject is interpolated into a neotextual
... narrative that includes culture as a paradox.
... <!-- more -->
... A number of deconceptualisms concerning substructural
... construction exist.
... </p>
... <p>
... However, the subject is contextualised into a postmaterial
... discourse that includes sexuality as a totality. Sontag uses
... the term ‘cultural narrative’ to denote not, in fact,
... deconstruction, but predeconstruction.
... </p>'''

By default, the text will be truncated at the cut-mark:

>>> summary = excerpt_html(post_body)
>>> print(summary)
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
</p>

If we disable cut_mark recognition, there is no suitable implicit truncation point which will preserve at least 50 words (the default value of min_words):

>>> summary = excerpt_html(post_body, cut_mark=None)

>>> summary is None
True
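
A quick word count shows why: the only break between block-level elements comes after the first paragraph, which is well short of 50 words. The sketch below counts whitespace-separated words in the plain text of each paragraph. Counting words this way is an assumption made for illustration; the library's own notion of a word may differ in detail.

```python
# Plain text of the two example paragraphs, with the markup removed by hand.
first_paragraph = (
    "In a sense, the subject is interpolated into a neotextual "
    "narrative that includes culture as a paradox. "
    "A number of deconceptualisms concerning substructural "
    "construction exist."
)
second_paragraph = (
    "However, the subject is contextualised into a postmaterial "
    "discourse that includes sexuality as a totality. Sontag uses "
    "the term \u2018cultural narrative\u2019 to denote not, in fact, "
    "deconstruction, but predeconstruction."
)


def count_words(text):
    """Count whitespace-separated words (an assumed definition of 'word')."""
    return len(text.split())


first_count = count_words(first_paragraph)
second_count = count_words(second_paragraph)

print(first_count)   # 25: fewer than the default min_words of 50
print(second_count)  # 29
```

By this count, the break after the first paragraph preserves only 25 words, so it is rejected under the default min_words=50 but would qualify for any min_words of 25 or less.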

If we pass a lower value for min_words, the break between paragraphs will be selected as a truncation point:

>>> summary = excerpt_html(post_body, min_words=10, cut_mark=None)

>>> print(summary)          # doctest: +NORMALIZE_WHITESPACE
<p>
In a sense, the subject is interpolated into a neotextual
narrative that includes culture as a paradox.
<!-- more -->
A number of deconceptualisms concerning substructural
construction exist.
</p>

Links

Development takes place at GitHub. Releases may be downloaded from PyPI.

Author

Jeff Dairiki <dairiki@dairiki.org>

Changelog

Release 0.2 (2022-09-28)

  • Fix deprecation warnings from beautifulsoup4>=4.11.0.
  • Test under Python 3.10.
  • Drop support for Python 3.6.

Release 0.1 (2021-02-05)

No code changes.

Update development status classifier to "stable".

Release 0.1b2 (brown bag) (2020-05-09)

  • Remove spurious lektor entry point declaration from setup.cfg.

Release 0.1b1 (2020-05-09)

Initial release.
