Extract plain text from HTML

Project description

How is plainhtml different from .xpath("//text()") from lxml or .get_text() from bs4?

  • Text extracted with plainhtml does not contain inline styles, JavaScript, comments, or other text that is not normally visible to users.
  • plainhtml normalizes whitespace, but more carefully than .xpath("normalize-space()"): it adds spaces around inline elements (which are often used as block elements in HTML) and tries to avoid adding extra spaces around punctuation.
  • plainhtml can add newlines (e.g. after headers or paragraphs), so that the output text looks more like how it is rendered in browsers.
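
To make the points above concrete, here is a minimal standard-library sketch of the same ideas: skipping invisible content, collapsing whitespace, and breaking lines at block elements. It is not plainhtml's implementation; the names `VisibleTextParser`, `visible_text`, `INVISIBLE_TAGS`, and `BLOCK_TAGS` are hypothetical, the tag sets are assumptions chosen for illustration, and it emits a single newline between blocks rather than reproducing plainhtml's exact layout.

```python
import re
from html.parser import HTMLParser

# Assumed, illustrative tag sets; a real extractor may use different lists.
INVISIBLE_TAGS = {"script", "style", "head", "noscript"}
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "br"}


class VisibleTextParser(HTMLParser):
    """Collects text that a reader would normally see on the page."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside an invisible element

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")  # line break before a block

    def handle_endtag(self, tag):
        if tag in INVISIBLE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")  # line break after a block

    def handle_data(self, data):
        # Comments never reach this method, so they are dropped for free.
        if self.skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace inside each line, then drop empty lines.
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

Inline elements such as `<b>` are deliberately left out of `BLOCK_TAGS`, so text on either side of them stays on the same line.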

Installation

$ pip install plainhtml

Example

>>> import plainhtml
>>> html = "<html><body><p>foo</p><p>bar</p></body></html>"
>>> plainhtml.extract_text(html)
'foo\n\nbar'
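
For contrast, a layout-unaware extractor that simply concatenates every text node runs the two paragraphs together. The sketch below uses only the standard library; `AllTextParser` and `naive_text` are hypothetical names introduced for illustration and are not part of plainhtml, lxml, or bs4.

```python
from html.parser import HTMLParser


class AllTextParser(HTMLParser):
    """Collects every text node verbatim, with no layout awareness."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def naive_text(html):
    parser = AllTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(naive_text("<html><body><p>foo</p><p>bar</p></body></html>"))  # foobar
```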

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

plainhtml-0.1.2-py3-none-any.whl (5.2 kB)

Uploaded Python 3
