Fast HTML-to-text parser (article readability tool) with Python 3 support

Reason this release was yanked: broken CJK get_title

Project description

python-readability

Given an HTML document, extract and clean up the main body text and title.

This is a Python port of a Ruby port of arc90's Readability project.

Installation

Install with pip:

$ pip install readability-lxml

Alternatively, install with conda:

$ conda install -c conda-forge readability-lxml

Usage

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.content)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
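The same calls work on HTML that is already in memory, without a network request. The sketch below is illustrative only: `html_to_text` and `extract_article` are hypothetical helper names, not part of the library, and the plain-text conversion uses the standard library's `html.parser` rather than anything readability provides. It assumes `summary()` accepts `html_partial=True` to return only the article `<div>`.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes of an HTML fragment, skipping script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes with spaces, then collapse whitespace runs.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(fragment):
    """Hypothetical helper: reduce an HTML fragment to plain text."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return collector.text()


def extract_article(html):
    """Hypothetical helper: return (title, plain_text) for an HTML document."""
    # Imported here so html_to_text stays usable without readability-lxml.
    from readability import Document

    doc = Document(html)
    # html_partial=True: only the article <div>, no <html>/<body> wrapper.
    return doc.title(), html_to_text(doc.summary(html_partial=True))
```

Because text nodes are joined with spaces, inline markup next to punctuation (for example a link followed by a period) ends up with an extra space; a stricter converter would need to treat block and inline tags differently.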

Change Log

  • 0.8.4 Better CJK support, thanks @cdhigh
  • 0.8.3.1 Support for Python 3.8 - 3.13
  • 0.8.3 We can now save all images via keep_all_images=True (default is to save 1 main image), thanks @botlabsDev
  • 0.8.2 Added article author(s) (thanks @mattblaha)
  • 0.8.1 Fixed processing of non-ASCII HTML via regexps.
  • 0.8 Replaced XHTML output with HTML5 output in summary() call.
  • 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.4 Added video loading and allowed more images per paragraph
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is licensed under the Apache License 2.0.

Thanks to

  • Latest readability.js
  • Ruby port by starrhorne and iterationlabs
  • Python port by gfxmonk
  • Decruft effort to move to lxml
  • "BR to P" fix from readability.js which improves quality for smaller texts
  • Contributions from GitHub users.

Download files

Source Distribution

readability-lxml-0.8.4.tar.gz (16.6 kB)

Uploaded Source

Built Distribution

readability_lxml-0.8.4-py3-none-any.whl (19.9 kB)

Uploaded Python 3

File details

Details for the file readability-lxml-0.8.4.tar.gz.

File metadata

  • Download URL: readability-lxml-0.8.4.tar.gz
  • Size: 16.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for readability-lxml-0.8.4.tar.gz
Algorithm    Hash digest
SHA256       fd1b8d1c2e1440546d8ecd336c177f6021bbd00e531de39d86be5103484304d9
MD5          3e20391a3bcc1d20c0215c0c5067ade7
BLAKE2b-256  9a741beba1283d59a562cbcfe1a09662b971ee9cc626ee40c55bdcde7bef66b1
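
A downloaded file can be checked against the SHA256 digest listed for it. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the source distribution has already been saved locally.

```python
import hashlib

# SHA256 digest published for readability-lxml-0.8.4.tar.gz.
EXPECTED_SHA256 = "fd1b8d1c2e1440546d8ecd336c177f6021bbd00e531de39d86be5103484304d9"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected one (case-insensitive)."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_hash("readability-lxml-0.8.4.tar.gz")` should return True only for an unmodified copy of that file; the local path is an assumption.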

File details

Details for the file readability_lxml-0.8.4-py3-none-any.whl.

File hashes

Hashes for readability_lxml-0.8.4-py3-none-any.whl
Algorithm    Hash digest
SHA256       18a64e5fa54a9202dff947e87edbeac7654bfc722632d64ea667698a3000c877
MD5          3b99c023f7ffe79f0ca885bbaed27f45
BLAKE2b-256  4eacad208edff49d3232428ce2215396b3ad453d2d05960e9152a2563e9c2e44
