
Fast HTML-to-text parser (article readability tool) with Python 3 support

Project description


python-readability

Given an HTML document, it extracts the main body text and cleans it up.

This is a Python port of a Ruby port of arc90’s readability project.

Installation

It’s easy to install with pip. Just run:

$ pip install readability-lxml

Usage

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
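As the output above shows, summary() returns cleaned HTML rather than plain text. If plain text is what you need, the tags can be stripped afterwards. The following is a minimal sketch using only the standard library; TextExtractor and html_to_text are hypothetical helper names and are not part of readability. The sample string is a shortened stand-in for a real doc.summary() result.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Join text nodes with spaces, then collapse runs of whitespace.
        # Limitation: a word split by an inline tag ("foo<b>bar</b>")
        # comes out as two words.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Shortened stand-in for the string returned by doc.summary()
sample = (
    '<html><body><div><h1>Example Domain</h1>'
    '<p>More <a href="http://www.iana.org/domains/example">information...</a></p>'
    '</div></body></html>'
)
print(html_to_text(sample))
```

In real use you would call html_to_text(doc.summary()) on the Document created above.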

Change Log

  • 0.8.1 Fixed processing of non-ASCII HTML via regexps.

  • 0.8 Replaced XHTML output with HTML5 output in summary() call.

  • 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.

  • 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).

  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6

  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4

  • 0.4 Added video loading and allowed more images per paragraph

  • 0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is licensed under the Apache License 2.0.

Thanks to

  • Latest readability.js

  • Ruby port by starrhorne and iterationlabs

  • Python port by gfxmonk

  • Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/> to move to lxml

  • “BR to P” fix from readability.js which improves quality for smaller texts

  • Contributions from GitHub users.


Download files


Source Distribution

readability-lxml-0.8.1.tar.gz (15.9 kB)

Uploaded Source

Built Distribution

readability_lxml-0.8.1-py3-none-any.whl (20.7 kB)

Uploaded Python 3

File details

Details for the file readability-lxml-0.8.1.tar.gz.

File metadata

  • Download URL: readability-lxml-0.8.1.tar.gz
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/46.0.0 requests-toolbelt/0.8.0 tqdm/4.47.0 CPython/3.7.7

File hashes

Hashes for readability-lxml-0.8.1.tar.gz:

  • SHA256: e51fea56b5909aaf886d307d48e79e096293255afa567b7d08bca94d25b1a4e1
  • MD5: dd153878f06608bd487f36a29d21cc5a
  • BLAKE2b-256: b9626de3a9a8524c1a1ee0f2aee0dfbad13a36ebbca0db402abcf4e790496512
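
The SHA256 digest listed for the source archive can be checked against a downloaded copy before installing it. This is a minimal sketch using only the standard library; sha256_of and matches_published_digest are hypothetical helper names, and the local file path is an assumption about where you saved the archive.

```python
import hashlib

# SHA256 digest listed above for readability-lxml-0.8.1.tar.gz
EXPECTED_SHA256 = "e51fea56b5909aaf886d307d48e79e096293255afa567b7d08bca94d25b1a4e1"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, matches_published_digest("readability-lxml-0.8.1.tar.gz") should return True for an unmodified download saved under that name in the current directory.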


File details

Details for the file readability_lxml-0.8.1-py3-none-any.whl.

File metadata

  • Download URL: readability_lxml-0.8.1-py3-none-any.whl
  • Size: 20.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/46.0.0 requests-toolbelt/0.8.0 tqdm/4.47.0 CPython/3.7.7

File hashes

Hashes for readability_lxml-0.8.1-py3-none-any.whl:

  • SHA256: e0d366a21b1bd6cca17de71a4e6ea16fcfaa8b0a5b4004e39e2c7eff884e6305
  • MD5: 6a0dc326b843d99346d2afc44d2b4faa
  • BLAKE2b-256: 39a6cfe22aaa19ac69b97d127043a76a5bbcb0ef24f3a0b22793c46608190caa
