Fast HTML-to-text parser (article readability tool) with Python 3 support

Project description

python-readability

Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of arc90’s readability project.

Installation

Install it with pip:

$ pip install readability-lxml

Usage

>>> import requests
>>> from readability import Document
>>>
>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'
>>> doc.summary()
u'<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>'

Change Log

  • 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.

  • 0.7 Improved handling of HTML5 tags. Fixed stripping of unwanted HTML nodes (previously only the first matching node was removed).

  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6

  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4

  • 0.4 Added video loading and allowed more images per paragraph

  • 0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is licensed under the Apache License 2.0.

Thanks to

  • Latest readability.js

  • Ruby port by starrhorne and iterationlabs

  • Python port by gfxmonk

  • Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/> to move to lxml

  • “BR to P” fix from readability.js which improves quality for smaller texts

  • Contributions from GitHub users.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

readability-lxml-0.7.1.tar.gz (15.7 kB)

Uploaded Source

File details

Details for the file readability-lxml-0.7.1.tar.gz.

File metadata

  • Download URL: readability-lxml-0.7.1.tar.gz
  • Upload date:
  • Size: 15.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: Python-urllib/2.7

File hashes

Hashes for readability-lxml-0.7.1.tar.gz
Algorithm Hash digest
SHA256 87cb722e53a4a5749effe37fb1236abc52a856ce71113324d06b25d96b48147b
MD5 d204b53ecf1f2d4d51ceabdf273030aa
BLAKE2b-256 afa78ea52b2d3de4a95c3ed8255077618435546386e35af8969744c0fa82d0d6
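
A downloaded archive can be checked against the SHA256 digest listed above with Python's standard hashlib module. This is a minimal sketch: the helper names sha256_of_file and matches_expected are hypothetical, and the path you pass in is wherever you saved readability-lxml-0.7.1.tar.gz.

```python
import hashlib

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "87cb722e53a4a5749effe37fb1236abc52a856ce71113324d06b25d96b48147b"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected_hex.lower()
```

Calling matches_expected("readability-lxml-0.7.1.tar.gz") on the downloaded archive should return True only if the file is byte-for-byte identical to the published one.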
