
Fast HTML-to-text parser (article readability tool) with Python 3 support


.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
   :target: https://travis-ci.org/buriy/python-readability


python-readability
==================

Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of `arc90's readability
project <http://lab.arc90.com/experiments/readability/>`__.

Installation
------------

Install it with ``pip``:

::

    $ pip install readability-lxml

Usage
-----

::

    >>> import requests
    >>> from readability import Document
    >>>
    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.text)
    >>> doc.title()
    'Example Domain'

Change Log
----------

- 0.7 Improved handling of HTML5 tags. Heuristics changed for many sites:
  fixed an important bug in stripping unwanted HTML nodes (previously only
  the first matching node was removed).
- 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3
  and 3.4
- 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and
  3.4
- 0.4 Added video loading and allowed more images per paragraph
- 0.3 Added ``Document.encoding``, ``positive_keywords`` and
  ``negative_keywords``

Licensing
=========

This code is released under `the Apache License
2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__.

Thanks to
---------

- Latest
  `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
- Ruby port by starrhorne and iterationlabs
- `Python port <https://github.com/gfxmonk/python-readability>`__ by
  gfxmonk
- `Decruft
  effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__
  to move to lxml
- "BR to P" fix from readability.js, which improves quality for smaller
  texts
- Contributions from GitHub users
