fast python port of arc90's readability tool

Project description

This code is under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0

This is a Python port of a Ruby port of arc90’s readability project:

http://lab.arc90.com/experiments/readability/

In short: given an HTML document, it pulls out the main body text and cleans it up. It can also clean up the title, based on the latest readability.js code.
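To make "pulls out the main body text" concrete, here is a toy sketch of one heuristic in that spirit: credit the text length of every paragraph to its parent element and pick the parent with the most text. This is not the library's algorithm (the real code is built on lxml and applies more heuristics); the names ParagraphScorer and best_container are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ParagraphScorer(HTMLParser):
    """Toy scorer (hypothetical): credit each <p>'s text length to its parent."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, label) pairs
        self.scores = {}  # parent label -> total length of paragraph text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        label = "%s#%s" % (tag, dict(attrs).get("id", ""))
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # If this text sits inside a <p>, credit that paragraph's parent.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == "p":
                parent = self.stack[i - 1][1]
                self.scores[parent] = self.scores.get(parent, 0) + len(data.strip())
                break


def best_container(html):
    """Return the label of the element holding the most paragraph text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.scores:
        return None
    return max(scorer.scores, key=scorer.scores.get)


page = ('<html><body><div id="nav"><p>Home</p></div>'
        '<div id="content"><p>A long paragraph of article text.</p>'
        '<p>Another paragraph.</p></div></body></html>')
print(best_container(page))
```

On this sample page the navigation block holds only a few characters of paragraph text, so the "content" block wins.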

Based on:

Installation:

easy_install readability-lxml
or
pip install readability-lxml

Usage:

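from readability.readability import Document
import urllib

url = 'http://pypi.python.org/pypi/readability-lxml'  # any page URL
html = urllib.urlopen(url).read()  # Python 2 call; Python 3 has urllib.request.urlopen
doc = Document(html)  # parse once and reuse for both calls
readable_article = doc.summary()
readable_title = doc.short_title()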

Command-line usage:

python -m readability.readability -u http://pypi.python.org/pypi/readability-lxml

Example using positive/negative keywords:

python -m readability.readability -p intro -n newsindex,homepage-box,news-section -u http://python.org

Document() kwarg options:

  • attributes:

  • debug: output debug messages

  • min_text_length:

  • retry_length:

  • url: the URL of the document; when given, links can be adjusted to be absolute

  • positive_keywords: a list of positive search patterns matched against classes and ids, for example: ["news-item", "block"]

  • negative_keywords: a list of negative search patterns matched against classes and ids, for example: ["mysidebar", "related", "ads"]
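The keyword options above can be pictured with a small sketch: the patterns are combined into one regular expression, and an element's class/id string gains or loses weight depending on which patterns it matches. This is an illustration under assumptions, not the library's code: the helper names compile_keywords and keyword_weight are hypothetical, and the bonus of 25 is an assumed value chosen for the example.

```python
import re


def compile_keywords(keywords):
    """Combine a list (or comma-separated string) of keywords into one regex.

    Hypothetical helper for illustration; returns None when nothing is given.
    """
    if not keywords:
        return None
    if isinstance(keywords, str):
        keywords = keywords.split(",")
    parts = [k.strip() for k in keywords if k.strip()]
    if not parts:
        return None
    return re.compile("|".join(re.escape(p) for p in parts))


def keyword_weight(class_and_id, positive=None, negative=None, bonus=25):
    """Score a class/id string: +bonus on a positive hit, -bonus on a negative hit.

    The bonus value is an assumption made for this sketch.
    """
    weight = 0
    if positive is not None and positive.search(class_and_id):
        weight += bonus
    if negative is not None and negative.search(class_and_id):
        weight -= bonus
    return weight


positive = compile_keywords(["news-item", "block"])
negative = compile_keywords("mysidebar,related,ads")  # same form as the -n flag
print(keyword_weight("news-item featured", positive, negative))
print(keyword_weight("related ads", positive, negative))
```

Note that plain substring matching is loose: a pattern such as "ads" also matches inside longer class names, so specific keywords tend to work better.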

Updates

  • 0.2.5 Update setup.py for uploading .tar.gz to pypi

  • 0.2.6 Don’t crash on documents with no title

  • 0.2.6.1 Document.short_title() properly works

  • 0.3 Added Document.encoding, positive_keywords and negative_keywords


Download files

Source Distribution

readability-lxml-0.3.0.1.tar.gz (12.2 kB)

Built Distribution

readability_lxml-0.3.0.1-py2.7.egg (25.2 kB)

Hashes for readability-lxml-0.3.0.1.tar.gz

Algorithm    Hash digest
SHA256       83ec8c63bd17155083a5a3d3b5b012032eb0234bee5a91842ec13e15697a2bc8
MD5          9961bd647804f4ec707e420d9907c446
BLAKE2b-256  09dfa7e4f7744d5becf6b13dd6cfca768c6dcd5beae4d49d0fd452c921a1887f

Hashes for readability_lxml-0.3.0.1-py2.7.egg

Algorithm    Hash digest
SHA256       99ce6cf83e06ac8ec1382210ee48c5117bff0a8765ba59b2e3a1318a03dc42a8
MD5          f3f688654866608530ec71526f57779d
BLAKE2b-256  0cf043200a64a4e0fc1a8c4ecadcaad9b3ab211aa268aeca2b46b9e4af9660af
