Fast Python port of arc90's Readability tool

Project description

This code is under the Apache License 2.0. http://www.apache.org/licenses/LICENSE-2.0

This is a Python port of a Ruby port of arc90’s Readability project

http://lab.arc90.com/experiments/readability/

In short: given an HTML document, it pulls out the main body text and cleans it up. It can also clean up the title, based on the latest readability.js code.

Based on:

Installation:

easy_install readability-lxml
or
pip install readability-lxml

Usage:

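from readability.readability import Document
import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead

url = 'http://pypi.python.org/pypi/readability-lxml'  # any page you want to extract
html = urllib.urlopen(url).read()
doc = Document(html)  # parse once and reuse for both calls
readable_article = doc.summary()    # cleaned-up main body
readable_title = doc.short_title()  # cleaned-up title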

Command-line usage:

python -m readability.readability -u http://pypi.python.org/pypi/readability-lxml

Using positive/negative keywords example:

python -m readability.readability -p intro -n newsindex,homepage-box,news-section -u http://python.org
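To make the keyword options more concrete, here is a minimal sketch of how comma-separated keywords like the ones above could be matched against an element's class and id values. It is not the library's own code: compile_keywords, keyword_weight, and the step of 25 are hypothetical names and values chosen only for illustration.

```python
import re


def compile_keywords(keywords):
    """Turn 'a,b,c' or ['a', 'b', 'c'] into one case-insensitive regex.

    Hypothetical helper for this sketch; returns None when nothing is given.
    """
    if not keywords:
        return None
    if isinstance(keywords, str):
        keywords = keywords.split(",")
    parts = [re.escape(k.strip()) for k in keywords if k.strip()]
    if not parts:
        return None
    return re.compile("|".join(parts), re.IGNORECASE)


def keyword_weight(attrs, positive=None, negative=None, step=25):
    """Score an element from its class and id values.

    The step of 25 is an assumption made for this sketch.
    """
    weight = 0
    for value in (attrs.get("class"), attrs.get("id")):
        if not value:
            continue
        if positive is not None and positive.search(value):
            weight += step
        if negative is not None and negative.search(value):
            weight -= step
    return weight


# Same keywords as in the command line above.
positive = compile_keywords("intro")
negative = compile_keywords("newsindex,homepage-box,news-section")

print(keyword_weight({"id": "intro"}, positive, negative))            # 25
print(keyword_weight({"class": "news-section"}, positive, negative))  # -25
print(keyword_weight({"class": "footer"}, positive, negative))        # 0
```

Elements whose class or id matches a positive keyword score higher, and those matching a negative keyword score lower. How the real scores feed into the extraction is up to the library.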

Document() kwarg options:

  • attributes:

  • debug: output debug messages

  • min_text_length:

  • retry_length:

  • url: the URL of the document, which allows links to be adjusted to absolute ones

  • positive_keywords: a list of positive search patterns matched against classes and ids, for example: ["news-item", "block"]

  • negative_keywords: a list of negative search patterns matched against classes and ids, for example: ["mysidebar", "related", "ads"]
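As a rough illustration of what "adjusting links to be absolute" means for the url option, the standard library's urljoin resolves a relative href against a page URL. This is only a sketch of the idea under that assumption, not a description of how Document() does it internally. The helper absolutize and the sample paths are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper)."""
    return [urljoin(base_url, href) for href in hrefs]


links = absolutize(
    "http://python.org/news/index.html",
    ["/about/", "item1.html", "http://pypi.python.org/pypi"],
)
for link in links:
    print(link)
# http://python.org/about/
# http://python.org/news/item1.html
# http://pypi.python.org/pypi
```

Links that are already absolute pass through unchanged, so resolving every link against the page URL is safe.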

Updates

  • 0.2.5 Update setup.py for uploading .tar.gz to pypi

  • 0.2.6 Don’t crash on documents with no title

  • 0.2.6.1 Document.short_title() now works properly

  • 0.3 Added Document.encoding, positive_keywords and negative_keywords

Download files

Source Distribution

readability-lxml-0.3.tar.gz (12.3 kB)

Uploaded Source

Built Distribution

readability_lxml-0.3-py2.7.egg (25.4 kB)

Uploaded Egg

File details

Details for the file readability-lxml-0.3.tar.gz.

File hashes

Hashes for readability-lxml-0.3.tar.gz
Algorithm    Hash digest
SHA256       20dbbe0613092519b8a58b1c932f36fb154884053df80ed208ad9b67dae1d00f
MD5          31eb639ea43f46e32f7a1a94f7377c6e
BLAKE2b-256  b7c8d80abc0f80a495a1f674ca091afe5fa0f414221afbf3967155f5604b7a34

File details

Details for the file readability_lxml-0.3-py2.7.egg.

File hashes

Hashes for readability_lxml-0.3-py2.7.egg
Algorithm    Hash digest
SHA256       94eb1211a1beee5821bfbb7d78345a3b0dfdcf5c145216c82d3d10104c2ff23c
MD5          eddfc596b9196cb8e4fa970dbb2f5e67
BLAKE2b-256  0ea1f2e871c2994c01ab357ea5eb7a123f2555466fd3968c90a4bda70c3a05c9
