A featherweight tool for extracting an article's text from HTML documents.

Project description

A Python package for extracting article text from HTML documents. Check out this demo.

What people have been saying

You should write a paper on this work - /u/queue_cumber

This is neat-o. A short and sweet project… - /u/CandyCorns_

From a quick glance this looks super elegant! Very neat idea! - /u/worldsayshi

At a Glance

To install:

pip install eatiht
...
easy_install eatiht

Note: On Windows, you may need to install lxml manually using: pip install lxml

Using in Python

import eatiht

url = 'http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html'

print(eatiht.extract(url))
Output
NASA's Curiosity rover is continuing to help scientists piece together the mystery of how Mars lost its
surface water over the course of billions of years. The rover drilled into a piece of Martian rock called
Cumberland and found some ancient water hidden within it. Researchers were then able to test a key ratio
in the water with Curiosity's onboard instruments...

Using as a command line tool:

eatiht http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html >> out.txt

Note: Windows users may have to add the C:directory to their “path” so that the command-line tool works from any directory, not only the ..directory.

Requirements

requests
lxml

Motivation

After searching through the deepest crevices of the internet for some tool|library|module that could effectively extract the main content from a website (ignoring text from ads, sidebar links, etc.), I was slightly disheartened by the apparent ambiguity caused by this content-extraction problem.

My survey resulted in some of the following solutions:

The number of research papers I found on the subject far outweighs the number of available open-source projects. This is my attempt at balancing out the disparity.

In the process of coming up with a solution, I made two unoriginal observations:

  1. XPath’s select all (//), parent node (..) queries and functions (‘string-length’) are remarkably powerful when used together
  2. Unnecessary machine learning is unnecessary

By assuming a minimum sentence length (a trivial assumption), one can query for the text nodes that satisfy it, build a frequency distribution (histogram) across their parent nodes, and take the argmax of that distribution: the XPath shared by the most likely sentences.
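
The idea can be sketched in a few lines with lxml (already a requirement). This is a minimal illustration, not eatiht's actual source: the function name `extract_main_text`, the `SENTENCE_PARENTS` query, the 20-character threshold, and the sample page are all assumptions made for the example. In particular, tallying the element one level above each sentence's parent is just one way to make sibling paragraphs share a path.

```python
from collections import Counter

from lxml import html

# XPath pieces named in the text: "//" (select all), string-length(), and
# ".." (step from each matching text node up to its parent element).
SENTENCE_PARENTS = (
    "//body//text()[not(parent::script or parent::style)]"
    "[string-length(normalize-space(.)) > %d]/.."
)


def extract_main_text(page_source, min_length=20):
    """Sketch: return text from the container holding the most long text nodes.

    min_length is an assumed sentence-length threshold, not a value taken
    from eatiht itself.
    """
    root = html.document_fromstring(page_source)
    tree = root.getroottree()

    # Elements that directly contain at least one sentence-length text node.
    # They all sit at or below <body>, so each one has a parent element.
    candidates = root.xpath(SENTENCE_PARENTS % min_length)

    # Histogram: path of each candidate's enclosing element -> candidate count.
    histogram = Counter(tree.getpath(el.getparent()) for el in candidates)
    if not histogram:
        return ""

    # argmax: the path shared by the largest number of likely sentences.
    best_path = histogram.most_common(1)[0][0]

    paragraphs = [
        " ".join(el.text_content().split())  # collapse runs of whitespace
        for el in candidates
        if tree.getpath(el.getparent()) == best_path
    ]
    return "\n".join(paragraphs)


# Hypothetical page: a nav bar, a two-paragraph article, and a footer.
SAMPLE = """
<html><body>
  <div id="nav"><a href="/">Home</a><a href="/about">About</a></div>
  <div id="article">
    <p>The rover drilled into a piece of Martian rock and found ancient water.</p>
    <p>Researchers were then able to test a key ratio in that water sample.</p>
  </div>
  <div id="footer"><p>Copyright notice that is also fairly long text here.</p></div>
</body></html>
"""

print(extract_main_text(SAMPLE))
```

On the sample page, the article container collects two sentence-length nodes and the footer only one, so the two article paragraphs are printed and the footer and navigation links are dropped.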

The results were surprisingly good. I personally prefer this approach to the others as it seems to lie somewhere in between the purely rule-based and the drowning-in-ML approaches.

Issues or Contact

Please raise any issues or yell at me at rodrigopala91@gmail.com or [@rodricios](https://twitter.com/rodricios)

TODO:

  • ~~Add newline and tab options for printing.~~ Please check out the demo for the new default output (sorry, no options for formatting as of yet).

Download files

eatiht-0.0.10.zip (9.1 kB), source distribution, uploaded Dec 21, 2014
