eatiht

A featherweight tool for extracting an article's text from HTML documents.

These details have not been verified by PyPI

Project links

Homepage

Project description

A Python package for extracting article text from HTML documents. Check out this demo.

12/26/14 Update

New algorithm, please skip to eatiht’s usage for details.

Please refer to the issues for notes on possible bugs, improvements, etc.

Check out eatiht’s new website where I walk through each step in the original algorithm! It’s virtually pain-free. New writeup will be coming soon!

What people have been saying

You should write a paper on this work - /u/queue_cumber

This is neat-o. A short and sweet project… - /u/CandyCorns_

From a quick glance this looks super elegant! Very neat idea! - /u/worldsayshi

At a Glance

To install:

pip install eatiht
...
easy_install eatiht

Note: On Windows, you may need to install lxml manually using: pip install lxml

Using in Python

Currently, there are two new submodules:

eatiht_v2.py
etv2.py

eatiht_v2 is functionally identical to the original eatiht:

import eatiht_v2 as eatiht

url = 'http://www.washingtonpost.com/blogs/the-switch/wp/2014/12/26/elon-musk-the-new-tesla-roadster-can-travel-some-400-miles-on-a-single-charge/'

print(eatiht.extract(url))

Output:

Car nerds, you just got an extra present under the tree.

Tesla announced Friday an upgrade for its Roadster, the electric car company’s convertible model, and said that the new features significantly boost its range -- beyond what many traditional cars can get on a tank of gasoline.

eatiht_v2 contains one extra function that runs the same extraction algorithm but, along with the text, returns the structures that were used to calculate it (e.g. the histogram and the list of xpaths):

results = eatiht.extract_more(url)

results[0]      # extracted text
results[1]      # frequency distribution (histogram)
results[2]      # subtrees (list of textnodes pre-filter)
results[3]      # pruned subtrees
results[4]      # list of paragraphs (as separated in original website)

Whether this little extra function looks messy is up for debate. I think it does, and it is hard to remember which index leads to what.
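
Short of new classes, one lightweight way to name those positions is a namedtuple. The sketch below is not part of eatiht: the type name, the helper, and the field names are hypothetical, chosen from the index comments above, and it assumes extract_more returns at least those five items in that order.

```python
from collections import namedtuple

# Hypothetical field names taken from the index comments above;
# they are not part of eatiht's API.
ExtractionResult = namedtuple(
    "ExtractionResult",
    ["text", "histogram", "subtrees", "pruned_subtrees", "paragraphs"],
)


def name_results(results):
    """Wrap the first five items of an extract_more-style sequence in named fields."""
    return ExtractionResult(*results[:5])


# Stand-in data with the same shape as the output described above.
fake_results = ("Car nerds...", [("/html/body/div", 2)], [], [], ["Car nerds..."])
named = name_results(fake_results)
print(named.text)
print(named.histogram)
```

With the real package this would be called as name_results(eatiht.extract_more(url)), under the same assumption about the return value.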

So, to properly encapsulate those structures, there are new classes that make accessing those properties simpler:

import etv2

url = "..."

tree = etv2.extract(url)

print(tree.fulltext)

Output:

Car nerds, you just got an extra present under the tree.

Tesla announced Friday an upgrade for its Roadster, the electric car company’s...

There are currently no public methods, only the structures present in the output of extract_more:

print(tree.histogram)

Output:

[('/html/body/div[2]/div[5]/div[1]/div[1]/div/article', 8),
 ('/html/body/div[2]/div[5]/div[1]/div[6]/div/div[2]/div[2]/div[6]', 1),
 ('/html/body/div[2]/div[5]/div[2]/div[2]/div/ul/li[3]/a', 1),
 ...]
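
Because the histogram is a list of (xpath, count) pairs, it is easy to check how dominant the leading path is. The sketch below uses only the three entries printed above (the elided entries are left out, so the share describes this truncated sample only). The helper name top_path_share is hypothetical.

```python
# The three entries shown in the output above; the elided "..." entries
# are not included, so the result only describes this truncated sample.
histogram = [
    ('/html/body/div[2]/div[5]/div[1]/div[1]/div/article', 8),
    ('/html/body/div[2]/div[5]/div[1]/div[6]/div/div[2]/div[2]/div[6]', 1),
    ('/html/body/div[2]/div[5]/div[2]/div[2]/div/ul/li[3]/a', 1),
]


def top_path_share(histogram):
    """Return (xpath, share of all counts) for the most frequent path."""
    total = sum(count for _, count in histogram)
    path, count = max(histogram, key=lambda item: item[1])
    return path, float(count) / total


path, share = top_path_share(histogram)
print(path)
print(share)
```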

Please refer to eatiht_trees.py for more info on what properties are available.

A feature that should be on its way is the ability to get not only the extracted text but also the original, immediately surrounding HTML. This may help with keeping a persistent look. This is a top priority.

And of course, there is the original:

# from initial release
import eatiht

url = 'http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html'

print(eatiht.extract(url))

Output:

NASA's Curiosity rover is continuing to help scientists piece together the mystery of how Mars lost its surface water over the course of billions of years. The rover drilled into a piece of Martian rock called Cumberland and found some ancient water hidden within it...

Using as a command line tool:

eatiht http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html >> out.txt

Note: Windows users may have to add the C:directory to their “path” so that the command line tool works from any directory, not only the ..directory.

Requirements

requests
lxml

Motivation

After searching through the deepest crevices of the internet for some tool|library|module that could effectively extract the main content from a website (ignoring text from ads, sidebar links, etc.), I was slightly disheartened by the apparent ambiguity caused by this content-extraction problem.

My survey resulted in some of the following solutions:

boilerpipe - Boilerplate Removal and Fulltext Extraction from HTML pages. Java library written by Christian Kohlschütter
“The Easy Way to Extract Useful Text from Arbitrary HTML” - a Python tutorial on implementing a neural network for html content extraction. Written by alexjc
Pyteaser’s Cleaners module - from what I can tell, it’s a purely heuristic-based process
“Text Extraction from the Web via Text-to-Tag Ratio” - a thesis on Text-to-Tag-heuristic driven clustering as a solution for the problem at hand. Written by Tim Weninger & William H. Hsu

The number of research papers I found on the subject far outweighs the number of available open-source projects. This is my attempt at balancing out the disparity.

In the process of coming up with a solution, I made two unoriginal observations:

XPath’s select all (//), parent node (..) queries and functions (‘string-length’) are remarkably powerful when used together
Unnecessary machine learning is unnecessary

By making a (trivial) assumption about sentence length, one can query for text nodes that satisfy it, build a frequency distribution (histogram) over their parent nodes, and take the argmax of that distribution: the xpath shared by the most likely sentences.
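
The counting step can be sketched with the standard library alone. This is a rough illustration, not eatiht's code. It is written for Python 3 and replaces the XPath query (something like //text()[string-length(normalize-space()) > 20]/..) with a small html.parser subclass. The 20-character threshold, the class and function names, the choice to count each text node under the element that contains its parent, and the assumption of reasonably well-formed HTML are all mine.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class SentenceParentCounter(HTMLParser):
    """Count long text nodes per container path (hypothetical helper)."""

    def __init__(self, min_len=20):
        super().__init__()
        self.min_len = min_len
        self.stack = []                  # (tag, indexed step, child tag counts)
        self.root_children = Counter()
        self.histogram = Counter()

    def handle_starttag(self, tag, attrs):
        siblings = self.stack[-1][2] if self.stack else self.root_children
        siblings[tag] += 1
        if tag not in VOID_TAGS:
            step = "%s[%d]" % (tag, siblings[tag])
            self.stack.append((tag, step, Counter()))

    def handle_endtag(self, tag):
        open_tags = [entry[0] for entry in self.stack]
        if tag in open_tags:
            # Drop everything from the innermost matching open tag onward.
            last = len(open_tags) - 1 - open_tags[::-1].index(tag)
            del self.stack[last:]

    def handle_data(self, data):
        text = " ".join(data.split())    # similar to XPath normalize-space()
        open_tags = [entry[0] for entry in self.stack]
        if len(text) <= self.min_len or SKIP_TAGS & set(open_tags):
            return
        # Count under the element that contains the text node's parent.
        container = self.stack[:-1]
        self.histogram["/" + "/".join(step for _, step, _ in container)] += 1


def parent_histogram(html_text, min_len=20):
    """Return (path, count) pairs, most frequent first."""
    parser = SentenceParentCounter(min_len)
    parser.feed(html_text)
    parser.close()
    return parser.histogram.most_common()


SAMPLE = """
<html><body>
<div><ul><li><a href="#">Home</a></li>
<li><a href="#">A fairly long sidebar link title here</a></li></ul></div>
<div><article>
<p>Car nerds, you just got an extra present under the tree.</p>
<p>Tesla announced Friday an upgrade for its Roadster.</p>
<p>Short.</p>
</article></div>
</body></html>
"""

print(parent_histogram(SAMPLE)[0])
```

The first pair is the argmax: the container holding the most sentence-length text nodes. Unlike the histogram shown earlier, this sketch writes an index on every path step.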

The results were surprisingly good. I personally prefer this approach to the others as it seems to lie somewhere in between the purely rule-based and the drowning-in-ML approaches.

Issues or Contact

Please raise any issues or yell at me at rodrigopala91@gmail.com or [@rodricios](https://twitter.com/rodricios)

Tests

The tests are currently sparse, but please still run them to check that modifications to eatiht.py and eatiht_v2.py work properly.

python setup.py test

TODO:

HTML-and-text extraction
etv2.py tests

Project details

Release history

0.1.14, Mar 28, 2015
0.1.13, Mar 13, 2015
0.1.12, Dec 31, 2014
0.1.11, Dec 28, 2014
0.1.1, Dec 27, 2014
0.1.0, Dec 27, 2014 (this version)
0.0.10, Dec 21, 2014
0.0.9, Dec 19, 2014
0.0.8, Dec 19, 2014
0.0.7, Dec 19, 2014
0.0.6, Dec 18, 2014
0.0.5, Dec 18, 2014
0.0.4, Dec 18, 2014
0.0.3, Dec 18, 2014

Download files

Source Distribution

eatiht-0.1.0.zip (135.1 kB view details)

Uploaded Dec 27, 2014 Source

File details

Details for the file eatiht-0.1.0.zip.

File metadata

Download URL: eatiht-0.1.0.zip
Upload date: Dec 27, 2014
Size: 135.1 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for eatiht-0.1.0.zip
Algorithm	Hash digest
SHA256	`9008e899137aec2f7be5de5397322322b4707220d3dc8a0456311c18accdde05`
MD5	`353da498e7f19da4d3d60bed152fe48a`
BLAKE2b-256	`36b562883675689e2313f45b0eed6a3477da8b30b2466634515e8a7accfd4000`
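
A downloaded archive can be checked against the SHA256 digest listed above with the standard hashlib module. This is a minimal sketch; the function names and the chunk size are my own choices, and it assumes the file has already been downloaded locally.

```python
import hashlib

# SHA256 digest published above for eatiht-0.1.0.zip.
EXPECTED_SHA256 = "9008e899137aec2f7be5de5397322322b4707220d3dc8a0456311c18accdde05"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, matches_published_hash("eatiht-0.1.0.zip") should return True for an intact copy saved in the current directory.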
