# Webarticle2Text - Extracts the main article text from a webpage.

[![](https://img.shields.io/pypi/v/webarticle2text.svg)](https://pypi.python.org/pypi/webarticle2text) [![Build Status](https://img.shields.io/travis/chrisspen/webarticle2text.svg?branch=master)](https://travis-ci.org/chrisspen/webarticle2text) [![](https://pyup.io/repos/github/chrisspen/webarticle2text/shield.svg)](https://pyup.io/repos/github/chrisspen/webarticle2text)

## Overview

This project is obsolete and now serves only as a reference. I recommend using [newspaper](https://github.com/codelucas/newspaper) instead, which is an order of magnitude more accurate than any other article extraction library I’ve encountered.

Please see compare.csv for a performance comparison of several similar tools.

This tool attempts to locate and extract the largest cluster of text in a webpage. It does this by walking the DOM tree, identifying all text segments and their depth inside the DOM, appending all text found at roughly the same depth, and then returning the chunk with the largest total length.

This approach usually works well with typical news sites where one news article is displayed per URL. It usually fails with URLs displaying multiple news blurbs (e.g. news aggregators).
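The idea described above can be illustrated with a minimal sketch. This is not the package's actual implementation: it uses the standard library `html.parser` module instead of tidylib, groups text by exact depth rather than "roughly the same depth", and the names `DepthTextCollector` and `extract_largest_cluster` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags whose text content is never article text.
SKIPPED_TAGS = {"script", "style"}
# Void elements have no closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DepthTextCollector(HTMLParser):
    """Collects text segments keyed by their nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.text_by_depth = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.text_by_depth[self.depth].append(text)


def extract_largest_cluster(html):
    """Return the joined text of the depth level with the most characters."""
    parser = DepthTextCollector()
    parser.feed(html)
    parser.close()
    if not parser.text_by_depth:
        return ""
    chunks = (" ".join(segments) for segments in parser.text_by_depth.values())
    return max(chunks, key=len)
```

Because this sketch keys on exact depth, text inside inline tags such as `<a>` or `<b>` lands in a different bucket than the surrounding paragraph. A more faithful version would merge neighbouring depths, which is presumably what "roughly the same depth" refers to.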

## Installation

You may need to install the tidylib system package, which you can get on Ubuntu 12.04 using:

    sudo apt-get install libtidy-0.99-0

or on Fedora using:

    sudo yum install libtidy

Then install the package using pip:

    pip install webarticle2text

## Usage

You can invoke the script either as a Python module:

    from webarticle2text import webarticle2text
    print(webarticle2text.extractFromURL("http://some/arbitrary/url"))

or as a standalone command line script:

    webarticle2text.py http://some/arbitrary/url

Note, to use it from the command line, you’ll need to ensure it has execute permission and is located in your PATH. On most platforms, this should automatically be done by setup.py.
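If the command is not found after installation, a quick check like the following can help. It is only a sketch using the standard library's `shutil.which`, which returns a path only for files that are both on PATH and executable. The helper name `is_on_path` is hypothetical and not part of this package.

```python
import shutil


def is_on_path(command):
    """Return True if `command` resolves to an executable file on PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    name = "webarticle2text.py"
    if is_on_path(name):
        print(f"{name} found at {shutil.which(name)}")
    else:
        print(f"{name} is not on PATH or is not executable")
```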

## Development

Tests require the Python development headers, which you can install on Ubuntu with:

    sudo apt-get install python-dev python3-dev python3.4-dev

To run unit tests across multiple Python versions, install:

    sudo apt-get install python3.4-minimal python3.4-dev python3.5-minimal python3.5-dev

To run all [tests](http://tox.readthedocs.org/en/latest/):

    export TESTNAME=; tox

To run tests for a specific environment (e.g. Python 2.7):

    export TESTNAME=; tox -e py27

To run a specific test:

    export TESTNAME=.test_extract; tox -e py27

## History

1.0.0 (2008.9.16) Initial public release.
1.2.0 (2011.1.3) Update to support Unicode.
1.2.2 (2011.12.17) Cleaned up installation procedure and documentation and moved to github.com.
1.2.3 (2011.12.21) Fixed encoding error when redirecting stdout, e.g. `webarticle2text.py http://some/arbitrary/url > output.txt`.
1.2.5 (2012.11.5) Added the option to specify user-agent header to use when requesting URLs.
2.0.0 (2014.4.20) Added support for Python 3.2.

## Release history

| Version | Released |
| --- | --- |
| 3.0.2 (this version) | Jun 28, 2017 |
| 2.0.3 | Nov 14, 2014 |
| 2.0.2 | Sep 9, 2014 |
| 2.0.1 | Jul 28, 2014 |
| 1.2.5 | May 22, 2013 |
| 1.2.4 | Sep 30, 2012 |

## Download files

Source distribution: webarticle2text-3.0.2.tar.gz (658.8 kB), uploaded Jun 28, 2017. It was not uploaded using Trusted Publishing.

Hashes for webarticle2text-3.0.2.tar.gz:

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `032bdb1f53c8558006c44fb0ec23349056aa69697ac371394b16a8b7cfb381ad` |
| MD5 | `1d8e5b8615751d862a7a77cbb96969a3` |
| BLAKE2b-256 | `fa0bc970f74c22879fc40ab87ef8fef38b0c166962e6031d82cf3a8e997dca44` |
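One way to use the published digest is to compare it against a locally computed SHA256 of the downloaded archive. The sketch below assumes the file has already been saved locally. The helper names `sha256_of_file` and `matches_expected` are hypothetical, and only the standard library's `hashlib` is used.

```python
import hashlib

# SHA256 digest published for webarticle2text-3.0.2.tar.gz.
EXPECTED_SHA256 = "032bdb1f53c8558006c44fb0ec23349056aa69697ac371394b16a8b7cfb381ad"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("webarticle2text-3.0.2.tar.gz")` should return `True` only for an unmodified copy of the archive. pip can perform a similar check itself when requirements are pinned with `--hash` entries.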
