
Heuristic-based boilerplate removal tool

Project description

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. You can try it online.
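To make the idea of "text containing full sentences" concrete, here is a toy sketch of one signal a heuristic of this kind can use: running prose contains many common function words (stopwords), while navigation bars and footers contain few. This is not jusText's actual algorithm. The stoplist, the thresholds, and the names stopword_density and looks_like_content are assumptions made up for this illustration, and the sketch targets Python 3.

```python
import re

# Tiny illustrative stoplist (an assumption for this sketch; real stoplists are much longer).
STOPWORDS = frozenset({
    "the", "a", "an", "and", "of", "to", "in", "is",
    "it", "that", "for", "on", "with", "as", "was",
})


def stopword_density(text):
    """Return the fraction of words in `text` that are stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in STOPWORDS for word in words) / len(words)


def looks_like_content(text, min_length=40, min_density=0.25):
    """Guess whether a block is running prose rather than boilerplate.

    Both thresholds are arbitrary values chosen for the example.
    """
    return len(text) >= min_length and stopword_density(text) >= min_density


blocks = [
    "Home | About | Contact | Login",
    "The tool is designed to keep the text of an article "
    "and to drop the navigation that surrounds it.",
]
for block in blocks:
    print(looks_like_content(block), block)
```

The short, stopword-free menu line is rejected and the full sentence is kept. A real tool would combine several such signals rather than rely on one.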

This is a fork of the original (currently unmaintained) jusText code hosted on Google Code. Below are some alternatives that I found:

Installation

Make sure you have Python 2.6+/3.2+ and pip (Windows, Linux) installed. The preferred way to install is:

$ [sudo] pip install justext

Or, for the latest development version:

$ [sudo] pip install git+https://github.com/miso-belica/jusText.git

Or if you have to:

$ wget https://github.com/miso-belica/jusText/archive/master.zip # download the sources
$ unzip master.zip # extract the downloaded file
$ cd jusText-master/ # enter the extracted directory
$ [sudo] python setup.py install # install the package

Dependencies

lxml>=2.2.4

Usage

$ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ python -m justext -s English -o plain_text.txt english_page.html
$ python -m justext --help # for more info

Python API

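import requests
import justext

# Download the page and classify its paragraphs using the English stoplist.
response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

# Print only the paragraphs that were not classified as boilerplate.
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)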
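If you need the cleaned text as a single string rather than printed output, the same loop can be wrapped in a small helper. The sketch below uses only the calls shown above (justext.justext, justext.get_stoplist) and the text and is_boilerplate attributes. The function names main_text and extract_main_text are hypothetical, and the stand-in paragraph objects exist only so the example runs without network access or jusText installed.

```python
from types import SimpleNamespace


def main_text(paragraphs, separator="\n\n"):
    """Join the text of every paragraph that is not flagged as boilerplate."""
    return separator.join(p.text for p in paragraphs if not p.is_boilerplate)


def extract_main_text(html, language="English"):
    """Run jusText on raw HTML and return the retained text as one string."""
    import justext  # imported lazily so main_text() is usable without the package

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return main_text(paragraphs)


# Stand-in objects that mimic the two paragraph attributes used above.
sample = [
    SimpleNamespace(text="Home | About", is_boilerplate=True),
    SimpleNamespace(text="First real sentence.", is_boilerplate=False),
    SimpleNamespace(text="Second real sentence.", is_boilerplate=False),
]
print(main_text(sample))
```

With jusText installed, extract_main_text(response.content) would do the same for a downloaded page.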

Testing

Run tests via

$ nosetests-2.6 && nosetests-3.2 && nosetests-2.7 && nosetests-3.3

Acknowledgements

This software was developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to the PhD research of Jan Pomikálek.

