
Python port of Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

Project description

ripit (forked from BoilerPy3, a beautiful library)

Updates

The original BoilerPy3 was no longer maintained, so I forked it to add some features and changes. The license is unchanged.

About

BoilerPy3 is a native Python port of Christian Kohlschütter's Boilerpipe library, released under the Apache 2.0 Licence.

This package is based on sammyer's BoilerPy, specifically mercuree's Python3-compatible fork. This fork updates the codebase to be more Pythonic (proper attribute access, docstrings, type-hinting, snake case, etc.) and to make use of Python 3.6 features (f-strings), in addition to switching testing frameworks from Unittest to PyTest.

Note: This package is based on Boilerpipe 1.2 (at or before this commit), as that's when the code was originally ported to Python. I experimented with updating the code to match Boilerpipe 1.3; however, because it performed worse in my tests, I ultimately decided to leave it at the 1.2 equivalent.

Installation

To install the latest version from PyPI, execute:

pip install ripit
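
To confirm that the install succeeded, one option is to ask the standard library which version is present. This is a minimal sketch; installed_version is a hypothetical helper name and is not part of ripit.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if ripit is installed, otherwise None.
print(installed_version("ripit"))
```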

Usage

The top-level interfaces are the Extractors. Use the get_content() methods to extract the filtered text.

from ripit import extractors

extractor = extractors.ArticleExtractor()

# From a URL
content = extractor.get_content_from_url('http://www.example.com/')

# From a file
content = extractor.get_content_from_file('tests/test.html')

# From raw HTML
content = extractor.get_content('<html><body><h1>Example</h1></body></html>')
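
When the input may be a URL, a file path, or raw HTML, a small dispatcher can pick among the three methods shown above. This is a sketch under assumptions: extract_any is a hypothetical helper (not part of ripit), and the rule for telling the three input kinds apart is a simple guess that may need tightening for real data.

```python
import os


def extract_any(extractor, source: str) -> str:
    """Route `source` to the matching get_content* method of `extractor`.

    `extractor` is any object exposing the three methods used in the
    usage example above.
    """
    # Assumption: anything starting with an HTTP(S) scheme is a URL.
    if source.startswith(("http://", "https://")):
        return extractor.get_content_from_url(source)
    # Assumption: a string naming an existing file is a path, not markup.
    if os.path.isfile(source):
        return extractor.get_content_from_file(source)
    # Otherwise treat the string as raw HTML.
    return extractor.get_content(source)


# Usage, assuming ripit is installed:
#   from ripit import extractors
#   text = extract_any(extractors.ArticleExtractor(), 'tests/test.html')
```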

Alternatively, use get_doc() to return a Boilerpipe document from which you can get more detailed information.

from ripit import extractors

extractor = extractors.ArticleExtractor()

doc = extractor.get_doc_from_url('http://www.example.com/')
content = doc.content
title = doc.title
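
As an illustration of working with the returned document, the sketch below condenses the two attributes shown above (title and content) into a small summary. summarize_doc is a hypothetical helper, and the stand-in object only mimics those two attributes so the example runs without network access.

```python
from types import SimpleNamespace


def summarize_doc(doc) -> dict:
    """Build a small summary from an object exposing `title` and `content`."""
    # Guard against missing values; whether the library can return None
    # here is an assumption, so treat it defensively.
    title = doc.title or ""
    content = doc.content or ""
    return {
        "title": title.strip(),
        "num_words": len(content.split()),
        "preview": content[:80],
    }


# Stand-in for the object returned by get_doc_from_url().
fake_doc = SimpleNamespace(
    title=" Example Domain ",
    content="This domain is for use in illustrative examples.",
)
print(summarize_doc(fake_doc))
```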

Extractors

DefaultExtractor

A quite generic full-text extractor. Usually worse than ArticleExtractor, but simpler, with no heuristics.

ArticleExtractor

A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. Works very well for most types of Article-like HTML.

ArticleSentencesExtractor

A full-text extractor which is tuned towards extracting sentences from news articles.

LargestContentExtractor

A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than the DefaultExtractor, but usually worse than ArticleExtractor.

CanolaExtractor

A full-text extractor trained on krdwrd Canola. Works well with SimpleEstimator, too.

KeepEverythingExtractor

Dummy extractor which marks everything as content. Should return the input text. Use this to double-check that your problem is within a particular Extractor or somewhere else.
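
The double-check described above can be sketched as a comparison between a baseline extractor and the one under suspicion. diagnose and its return labels are hypothetical names chosen for this example; the only API assumed is the get_content() method shown in the Usage section.

```python
def diagnose(html: str, extractor, baseline) -> str:
    """Compare a filtering extractor against a keep-everything baseline."""
    kept = baseline.get_content(html).strip()
    filtered = extractor.get_content(html).strip()
    if not kept:
        # Even the baseline produced nothing: the problem lies in the
        # input or in parsing, not in the chosen extractor.
        return "no-text-parsed"
    if not filtered:
        # Text was there, but the extractor classified all of it as boilerplate.
        return "extractor-dropped-everything"
    return "ok"


# Usage, assuming ripit is installed:
#   from ripit import extractors
#   status = diagnose(html,
#                     extractors.ArticleExtractor(),
#                     extractors.KeepEverythingExtractor())
```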

NumWordsRulesExtractor

A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
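
To give a feel for a rule based on the word counts of the current, previous, and next block, here is a toy classifier. It is not the rule set used by Boilerpipe or ripit: the function name, the thresholds, and the decision logic are all made up for illustration.

```python
from typing import List


def classify_blocks(word_counts: List[int], min_words: int = 15) -> List[bool]:
    """Toy rule: mark each block as content (True) or boilerplate (False).

    A block counts as content if it is long on its own, or if it is
    moderately long and sits next to a long block. Thresholds are
    hypothetical.
    """
    labels = []
    for i, current in enumerate(word_counts):
        # Blocks at the edges have no neighbour; treat that as zero words.
        previous = word_counts[i - 1] if i > 0 else 0
        following = word_counts[i + 1] if i + 1 < len(word_counts) else 0
        long_neighbour = previous >= min_words or following >= min_words
        is_content = current >= min_words or (
            long_neighbour and current >= min_words // 3
        )
        labels.append(is_content)
    return labels


# A short heading, a paragraph, a caption, a stray link, another paragraph.
print(classify_blocks([3, 40, 8, 2, 25]))
```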
