Python port of Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

BoilerPy3

About

BoilerPy3 is a native Python port of Christian Kohlschütter's Boilerpipe library, released under the Apache 2.0 Licence.

This package is based on sammyer's BoilerPy, specifically mercuree's Python3-compatible fork. This fork updates the codebase to be more Pythonic (proper attribute access, docstrings, type-hinting, snake case, etc.) and to make use of Python 3.6 features (f-strings), in addition to switching testing frameworks from Unittest to PyTest.

Note: This package is based on Boilerpipe 1.2 (at or before this commit), as that's when the code was originally ported to Python. I experimented with updating the code to match Boilerpipe 1.3; however, because it performed worse in my tests, I ultimately decided to leave it at the 1.2 equivalent.

Installation

To install the latest version from PyPI, execute:

pip install boilerpy3

If you'd like to try out any unreleased features you can install directly from GitHub like so:

pip install git+https://github.com/jmriebold/BoilerPy

Usage

The top-level interfaces are the Extractors. Use the get_content() methods to extract the filtered text.

from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()

# From a URL
content = extractor.get_content_from_url('http://www.example.com/')

# From a file
content = extractor.get_content_from_file('tests/test.html')

# From raw HTML
content = extractor.get_content('<html><body><h1>Example</h1></body></html>')
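
For batch jobs, the file-based call can be applied to a whole directory of saved pages. The following is a minimal sketch rather than part of BoilerPy3: `extract_directory` is a hypothetical helper, it accepts any object that exposes `get_content_from_file()` (such as the extractor above), and it assumes the saved pages use the `.html` extension.

```python
from pathlib import Path


def extract_directory(extractor, directory):
    """Return {file name: extracted text} for every .html file in `directory`.

    Hypothetical helper, not part of BoilerPy3. `extractor` is any object
    with a get_content_from_file(path) method, for example an
    extractors.ArticleExtractor() instance.
    """
    results = {}
    # Sort the paths so the result order is stable across runs.
    for path in sorted(Path(directory).glob("*.html")):
        results[path.name] = extractor.get_content_from_file(str(path))
    return results
```

With BoilerPy3 installed, a call such as `extract_directory(extractors.ArticleExtractor(), 'tests')` should process every HTML file directly under `tests/`.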

Alternatively, use get_doc() to return a Boilerpipe document from which you can get more detailed information.

from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()

doc = extractor.get_doc_from_url('http://www.example.com/')
content = doc.content
title = doc.title

Extractors

DefaultExtractor

A quite generic full-text extractor. Usually worse than ArticleExtractor, but simpler, as it uses no heuristics.

ArticleExtractor

A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. Works very well for most types of article-like HTML.

ArticleSentencesExtractor

A full-text extractor which is tuned towards extracting sentences from news articles.

LargestContentExtractor

A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than DefaultExtractor, but usually worse than ArticleExtractor.

CanolaExtractor

A full-text extractor trained on krdwrd Canola. Works well with SimpleEstimator, too.

KeepEverythingExtractor

Dummy extractor which marks everything as content. Should return the input text. Use this to check whether your problem lies within a particular extractor or somewhere else.
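
One way to apply this check is to run the same HTML through several extractors and compare how much text each returns. The sketch below is an illustration only: `compare_extractors` is a hypothetical helper, not a BoilerPy3 function, and it relies solely on each extractor having the `get_content()` method described under Usage.

```python
def compare_extractors(html, extractors_by_name):
    """Run the same HTML through several extractors and report output sizes.

    Hypothetical helper, not part of BoilerPy3. Each value in
    `extractors_by_name` is any object with a get_content(html) method.
    Returns {name: character count, or an error string if the call failed}.
    """
    report = {}
    for name, extractor in extractors_by_name.items():
        try:
            report[name] = len(extractor.get_content(html))
        except Exception as exc:
            # Record the failure instead of aborting the whole comparison.
            report[name] = f"error: {exc}"
    return report
```

With BoilerPy3 installed, you could pass something like `{'keep_everything': extractors.KeepEverythingExtractor(), 'article': extractors.ArticleExtractor()}`. If even the KeepEverythingExtractor entry is empty or fails, the problem is probably in the input or in parsing rather than in a specific extractor's filtering.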

NumWordsRulesExtractor

A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
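
To make the idea of word-count rules concrete, here is a toy sketch that labels a block using only its own word count and those of its neighbours. It is not the rule set used by NumWordsRulesExtractor: the function name, the thresholds, and the decision rule are all assumptions chosen for illustration.

```python
def classify_blocks(blocks, min_words=15, min_neighbour_words=10):
    """Label each text block as content (True) or boilerplate (False).

    Toy rule with made-up thresholds, not BoilerPy3's actual logic:
    a block counts as content if it is long on its own, or if both the
    previous and the next block are reasonably long.
    """
    counts = [len(block.split()) for block in blocks]
    labels = []
    for i, current in enumerate(counts):
        # Missing neighbours at the edges are treated as empty blocks.
        previous = counts[i - 1] if i > 0 else 0
        following = counts[i + 1] if i + 1 < len(counts) else 0
        is_content = current >= min_words or (
            previous >= min_neighbour_words and following >= min_neighbour_words
        )
        labels.append(is_content)
    return labels
```

In this sketch a short caption sandwiched between two long paragraphs is kept, while a short navigation line at the top of the page is dropped.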
