Python port of Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages
This package is based on sammyer's BoilerPy, specifically mercuree's Python3-compatible fork. This fork updates the codebase to be more Pythonic (proper attribute access, docstrings, type-hinting, snake case, etc.) and makes use of Python 3.6 features (f-strings), in addition to switching testing frameworks from Unittest to PyTest.
Note: This package is based on Boilerpipe 1.2 (at or before this commit), as that's when the code was originally ported to Python. I experimented with updating the code to match Boilerpipe 1.3, but because it performed worse in my tests, I ultimately decided to leave it at the 1.2 equivalent.
To install the latest version from PyPI, execute:
pip install boilerpy3
If you'd like to try out any unreleased features you can install directly from GitHub like so:
pip install git+https://github.com/jmriebold/BoilerPy
The top-level interfaces are the Extractors. Use the get_content() methods to extract the filtered text.
from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()

# From a URL
content = extractor.get_content_from_url('http://www.example.com/')

# From a file
content = extractor.get_content_from_file('tests/test.html')

# From raw HTML
content = extractor.get_content('<html><body><h1>Example</h1></body></html>')
Alternatively, use the get_doc() methods to return a Boilerpipe document from which you can get more detailed information.
from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()

doc = extractor.get_doc_from_url('http://www.example.com/')
content = doc.content
title = doc.title
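As a minimal sketch of working with the document object, the helper below collects the title, the content, and a word count into one record. The names `PageSummary` and `summarize_html` are hypothetical and not part of boilerpy3; the sketch only assumes the extractor exposes `get_doc()` for raw HTML and that the returned document has the `title` and `content` attributes shown above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PageSummary:
    """Hypothetical container for a few fields of an extracted document."""
    title: Optional[str]
    content: str
    word_count: int


def summarize_html(extractor: Any, html: str) -> PageSummary:
    """Run extractor.get_doc() on raw HTML and gather title, content, and word count.

    `extractor` is assumed to be any object with a get_doc(html) method whose
    result has `title` and `content` attributes.
    """
    doc = extractor.get_doc(html)
    content = doc.content or ""  # treat a missing body as empty text
    return PageSummary(
        title=doc.title,
        content=content,
        word_count=len(content.split()),
    )
```

With boilerpy3 installed, this could be called as `summarize_html(extractors.ArticleExtractor(), html)`.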
DefaultExtractor: A quite generic full-text extractor. Usually worse than ArticleExtractor, but simpler, with no heuristics.
ArticleExtractor: A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. Works very well for most types of article-like HTML.
ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting sentences from news articles.
LargestContentExtractor: A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than DefaultExtractor, but usually worse than ArticleExtractor.
KeepEverythingExtractor: Dummy extractor which marks everything as content. Should return the input text. Use this to check whether a problem lies within a particular extractor or somewhere else.
NumWordsRulesExtractor: A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
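To illustrate the debugging use of the extractor that keeps everything, here is a rough sketch that measures how much of a page's text a given extractor retains relative to a baseline. The function name `retained_fraction` is hypothetical and not part of boilerpy3; it only assumes both objects provide the `get_content()` method for raw HTML described above.

```python
from typing import Any


def retained_fraction(html: str, extractor: Any, baseline: Any) -> float:
    """Return the share of baseline words that `extractor` keeps for the same HTML.

    Both arguments are assumed to be objects with a get_content(html) method
    returning a string. A value near 0.0 suggests the extractor discarded
    almost everything; if the baseline itself is empty, the problem is
    probably upstream of the extractor (for example, in the input HTML).
    """
    kept_words = len(extractor.get_content(html).split())
    total_words = len(baseline.get_content(html).split())
    if total_words == 0:
        return 0.0  # nothing to compare against
    return kept_words / total_words
```

With boilerpy3 installed, one might call `retained_fraction(html, extractors.ArticleExtractor(), extractors.KeepEverythingExtractor())` and compare the result across several extractors.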