Python port of Boilerpipe, for HTML boilerplate removal and text extraction

These details have not been verified by PyPI

Project links

Homepage

Project description

BoilerPy3

About

BoilerPy3 is a native Python port of Christian Kohlschütter's Boilerpipe library, released under the Apache 2.0 Licence.

This package is based on sammyer's BoilerPy, specifically mercuree's Python3-compatible fork. This fork updates the codebase to be more Pythonic (proper attribute access, docstrings, type-hinting, snake case, etc.) and to make use of Python 3.6 features (f-strings), in addition to switching testing frameworks from Unittest to PyTest.

Note: This package is based on Boilerpipe 1.2 (at or before this commit), as that's when the code was originally ported to Python. I experimented with updating the code to match Boilerpipe 1.3, but because it performed worse in my tests, I ultimately decided to leave it at the 1.2 equivalent.

Installation

To install the latest version from PyPI, execute:

pip install boilerpy3

If you'd like to try out any unreleased features you can install directly from GitHub like so:

pip install git+https://github.com/jmriebold/BoilerPy3

Usage

Text Extraction

The top-level interfaces are the Extractors. Use the get_content() methods to extract the filtered text.

from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()

# From a URL
content = extractor.get_content_from_url('http://example.com/')

# From a file
content = extractor.get_content_from_file('tests/test.html')

# From raw HTML
content = extractor.get_content('<html><body><h1>Example</h1></body></html>')

Marked HTML Extraction

To extract the HTML chunks containing filtered text, use the get_marked_html() methods.

from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()

# From a URL
content = extractor.get_marked_html_from_url('http://example.com/')

# From a file
content = extractor.get_marked_html_from_file('tests/test.html')

# From raw HTML
content = extractor.get_marked_html('<html><body><h1>Example</h1></body></html>')

Other

Alternatively, use get_doc() to return a Boilerpipe document from which you can get more detailed information.

from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()

doc = extractor.get_doc_from_url('http://example.com/')
content = doc.content
title = doc.title

Extractors

All extractors have a raise_on_failure parameter (defaults to True). When set to False, the Extractor will handle exceptions raised during text extraction and return any text that was successfully extracted. Leaving this at the default setting may be useful if you want to fall back to another algorithm in the event of an error.
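
The fallback pattern described above might look like the following sketch. extract_with_fallback and default_chain are hypothetical helpers, not part of BoilerPy3. The sketch assumes only that each extractor exposes get_content() and raises on failure (the default raise_on_failure=True behaviour), and that DefaultExtractor can be constructed without arguments in the same way as ArticleExtractor. It catches Exception broadly because the specific exception types are not listed here.

```python
from typing import Any, Iterable, Optional


def extract_with_fallback(html: str, candidates: Iterable[Any]) -> Optional[str]:
    """Try each extractor in order; return the first non-empty result.

    Hypothetical helper: each candidate only needs a get_content(html) method.
    """
    for extractor in candidates:
        try:
            content = extractor.get_content(html)
        except Exception:
            # With raise_on_failure=True the extractor raises on error,
            # so move on to the next algorithm in the chain.
            continue
        if content and content.strip():
            return content
    # Every extractor failed or returned only whitespace.
    return None


def default_chain() -> list:
    """One possible chain: ArticleExtractor first, then DefaultExtractor."""
    # Imported here so the helper above has no hard dependency on BoilerPy3.
    from boilerpy3 import extractors

    return [extractors.ArticleExtractor(), extractors.DefaultExtractor()]
```

A call such as extract_with_fallback(html, default_chain()) would then return text from the first extractor that succeeds, or None if none do.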

DefaultExtractor

A fairly generic full-text extractor. Usually performs worse than ArticleExtractor, but is simpler and uses no heuristics.

ArticleExtractor

A full-text extractor tuned towards news articles, where it achieves higher accuracy than DefaultExtractor. Works very well for most types of article-like HTML.

ArticleSentencesExtractor

A full-text extractor which is tuned towards extracting sentences from news articles.

LargestContentExtractor

A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than DefaultExtractor, but usually worse than ArticleExtractor.

CanolaExtractor

A full-text extractor trained on krdwrd Canola. Works well with SimpleEstimator, too.

KeepEverythingExtractor

Dummy extractor which marks everything as content, so it should return the input text. Use this to check whether a problem lies within a particular Extractor or somewhere else.

NumWordsRulesExtractor

A fairly generic full-text extractor based solely on the number of words per block (the current, previous, and next blocks).
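
To compare the extractors listed above on a given page, a small harness along these lines may help. compare_extractors and boilerpy3_candidates are hypothetical helpers, not part of BoilerPy3, and the sketch assumes each extractor class can be constructed without arguments in the same way as ArticleExtractor. KeepEverythingExtractor is included as a baseline: if its output is already wrong, the problem is likely outside the individual extractors.

```python
from typing import Any, Dict, Mapping, Optional


def compare_extractors(html: str, candidates: Mapping[str, Any]) -> Dict[str, Optional[int]]:
    """Return the extracted character count per extractor (None on failure).

    Hypothetical helper: each candidate only needs a get_content(html) method.
    """
    counts: Dict[str, Optional[int]] = {}
    for name, extractor in candidates.items():
        try:
            counts[name] = len(extractor.get_content(html))
        except Exception:
            # Record the failure instead of aborting the whole comparison.
            counts[name] = None
    return counts


def boilerpy3_candidates() -> Dict[str, Any]:
    """A baseline plus two extractors to compare against it."""
    # Imported here so the helper above has no hard dependency on BoilerPy3.
    from boilerpy3 import extractors

    return {
        "KeepEverythingExtractor": extractors.KeepEverythingExtractor(),
        "ArticleExtractor": extractors.ArticleExtractor(),
        "DefaultExtractor": extractors.DefaultExtractor(),
    }
```

Calling compare_extractors(html, boilerpy3_candidates()) gives a quick view of how much text each algorithm keeps for the same input.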

Notes

Getting Content from URLs

While BoilerPy3 provides extractor.*_from_url() methods as a convenience, these are intended for testing only. For more robust functionality and full control over the request itself, it is strongly recommended to fetch the page with the Requests package and pass the resulting HTML to extractor.get_content().

import requests
from boilerpy3 import extractors

extractor = extractors.ArticleExtractor()

# Make request to URL
resp = requests.get('http://example.com/')

# Pass HTML to Extractor
content = extractor.get_content(resp.text)
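
A slightly more defensive variant of the snippet above adds a timeout and a status check before extraction. The helper names fetch_html and extract_from_url, the 10-second timeout, and the optional session argument are illustrative assumptions rather than part of BoilerPy3; Session.get(), raise_for_status(), and resp.text are standard Requests calls.

```python
from typing import Any, Optional


def fetch_html(url: str, session: Optional[Any] = None, timeout: float = 10.0) -> str:
    """Download a page and return its decoded HTML (hypothetical helper)."""
    if session is None:
        # Imported lazily so a pre-configured or stub session can be passed in.
        import requests

        session = requests.Session()
    # Without an explicit timeout, Requests may wait indefinitely.
    resp = session.get(url, timeout=timeout)
    # Raise on 4xx/5xx responses instead of extracting text from an error page.
    resp.raise_for_status()
    return resp.text


def extract_from_url(url: str) -> str:
    """Fetch a URL with Requests, then hand the HTML to BoilerPy3."""
    from boilerpy3 import extractors

    extractor = extractors.ArticleExtractor()
    return extractor.get_content(fetch_html(url))
```

Passing your own requests.Session to fetch_html() is one way to control headers, retries, or proxies for the request.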

Project details

Release history

Version	Released
1.0.7 (this version)	Nov 1, 2023
1.0.6	Feb 22, 2022
1.0.5	Aug 2, 2021
1.0.4	Feb 4, 2021
1.0.3	Nov 21, 2020
1.0.2	Dec 22, 2019
1.0.1	Aug 9, 2019

Download files

Source Distribution

boilerpy3-1.0.7.tar.gz (22.2 kB)

Uploaded Nov 1, 2023 Source

Built Distribution

boilerpy3-1.0.7-py3-none-any.whl (23.0 kB)

Uploaded Nov 1, 2023 Python 3

File details

Details for the file boilerpy3-1.0.7.tar.gz.

File metadata

Download URL: boilerpy3-1.0.7.tar.gz
Upload date: Nov 1, 2023
Size: 22.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for boilerpy3-1.0.7.tar.gz
Algorithm	Hash digest
SHA256	`a9fede212f80a36dbc7d4f93e35d8636911cb6b37085a3230557d16ad0f076c8`
MD5	`79c99a46fc9d20fd6837ac27877890a7`
BLAKE2b-256	`a9782ff130662bc53491a5c517673dfe4e5999a44bc46bf372f24a5a71a0e8ca`

File details

Details for the file boilerpy3-1.0.7-py3-none-any.whl.

File metadata

Download URL: boilerpy3-1.0.7-py3-none-any.whl
Upload date: Nov 1, 2023
Size: 23.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for boilerpy3-1.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fbfba91745606965400204d26852283ddf90235ab30afe9904de20051556a523`
MD5	`aff487fd8fac501a8d0154df4883693f`
BLAKE2b-256	`d9b1e376edbdc1f1755fdb6cb1f6173b2a7afa8a6d766f7d10e34e7db0c18510`
