Extract the main article content (and optionally comments) from a web page

ExtractNet

Based on the popular content extraction package Dragnet, ExtractNet extends the machine learning approach to extract other attributes, such as date, author, and keywords, from news articles.

demo code

Example code:

Install the latest released version with pip:

pip install extractnet

To extract the content and other metadata, pass the page's raw HTML to the extractor:

import requests
from extractnet import Extractor

raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text
results = Extractor().extract(raw_html)

Why not just use an existing rule-based extraction method?

We found that some webpages don't provide the real author name and simply populate the author tag with a default value.

For example, ltn.com.tw and udn.com always populate the same author value for every news article, while the real author can only be found within the content.
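
The failure mode described above can be illustrated with a minimal sketch. It assumes a typical hand-written rule that reads the `<meta name="author">` tag; the two pages, the author names, and the helper names (`MetaAuthorParser`, `rule_based_author`) are made up for illustration and are not taken from ltn.com.tw, udn.com, or ExtractNet itself.

```python
from html.parser import HTMLParser


class MetaAuthorParser(HTMLParser):
    """Collects the content of every <meta name="author"> tag."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "author":
            self.authors.append(attrs.get("content", ""))


def rule_based_author(html):
    """Hypothetical rule: trust the first author meta tag, if any."""
    parser = MetaAuthorParser()
    parser.feed(html)
    return parser.authors[0] if parser.authors else None


# Two hypothetical articles from the same site: the meta tag is a site-wide
# default, while the real byline only appears in the visible body text.
PAGE_A = """<html><head><meta name="author" content="Example Daily"></head>
<body><p>By Alice Chen, staff reporter</p><p>First story.</p></body></html>"""
PAGE_B = """<html><head><meta name="author" content="Example Daily"></head>
<body><p>By Bob Lin, staff reporter</p><p>Second story.</p></body></html>"""

print(rule_based_author(PAGE_A))
print(rule_based_author(PAGE_B))
```

Both calls return the same default value, so the rule never sees the bylines that a reader would find in the body.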

Our machine-learning-first approach extracts the correct fields the way a human reading the website would.

ExtractNet uses a machine learning approach to extract this data from the visible sections of the webpage, just like a human.

ExtractNet pipeline

What ExtractNet is and isn't

ExtractNet is a platform for extracting any interesting attribute from any webpage, not just content-based articles.
The core of ExtractNet aims to convert unstructured webpages into structured data without relying on hand-crafted rules.
ExtractNet does not support boilerplate content extraction.
ExtractNet allows users to add custom pipelines that return additional data through a list of callback functions.

Performance

Results of the body extraction evaluation:

We use the same body extraction benchmark from article-extraction-benchmark

Model	Precision	Recall	F1	Accuracy	Open Source
AutoExtract	0.984 ± 0.003	0.956 ± 0.010	0.970 ± 0.005	0.470 ± 0.037	✗
Diffbot	0.958 ± 0.009	0.944 ± 0.013	0.951 ± 0.010	0.348 ± 0.035	✗
ExtractNet	0.922 ± 0.011	0.933 ± 0.013	0.927 ± 0.010	0.160 ± 0.027	✔
boilerpipe	0.850 ± 0.016	0.870 ± 0.020	0.860 ± 0.016	0.006 ± 0.006	✔
dragnet	0.925 ± 0.012	0.889 ± 0.018	0.907 ± 0.014	0.221 ± 0.030	✔
html-text	0.500 ± 0.017	0.994 ± 0.001	0.665 ± 0.015	0.000 ± 0.000	✔
newspaper	0.917 ± 0.013	0.906 ± 0.017	0.912 ± 0.014	0.260 ± 0.032	✔
readability	0.913 ± 0.014	0.931 ± 0.015	0.922 ± 0.013	0.315 ± 0.034	✔
trafilatura	0.930 ± 0.010	0.967 ± 0.009	0.948 ± 0.008	0.243 ± 0.031	✔
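
As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values of a few rows. The helper name `f1_score` is illustrative, and this is not the benchmark's own evaluation code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "ExtractNet": (0.922, 0.933),
    "dragnet": (0.925, 0.889),
    "trafilatura": (0.930, 0.967),
}

for model, (p, r) in rows.items():
    print(f"{model}: F1 = {f1_score(p, r):.3f}")
```

For these three rows the recomputed values agree with the reported F1 column to three decimals.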

Results of author name extraction:

Model	F1
ExtractNet : fasttext embeddings + CRF	0.904 ± 0.10

List of changes from Dragnet

The underlying classifier for all attribute extraction is now CatBoost instead of a decision tree, for consistency and a performance boost.
Updated the CSS features and added a text+CSS latent feature.
Includes a CRF model that extracts names from the author block text.
Trained on 22,000+ recent webpages collected in late 2020, about 20 times the amount of data used for Dragnet.

GETTING STARTED

Installing and extraction

pip install extractnet

import requests
from extractnet import Extractor

raw_html = requests.get('https://apnews.com/article/6e58b5742b36e3de53298cf73fbfdf48').text
results = Extractor().extract(raw_html)
for key, value in results.items():
    print(key)
    print(value)
    print('------------')

Callbacks

ExtractNet also supports callback functions that inject additional features during the extraction process.

A quick look at the usage: each callback can access the raw HTML string provided during the extraction process. This lets users add additional information, such as language detection, to the final results.

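import re

from extractnet import Extractor

# Meta callbacks receive the raw HTML string and return a dict of extra fields.
def meta_pre1(raw_html):
    return {'first_value': 0}

def meta_pre2(raw_html):
    return {'first_value': 1, 'second_value': 2}

# Post-processing callbacks also receive the results extracted so far.
def find_stock_ticker(raw_html, results):
    # Collect tokens that look like stock tickers, e.g. "$AAPL".
    matched_ticker = []
    for ticker in re.findall(r'[$][A-Za-z][\S]*', str(results['content'])):
        matched_ticker.append(ticker)
    return {'matched_ticker': matched_ticker}

extract = Extractor(author_prob_threshold=0.1,
      meta_postprocess=[meta_pre1, meta_pre2],
      postprocess=[find_stock_ticker])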

The extracted results will contain additional keys such as first_value and second_value. Note that callbacks are executed in the given order (meta_pre1 runs first, followed by meta_pre2), and a value set by an earlier stage is not overwritten by a later stage.

raw_html = requests.get('https://apnews.com/article/6e58b5742b36e3de53298cf73fbfdf48').text
results = extract(raw_html)

In this example, first_value remains 0 even though meta_pre2 also returns first_value=1, because meta_pre1 has already assigned first_value as 0.
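
The "earlier callback wins" behaviour can be sketched in plain Python. This is only a model of the merge rule described above, not ExtractNet's actual implementation, and `run_meta_callbacks` is a hypothetical helper introduced for illustration.

```python
def meta_pre1(raw_html):
    return {'first_value': 0}


def meta_pre2(raw_html):
    return {'first_value': 1, 'second_value': 2}


def run_meta_callbacks(raw_html, callbacks):
    """Run callbacks in order; keys set by an earlier callback are kept."""
    merged = {}
    for callback in callbacks:
        for key, value in callback(raw_html).items():
            # setdefault only writes the key if it is not already present,
            # so a later callback cannot overwrite an earlier value.
            merged.setdefault(key, value)
    return merged


merged = run_meta_callbacks('<html></html>', [meta_pre1, meta_pre2])
print(merged)
```

Reversing the list to `[meta_pre2, meta_pre1]` would leave first_value as 1 under this model, which is why the callback order matters.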

Contributing

We love contributions! Open an issue, or fork/create a pull request.

Develop Locally

Since extractnet relies on several C++ modules, you need to compile them before running it locally.

Usually this command is all you need:

make

However, you can try to build it

Suppress logging errors

Setting the log level to CRITICAL suppresses all logging output below that level

import logging

from extractnet import Extractor
from extractnet.blocks import BlockifyError
logging.getLogger('extractnet').setLevel(logging.CRITICAL)

extractor = Extractor()
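
The effect of the level change can be checked with the standard library alone. The sketch below only assumes the logger name 'extractnet' used above and does not import the package; `quiet_logger` is a hypothetical helper.

```python
import logging


def quiet_logger(name):
    """Raise the named logger's threshold so only CRITICAL records pass."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.CRITICAL)
    return logger


logger = quiet_logger('extractnet')

# Records below CRITICAL are now dropped by this logger.
print(logger.isEnabledFor(logging.ERROR))
print(logger.isEnabledFor(logging.CRITICAL))
```

CRITICAL records still pass through, so this silences warnings and errors rather than disabling logging entirely.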

More details about the code structure

Coming soon

Reference

Content extraction using diverse feature sets

[1] Peters, Matthew E. and D. Lecocq, Content extraction using diverse feature sets

@inproceedings{Peters2013ContentEU,
  title={Content extraction using diverse feature sets},
  author={Matthew E. Peters and D. Lecocq},
  booktitle={WWW '13 Companion},
  year={2013}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}


Release history

Version	Release date
2.0.7 (this version)	Nov 6, 2022
2.0.6	Oct 30, 2022
2.0.4	Apr 27, 2022
2.0.3	Apr 27, 2022
1.0.4	Feb 9, 2021
1.0.3	Jan 1, 2021
1.0.2	Dec 17, 2020
1.0.0	Dec 10, 2020

Download files

Source Distribution

extractnet-2.0.7.tar.gz (1.8 MB), uploaded Nov 6, 2022

Built Distributions

File	Size	Uploaded	Python	Platform
extractnet-2.0.7-cp310-cp310-manylinux_2_24_x86_64.whl	3.3 MB	Nov 6, 2022	CPython 3.10	manylinux: glibc 2.24+ x86-64
extractnet-2.0.7-cp310-cp310-macosx_11_0_x86_64.whl	1.8 MB	Nov 6, 2022	CPython 3.10	macOS 11.0+ x86-64
extractnet-2.0.7-cp310-cp310-macosx_10_15_x86_64.whl	1.8 MB	Nov 6, 2022	CPython 3.10	macOS 10.15+ x86-64
extractnet-2.0.7-cp39-cp39-manylinux_2_24_x86_64.whl	3.3 MB	Nov 6, 2022	CPython 3.9	manylinux: glibc 2.24+ x86-64
extractnet-2.0.7-cp39-cp39-macosx_12_0_x86_64.whl	1.8 MB	Nov 6, 2022	CPython 3.9	macOS 12.0+ x86-64
extractnet-2.0.7-cp38-cp38-manylinux_2_24_x86_64.whl	3.3 MB	Nov 6, 2022	CPython 3.8	manylinux: glibc 2.24+ x86-64
extractnet-2.0.7-cp38-cp38-macosx_10_16_x86_64.whl	1.8 MB	Nov 6, 2022	CPython 3.8	macOS 10.16+ x86-64
extractnet-2.0.7-cp38-cp38-macosx_10_15_x86_64.whl	1.8 MB	Nov 6, 2022	CPython 3.8	macOS 10.15+ x86-64
extractnet-2.0.7-cp37-cp37m-manylinux_2_24_x86_64.whl	3.3 MB	Nov 6, 2022	CPython 3.7m	manylinux: glibc 2.24+ x86-64
extractnet-2.0.7-cp37-cp37m-macosx_10_16_x86_64.whl	1.8 MB	Nov 6, 2022	CPython 3.7m	macOS 10.16+ x86-64
extractnet-2.0.7-cp37-cp37m-macosx_10_15_x86_64.whl	1.8 MB	Nov 6, 2022	CPython 3.7m	macOS 10.15+ x86-64
extractnet-2.0.7-cp36-cp36m-manylinux_2_24_x86_64.whl	3.3 MB	Nov 6, 2022	CPython 3.6m	manylinux: glibc 2.24+ x86-64
extractnet-2.0.7-cp36-cp36m-macosx_10_16_x86_64.whl	1.8 MB	Nov 6, 2022	CPython 3.6m	macOS 10.16+ x86-64
extractnet-2.0.7-cp36-cp36m-macosx_10_15_x86_64.whl	1.8 MB	Nov 6, 2022	CPython 3.6m	macOS 10.15+ x86-64

File details

Details for the file extractnet-2.0.7.tar.gz.

File metadata

Download URL: extractnet-2.0.7.tar.gz
Upload date: Nov 6, 2022
Size: 1.8 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for extractnet-2.0.7.tar.gz
Algorithm	Hash digest
SHA256	`0ef0ea022db2a479cdf377be7f13a88d0376dbf1f87af735749bf8319118127c`
MD5	`c3c7c6ac7cad69de6281dfd408421c1c`
BLAKE2b-256	`f424349250dc5c7bcf8ee98d72bedbb285ea47d95ac8c56c5df324f7e931b0b4`

