Extract the main article content (and optionally comments) from a web page

ExtractNet

Based on the popular content extraction package Dragnet, ExtractNet extends the machine learning approach to extract other attributes such as date, author and keywords from news articles.

Example code:

Install the latest released version with:

pip install extractnet

To extract the content and other metadata, pass the fetched HTML to the extractor:

import requests
from extractnet import Extractor

raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text
results = Extractor().extract(raw_html)

Why not just use an existing rule-based extraction method?

We found that some webpages don't provide the real author name and simply populate the author tag with a default value.

For example, ltn.com.tw and udn.com always populate the same author value for every news article, while the real author can only be found within the content.

Our machine-learning-first approach extracts the correct fields the way a human reading the website would

ExtractNet uses a machine learning approach to extract the relevant data from the visible sections of the webpage, just like a human.
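The failure mode described above can be sketched with a hand-written rule that trusts the author meta tag. This is a minimal illustration using only the Python standard library. The sample HTML, the site name and the byline are made up for the example, and the rule shown is a typical rule-based extractor, not code from ExtractNet or from the sites mentioned.

```python
from html.parser import HTMLParser


class MetaAuthorParser(HTMLParser):
    """Collects the content of <meta name="author"> tags (a typical hand-written rule)."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if tag == "meta" and name == "author":
            self.authors.append(attr.get("content") or "")


def rule_based_author(raw_html):
    """Return the first author meta value, or None if the tag is absent."""
    parser = MetaAuthorParser()
    parser.feed(raw_html)
    return parser.authors[0] if parser.authors else None


# Hypothetical page: the meta tag holds a site-wide default,
# while the real byline only appears in the visible article text.
SAMPLE_HTML = """
<html><head><meta name="author" content="Example Daily News"></head>
<body><article><p>By Jane Doe, staff reporter</p><p>Story text...</p></article></body></html>
"""

print(rule_based_author(SAMPLE_HTML))
```

The rule returns the site-wide default rather than the byline in the article body, which is the gap a model reading the visible content is meant to close.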

ExtractNet pipeline

What ExtractNet is and isn't

  • ExtractNet is a platform for extracting any attributes of interest from any webpage, not just content-based articles.

  • The core of ExtractNet aims to convert unstructured webpages into structured data without relying on hand-crafted rules.

  • ExtractNet does not support boilerplate content extraction.

  • ExtractNet allows users to add custom pipelines that return additional data through a list of callback functions.


Performance

Results of the body extraction evaluation:

We use the same body extraction benchmark as article-extraction-benchmark.

Model        Precision      Recall         F1             Accuracy       Open Source
AutoExtract  0.984 ± 0.003  0.956 ± 0.010  0.970 ± 0.005  0.470 ± 0.037
Diffbot      0.958 ± 0.009  0.944 ± 0.013  0.951 ± 0.010  0.348 ± 0.035
ExtractNet   0.922 ± 0.011  0.933 ± 0.013  0.927 ± 0.010  0.160 ± 0.027
boilerpipe   0.850 ± 0.016  0.870 ± 0.020  0.860 ± 0.016  0.006 ± 0.006
dragnet      0.925 ± 0.012  0.889 ± 0.018  0.907 ± 0.014  0.221 ± 0.030
html-text    0.500 ± 0.017  0.994 ± 0.001  0.665 ± 0.015  0.000 ± 0.000
newspaper    0.917 ± 0.013  0.906 ± 0.017  0.912 ± 0.014  0.260 ± 0.032
readability  0.913 ± 0.014  0.931 ± 0.015  0.922 ± 0.013  0.315 ± 0.034
trafilatura  0.930 ± 0.010  0.967 ± 0.009  0.948 ± 0.008  0.243 ± 0.031
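As a reading aid for the table, the F1 column is consistent with the harmonic mean of the precision and recall columns. The sketch below recomputes it from the rounded means above. It is only a sanity check on the reported numbers, and it assumes nothing about how the benchmark's own scoring script aggregates per-page results.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "AutoExtract": (0.984, 0.956, 0.970),
    "Diffbot": (0.958, 0.944, 0.951),
    "ExtractNet": (0.922, 0.933, 0.927),
    "boilerpipe": (0.850, 0.870, 0.860),
    "dragnet": (0.925, 0.889, 0.907),
    "html-text": (0.500, 0.994, 0.665),
    "newspaper": (0.917, 0.906, 0.912),
    "readability": (0.913, 0.931, 0.922),
    "trafilatura": (0.930, 0.967, 0.948),
}

for model, (precision, recall, f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{model:12s} computed={computed:.3f} reported={f1:.3f}")
```

Small differences in the third decimal place are expected because the inputs are already rounded.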

Results of author name extraction:

Model                                  F1
ExtractNet: fasttext embeddings + CRF  0.904 ± 0.10

List of changes from Dragnet

  • The underlying classifier is now Catboost instead of Decision Tree for all attribute extraction, for consistency and a performance boost.

  • Updated CSS features and added a text+css latent feature.

  • Includes a CRF model that extracts names from author block text.

  • Trained on 22,000+ updated webpages collected in late 2020, 20 times the amount of data used for Dragnet.

GETTING STARTED

Installing and extraction

pip install extractnet

import requests
from extractnet import Extractor

raw_html = requests.get('https://apnews.com/article/6e58b5742b36e3de53298cf73fbfdf48').text
results = Extractor().extract(raw_html)
for key, value in results.items():
    print(key)
    print(value)
    print('------------')

Callbacks

ExtractNet also supports callback functions that inject additional features during the extraction process.

A quick look at usage: each callback can access the raw HTML string provided during extraction. This allows users to add additional information, such as language detection, to the final results.

import re

# Metadata callbacks receive only the raw HTML string.
def meta_pre1(raw_html):
    return {'first_value': 0}

def meta_pre2(raw_html):
    return {'first_value': 1, 'second_value': 2}

# Postprocess callbacks also receive the extraction results.
def find_stock_ticker(raw_html, results):
    matched_ticker = []
    for ticker in re.findall(r'[$][A-Za-z][\S]*', str(results['content'])):
        matched_ticker.append(ticker)
    return {'matched_ticker': matched_ticker}

extract = Extractor(author_prob_threshold=0.1,
      meta_postprocess=[meta_pre1, meta_pre2],
      postprocess=[find_stock_ticker])

The extracted results will contain keys such as first_value and second_value. Note that callbacks are executed in the given order (meta_pre1 runs first, followed by meta_pre2), and any result set by an earlier stage will not be overwritten by a later stage.

raw_html = requests.get('https://apnews.com/article/6e58b5742b36e3de53298cf73fbfdf48').text
results = extract(raw_html)

In this example the value of first_value will remain 0 even though meta_pre2 also returns first_value=1, because the meta_pre1 callback has already assigned first_value as 0.
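The ordering rule can be simulated without ExtractNet installed. The sketch below is an illustration of the behaviour described here, not ExtractNet's actual implementation: run_callbacks_in_order is a hypothetical helper, and detect_language is a hypothetical callback that reads the lang attribute of the html tag as a stand-in for the language detection use case mentioned earlier.

```python
import re


def meta_pre1(raw_html):
    return {'first_value': 0}


def meta_pre2(raw_html):
    return {'first_value': 1, 'second_value': 2}


def detect_language(raw_html):
    """Hypothetical callback: read the lang attribute of the <html> tag, if any."""
    match = re.search(r'<html[^>]*\blang=["\']?([A-Za-z-]+)', raw_html, flags=re.IGNORECASE)
    return {'language': match.group(1).lower()} if match else {}


def run_callbacks_in_order(raw_html, callbacks):
    """Hypothetical helper: merge callback results, keeping the first value seen per key."""
    merged = {}
    for callback in callbacks:
        for key, value in callback(raw_html).items():
            # setdefault only writes when the key is not present yet,
            # so earlier callbacks win over later ones.
            merged.setdefault(key, value)
    return merged


sample_html = '<html lang="en-US"><body><p>Hello</p></body></html>'
merged = run_callbacks_in_order(sample_html, [meta_pre1, meta_pre2, detect_language])
print(merged)
```

Reversing the order of meta_pre1 and meta_pre2 in the list would make first_value come out as 1, so the order of the callback list matters whenever two callbacks share a key.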

Contributing

We love contributions! Open an issue, or fork/create a pull request.

Develop Locally

Since extractnet relies on several C++ modules, you need to compile them before running it locally.

Usually this command is all you need:

make

However, you can try to build it

Suppress logging errors

Setting the log level to CRITICAL suppresses all log output below that level:

import logging

from extractnet import Extractor
from extractnet.blocks import BlockifyError

# Only CRITICAL messages from the extractnet logger will be emitted.
logging.getLogger('extractnet').setLevel(logging.CRITICAL)

extractor = Extractor()
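The effect of that setting can be checked with the standard logging module alone. The sketch below assumes only that the package logs under the 'extractnet' logger name used above. The child logger name 'extractnet.blocks' is taken from the import path and is used here purely to show that child loggers inherit the level.

```python
import logging

logger = logging.getLogger('extractnet')
logger.setLevel(logging.CRITICAL)


def is_suppressed(level):
    """True if a message at this level would be dropped by the extractnet logger."""
    return not logger.isEnabledFor(level)


# Child loggers without their own level inherit the parent's effective level.
child = logging.getLogger('extractnet.blocks')

print(is_suppressed(logging.WARNING))
print(is_suppressed(logging.ERROR))
print(is_suppressed(logging.CRITICAL))
print(child.getEffectiveLevel() == logging.CRITICAL)
```

Messages logged at CRITICAL itself still pass through, so this setting quiets warnings and errors rather than disabling the logger entirely.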

More details about the code structure

Coming soon

Reference

Content extraction using diverse feature sets

[1] Peters, Matthew E. and D. Lecocq, Content extraction using diverse feature sets

@inproceedings{Peters2013ContentEU,
  title={Content extraction using diverse feature sets},
  author={Matthew E. Peters and D. Lecocq},
  booktitle={WWW '13 Companion},
  year={2013}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
