This is a pre-production deployment of Warehouse. Changes made here affect the production instance of PyPI (pypi.python.org).
Help us improve Python packaging - Donate today!

Extract the main article content (and optionally comments) from a web page

Project Description

Dragnet

Dragnet isn’t interested in the shiny chrome or boilerplate dressing of a web page. It’s interested in… ‘just the facts.’ The machine learning models in Dragnet extract the main article content and optionally user generated comments from a web page. They provide state of the art performance on variety of test benchmarks.

For more information on our approach check out:

Installing

The build requires numpy, lxml and a new version of Cython, so first make sure they are installed, then install Dragnet:

pip install numpy
pip install --upgrade cython
pip install lxml
pip install dragnet

GETTING STARTED

Depending on your use case, we provide two separate models to extract just the main article content or the content and any user generated comments. Each model implements the analyze method that takes an HTML string and returns the content string.

import requests
from dragnet import content_extractor, content_comments_extractor

# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)

# get main article without comments
content = content_extractor.analyze(r.content)

# get article and comments
content_comments = content_comments_extractor.analyze(r.content)
Release History

Release History

This version
History Node

1.1.0

History Node

1.0.2

History Node

1.0.1

History Node

1.0.0

Download Files

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
dragnet-1.1.0.tar.gz (1.8 MB) Copy SHA256 Checksum SHA256 Source Jan 17, 2017

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting