Extract the main article content (and optionally comments) from a web page

Project description


Dragnet isn’t interested in the shiny chrome or boilerplate dressing of a web page. It’s interested in… ‘just the facts.’ The machine learning models in Dragnet extract the main article content and, optionally, user-generated comments from a web page. They provide state-of-the-art performance on a variety of test benchmarks.

For more information on our approach check out:


The build requires numpy, lxml, and a recent version of Cython. Make sure they are installed first, then install Dragnet:

pip install numpy
pip install --upgrade cython
pip install lxml
pip install dragnet
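
Because the build depends on numpy, lxml, and Cython being present first, it can help to confirm they are importable before running the last command. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical and is not part of Dragnet.

```python
import importlib.util


def missing_modules(names=("numpy", "lxml", "Cython")):
    """Return the subset of `names` that cannot be found on the import path.

    `missing_modules` is a hypothetical helper for illustration; the default
    names are the import names of the build requirements listed above.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Install these before building Dragnet:", ", ".join(missing))
    else:
        print("All build requirements are importable.")
```

An empty list means all three packages were found. This only checks that the modules can be located, not that their versions are new enough for the build.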


Depending on your use case, we provide two separate models: one extracts just the main article content, and the other extracts the content and any user-generated comments. Each model implements an analyze method that takes an HTML string and returns the content string.

import requests
from dragnet import content_extractor, content_comments_extractor

# fetch HTML
url = ''
r = requests.get(url)

# get main article without comments
content = content_extractor.analyze(r.content)

# get article and comments
content_comments = content_comments_extractor.analyze(r.content)
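
Since both models expose the same analyze method, calling code can be written once and handed either model. The sketch below illustrates that pattern under stated assumptions: extract_text and StubExtractor are hypothetical names introduced here, and the stub only stands in for a Dragnet model so that the example runs without Dragnet or network access.

```python
def extract_text(html, extractor):
    """Run `extractor.analyze` on `html`, returning '' for empty input.

    `extractor` is any object with an `analyze(html)` method, for example
    one of the two Dragnet models described above. This wrapper is a
    hypothetical convenience, not part of Dragnet itself.
    """
    # Works for both str and bytes: strip() of whitespace-only input is falsy.
    if not html or not html.strip():
        return ""
    return extractor.analyze(html)


class StubExtractor:
    """Hypothetical stand-in for a Dragnet model, for illustration only."""

    def analyze(self, html):
        return "stub content"


if __name__ == "__main__":
    print(extract_text("<html><body><p>Hello</p></body></html>", StubExtractor()))
```

With Dragnet installed, the same function could be called as extract_text(r.content, content_extractor) or extract_text(r.content, content_comments_extractor), assuming the models behave as described above.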

Download files

Files for dragnet, version 1.0.1
Filename                                          Size      File type  Python version
dragnet-1.0.1-cp27-none-macosx_10_10_intel.whl    1.1 MB    Wheel      cp27
dragnet-1.0.1.tar.gz                              923.8 kB  Source     None
