Extract the main article content (and optionally comments) from a web page
Project description
Dragnet
Dragnet isn’t interested in the shiny chrome or boilerplate dressing of a web page. It’s interested in… ‘just the facts.’ The machine learning models in Dragnet extract the main article content and, optionally, user-generated comments from a web page. They provide state-of-the-art performance on a variety of test benchmarks.
For more information on our approach, check out:
- The Dragnet homepage
- Our paper Content Extraction Using Diverse Feature Sets, published at WWW in 2013, gives an overview of the machine learning approach.
- A comparison of Dragnet and alternate content extraction packages.
- This blog post explains the intuition behind the algorithms.
GETTING STARTED
Depending on your use case, we provide two separate models: one extracts just the main article content, and the other extracts the content plus any user-generated comments. Each model implements an analyze method that takes an HTML string and returns the extracted content as a string.
import requests
from dragnet import content_extractor, content_comments_extractor
# fetch HTML
url = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'
r = requests.get(url)
# get main article without comments
content = content_extractor.analyze(r.content)
# get article and comments
content_comments = content_comments_extractor.analyze(r.content)
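Because both models expose the same analyze method, they can be driven by one small helper. The sketch below is only an illustration: `run_extractors` and `UpperCaseStub` are hypothetical names that are not part of Dragnet, and the stub stands in for a real model so the snippet runs without Dragnet installed.

```python
def run_extractors(html, extractors):
    """Apply each extractor's analyze() to the same HTML.

    `extractors` maps a label to any object with an analyze(html) method
    (the interface described above). Hypothetical helper, not a Dragnet API.
    Returns a dict mapping each label to the extracted text.
    """
    results = {}
    for label, extractor in extractors.items():
        results[label] = extractor.analyze(html)
    return results


class UpperCaseStub(object):
    """Stand-in extractor so the sketch runs without Dragnet installed."""

    def analyze(self, html):
        # A real model would return the article text; this just upper-cases.
        return html.upper()


stub_results = run_extractors("<p>hello</p>", {"content": UpperCaseStub()})
```

With Dragnet installed, the mapping would presumably be `{'content': content_extractor, 'content_comments': content_comments_extractor}`, so that a single fetch of the page feeds both models.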
Download files
Source Distribution
Built Distribution
File details
Details for the file dragnet-1.0.0.tar.gz.
File metadata
- Download URL: dragnet-1.0.0.tar.gz
- Upload date:
- Size: 922.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b9907843d1e7aea239907c748ab9313cbe3974dbc467f955e5b3c1f6270f5c55 |
| MD5 | 1a71b6ad3ad87d98488e0dc4a2d848f6 |
| BLAKE2b-256 | 7cba81ca8ac1d42248495a765de3f26fb5a3b1dc5d0f66eab739bd9e598a52bc |
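A downloaded archive can be checked against the SHA256 digest listed for dragnet-1.0.0.tar.gz. The following is a minimal sketch using only the standard library; the function names and the local file path passed to them are assumptions for illustration.

```python
import hashlib

# SHA256 digest listed above for dragnet-1.0.0.tar.gz
EXPECTED_SHA256 = "b9907843d1e7aea239907c748ab9313cbe3974dbc467f955e5b3c1f6270f5c55"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file at `path` (a hypothetical local download) has the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected('dragnet-1.0.0.tar.gz')` on the downloaded file should return True only when the archive is byte-for-byte identical to the one the digest was computed from.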
File details
Details for the file dragnet-1.0.0-cp27-none-macosx_10_10_intel.whl.
File metadata
- Download URL: dragnet-1.0.0-cp27-none-macosx_10_10_intel.whl
- Upload date:
- Size: 1.1 MB
- Tags: CPython 2.7, macOS 10.10+ Intel (x86-64, i386)
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f8e47495322ca02c1540a29dbdf61add117c26044261501484f22d9b5994559 |
| MD5 | 84fb8d211155099f6406c2ada3f63578 |
| BLAKE2b-256 | 34688434e4cfdec13447b4455ce20099f63f4f74eb47f8bc9325c42ea6bf67ab |