A classifier for detecting soft 404 pages
A “soft” 404 page is a page that is served with a 200 status, but whose content actually says that the requested page is not available.
Installation
pip install soft-404
Usage
The easiest way is to use the soft404.probability function:
>>> import soft404
>>> soft404.probability('<h1>Page not found</h1>')
0.9736860086882132
You can also create a classifier explicitly:
>>> from soft404 import Soft404Classifier
>>> clf = Soft404Classifier()
>>> clf.predict('<h1>Page not found</h1>')
0.9736860086882132
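To act on the returned probability, a cut-off is needed. The following is a minimal sketch, not part of the package: the helper name flag_soft_404 and the 0.5 threshold are assumptions chosen for illustration, and the scoring function is passed in so that the example does not depend on the package being installed.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def flag_soft_404(
    pages: Iterable[Tuple[str, str]],
    probability: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, tune it for your own data
) -> List[Dict[str, object]]:
    """Score (url, html) pairs and mark those at or above the threshold."""
    results = []
    for url, html in pages:
        score = probability(html)
        results.append(
            {"url": url, "score": score, "soft_404": score >= threshold}
        )
    return results


# With the package installed, the scorer would be soft404.probability:
#   import soft404
#   flag_soft_404([(url, html)], soft404.probability)
```

A higher threshold flags fewer pages with more certainty; a lower one catches more soft 404 pages at the cost of more false alarms.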
Development
The classifier is trained on 198801 pages from 35995 domains, about 1/3 of which are 404 pages. With 10-fold cross-validation, PR AUC (average precision) is 0.991 ± 0.002, and ROC AUC is 0.995 ± 0.002.
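For readers unfamiliar with the two metrics, here is a small pure-Python sketch of how they can be computed for one fold and then summarized across folds. It is an illustration only, not the code used by train.py: the function names are hypothetical, ties in average precision are broken by sort order, and reading "±" as a sample standard deviation is an assumption.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at each positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions: List[float] = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Quadratic in the number of examples: fine for a sketch, slow for real data.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(fold_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-fold scores (assumed meaning of ±)."""
    return mean(fold_scores), stdev(fold_scores)
```

In practice a library such as scikit-learn provides equivalent metric functions; the sketch only shows what the reported numbers measure.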
Getting data for training
Install dev requirements:
pip install -r requirements_dev.txt
Run the crawler for a while (results will appear in the pages.jl.gz file):
cd crawler
scrapy crawl spider -o pages.jl.gz -t jl.gz -s JOBDIR=job
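To peek at the crawl output before training, the file can be read line by line. This sketch assumes that the jl.gz format is gzip-compressed JSON lines (one JSON object per line); the helper name read_jl_gz is hypothetical, and no particular field names are assumed because the record layout is not documented here.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def read_jl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line of a gzip file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Example: print the keys of the first record, if the file exists.
#   first = next(read_jl_gz("pages.jl.gz"))
#   print(sorted(first))
```

Because the function is a generator, large crawl files are streamed rather than loaded into memory at once.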
Training
First, extract text and structure from the HTML:
./soft404/convert_to_text.py pages.jl.gz items
This will produce two files, items.meta.jl.gz and items.items.jl.gz. Next, train the classifier:
./soft404/train.py items
The vectorizer takes a while to run, but its result is cached (the name of the cache file will be printed on the next run). If you are happy with the results, save the classifier:
./soft404/train.py items --save soft404/clf.joblib
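The idea behind caching the vectorizer result, mentioned above, is to store an expensive result on disk and reuse it on later runs. The sketch below illustrates that general pattern with the standard pickle module; it is not how train.py implements its cache, and the helper name cached is hypothetical.

```python
import os
import pickle
from typing import Any, Callable


def cached(path: str, compute: Callable[[], Any]) -> Any:
    """Return the pickled result at `path`, computing and storing it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()  # the slow step runs only on a cache miss
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Deleting the cache file forces the computation to run again, which is needed whenever the input data changes.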
License
License is MIT.