A classifier for detecting soft 404 pages
A “soft” 404 page is a page that is served with a 200 status code but whose body actually says that the requested content is not available.
Installation
pip install soft404
Usage
The easiest way is to use the soft404.probability function:
>>> import soft404
>>> soft404.probability('<h1>Page not found</h1>')
0.9736860086882132
You can also create a classifier explicitly:
>>> from soft404 import Soft404Classifier
>>> clf = Soft404Classifier()
>>> clf.predict('<h1>Page not found</h1>')
0.9736860086882132
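In practice the score is usually combined with the HTTP status code, since only pages served with 200 can be soft 404s. The following is a minimal sketch, not part of the library: the helper name is_soft_404 and the 0.5 cut-off are assumptions made for illustration, and the scorer is passed in as an argument so that soft404.probability can be plugged in.

```python
from typing import Callable


def is_soft_404(status: int, html: str,
                score: Callable[[str], float],
                threshold: float = 0.5) -> bool:
    """Return True if a page served with status 200 looks like a 404 page.

    `score` is any function mapping HTML to a probability, for example
    soft404.probability. The 0.5 threshold is an illustrative assumption,
    not a value recommended by the library.
    """
    if status != 200:
        # Real error statuses (including hard 404s) are not "soft" 404s.
        return False
    return score(html) >= threshold
```

With the package installed, this could be called as is_soft_404(200, html, soft404.probability). A suitable threshold depends on how costly false positives are for your crawl.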
Development
The classifier is trained on 120k pages from 25k domains, with a 404 page ratio of about 1/3. With 10-fold cross-validation, PR AUC (average precision) is 0.990 ± 0.003, and ROC AUC is 0.995 ± 0.002.
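For readers unfamiliar with the two metrics, here is a small pure-Python sketch of how each is defined for a single set of predictions. It is illustrative only and is not the project's evaluation code: the function names are made up for this example, and it assumes that both classes are present and that no two scores are tied.

```python
def average_precision(labels, scores):
    """PR AUC as average precision: mean precision at each positive's rank.

    Assumes at least one positive label and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision among the top `rank` items
    return total / hits


def roc_auc(labels, scores):
    """ROC AUC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a 10-fold setup, each metric would be computed once per held-out fold and the per-fold values then summarized.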
Getting data for training
Install dev requirements:
pip install -r requirements_dev.txt
Run the crawler for a while (results will appear in the pages.jl.gz file):
cd crawler
scrapy crawl spider -o gzip:pages.jl -s JOBDIR=job
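To check how much data the crawler has collected, the output can be read back with the standard library. This is a sketch that assumes pages.jl.gz holds one JSON object per line, as the .jl extension suggests. The helper name iter_json_lines is hypothetical, and no particular field names are assumed.

```python
import gzip
import json


def iter_json_lines(path):
    """Yield one decoded JSON object per non-empty line of a gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, sum(1 for _ in iter_json_lines("pages.jl.gz")) counts the crawled pages without loading them all into memory.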
Training
First, extract text and structure from the HTML:
./soft404/convert_to_text.py pages.jl.gz items
This will produce two files, items.meta.jl.gz and items.items.jl.gz. Next, train the classifier:
./soft404/train.py items
The vectorizer takes a while to run, but its result is cached (the name of the cache file is printed on the next run). If you are happy with the results, save the classifier:
./soft404/train.py items --save soft404/clf.joblib
License
License is MIT.
Source Distribution
File details
Details for the file soft404-0.2.1.tar.gz.
File metadata
- Download URL: soft404-0.2.1.tar.gz
- Upload date:
- Size: 30.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 09d2cf8bd6264d542f3f8613e3d8693e229eca149b203b30822506e4e0805c4f
MD5 | cdeef15dd9456a109f3354bce5376410
BLAKE2b-256 | de161691f87f56a6a8ef0bbd164b389b1ab97a3fe82556921a1d619dcb5ba6a7