Minimal URL extraction / classification / HTTP check pipeline.
urlcheck-smith
A compact, fast URL analysis pipeline:
- Extract URLs from arbitrary text files
- Classify domains using suffix-based “site runner” rules (government, edu, private, etc.)
- Trust Tier classification (Official, News, General)
- Optional HTTP checks (status, redirect, CAPTCHA/human-check heuristic)
- Output results as CSV or JSONL
- Standalone URL classifier (classify-url)
- Batch classification mode (classify)
- Supports rule presets (Japan/EU/global), custom YAML rules, explain mode, quiet mode
- Classification: Assigns categories (e.g., government, education) based on domain suffix rules.
- HTTP Verification: Checks reachability and captures status codes.
- Soft 404 Detection: Identifies pages that return a 200 OK status but contain "Page Not Found" text.
- Trust Tier Analysis: Automatically categorizes URLs into TIER_1_OFFICIAL, TIER_2_NEWS, or TIER_3_GENERAL using TrustManager.
- Human-Check Detection: Flags URLs that likely lead to CAPTCHA or bot-detection screens.
Features in Detail
Soft 404 Detection
Many websites are configured to return a standard 200 OK status even when a page is missing, often displaying a custom "not found" message to users. urlcheck-smith detects this by scanning the first 2000 characters of the response for common markers like:
- "page not found"
- "error 404"
- "the page you requested cannot be found"
If a marker is found, the soft_404_detected field in the output is set to True, allowing you to filter out these "ghost" pages from your results.
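The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual code: the names `SOFT_404_MARKERS`, `SCAN_LIMIT`, and `looks_like_soft_404` are hypothetical, and the case-insensitive comparison is an assumption.

```python
# Hypothetical sketch of a soft-404 heuristic; not urlcheck-smith's own code.
SOFT_404_MARKERS = (
    "page not found",
    "error 404",
    "the page you requested cannot be found",
)
SCAN_LIMIT = 2000  # only the first 2000 characters of the body are inspected


def looks_like_soft_404(status_code: int, body: str) -> bool:
    """Return True when a 200 response carries a 'not found' marker near the top."""
    if status_code != 200:
        # A real 404 (or any other status) is not a *soft* 404.
        return False
    head = body[:SCAN_LIMIT].lower()  # assumption: matching ignores case
    return any(marker in head for marker in SOFT_404_MARKERS)
```

For example, `looks_like_soft_404(200, "<h1>Page Not Found</h1>")` evaluates to `True`, while a marker that first appears after the 2000-character window is ignored.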
Trust Tier Classification
To help prioritize analysis, urlcheck-smith assigns a trust tier to each URL:
- TIER_1_OFFICIAL: Government (.gov, .go.jp, etc.), UN, and official EU domains.
- TIER_2_NEWS: Major news organizations (Reuters, AP, BBC, etc.).
- TIER_3_GENERAL: All other domains.
This is available via the trust_tier field in CSV/JSONL outputs.
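As a rough illustration of how such a tiering could work, the sketch below maps a URL's hostname to one of the three tiers. It is an assumption-laden example, not the TrustManager implementation: the suffix and domain lists (`OFFICIAL_SUFFIXES`, `NEWS_DOMAINS`) are illustrative and far shorter than any real rule set, and the function name `trust_tier` is hypothetical.

```python
from urllib.parse import urlsplit

# Illustrative lists only; the real presets are assumed to be more complete.
OFFICIAL_SUFFIXES = (".gov", ".go.jp", ".un.org", ".europa.eu")
NEWS_DOMAINS = ("reuters.com", "apnews.com", "bbc.com", "bbc.co.uk")


def _is_domain_or_subdomain(host: str, domain: str) -> bool:
    """True for 'bbc.com' and 'www.bbc.com', but not for 'notbbc.com'."""
    return host == domain or host.endswith("." + domain)


def trust_tier(url: str) -> str:
    """Assign a tier from the hostname; anything unrecognized is general."""
    host = (urlsplit(url).hostname or "").lower()
    if any(host.endswith(suffix) for suffix in OFFICIAL_SUFFIXES):
        return "TIER_1_OFFICIAL"
    if any(_is_domain_or_subdomain(host, d) for d in NEWS_DOMAINS):
        return "TIER_2_NEWS"
    return "TIER_3_GENERAL"
```

Under these assumed lists, `trust_tier("https://www.soumu.go.jp/")` gives `TIER_1_OFFICIAL` and `trust_tier("https://www.reuters.com/world")` gives `TIER_2_NEWS`.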
Installation (development)
python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
pytest
Commands Overview
1. scan — extract → classify → (optional) HTTP check
CSV output (default)
urlcheck-smith scan sample.txt -o urls.csv
JSONL output
urlcheck-smith scan sample.txt \
    --no-http \
    --format jsonl \
    -o urls.jsonl
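A JSONL file like this can be post-processed with the standard library alone. The sketch below counts records per trust tier; it assumes each line is a JSON object with a trust_tier field, as described for the CSV/JSONL outputs, and the helper name `tier_counts` is hypothetical.

```python
import json
from collections import Counter


def tier_counts(lines):
    """Count records per trust_tier from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: records without the field are grouped as "UNKNOWN".
        counts[record.get("trust_tier", "UNKNOWN")] += 1
    return counts


# Inline sample standing in for the contents of urls.jsonl.
sample = [
    '{"url": "https://www.soumu.go.jp/", "trust_tier": "TIER_1_OFFICIAL"}',
    '{"url": "https://example.com/", "trust_tier": "TIER_3_GENERAL"}',
    '{"url": "https://example.org/", "trust_tier": "TIER_3_GENERAL"}',
]
summary = tier_counts(sample)
```

With a real file, passing the open file handle works the same way, since a text file iterates line by line.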
Skip HTTP check
urlcheck-smith scan notes.txt --no-http -o urls_wo_status.csv
Custom rules
urlcheck-smith scan urls.txt \
    --rules my_rules.yaml \
    -o result.csv
Built-in rule presets
urlcheck-smith scan urls.txt --preset japan -o out.csv
urlcheck-smith scan urls.txt --preset eu -o out.csv
urlcheck-smith scan urls.txt --preset global -o out.csv
2. classify-url — classify a single URL
Default (JSON)
urlcheck-smith classify-url https://www.soumu.go.jp/
Explain mode
urlcheck-smith classify-url https://www.soumu.go.jp/ --explain
Output example:
{
  "url": "https://www.soumu.go.jp/",
  "base_url": "www.soumu.go.jp",
  "category": "government",
  "trust_tier": "TIER_1_OFFICIAL",
  "explain": {
    "matched_suffix": ".go.jp",
    "category": "government"
  }
}
Quiet mode (machine-friendly)
urlcheck-smith classify-url https://www.soumu.go.jp/ --quiet
Presets & custom rules
urlcheck-smith classify-url https://www.gov.uk/ --preset eu
urlcheck-smith classify-url https://policy.example.com/ --rules org_rules.yaml
3. classify — batch classify (no HTTP check)
Input file should contain one URL per line.
CSV output
urlcheck-smith classify urls.txt -o classified.csv
JSONL output
urlcheck-smith classify urls.txt --format jsonl -o out.jsonl
Quiet mode
urlcheck-smith classify urls.txt --quiet
Explain mode
urlcheck-smith classify urls.txt --explain -o out.jsonl
Rule Precedence & Discrepancies
When multiple rule sources are used, they are prioritized as follows:
- User rules (--rules): Rules from these files are checked first. If multiple files are provided, the one specified last has the highest priority.
- Base rules (--preset or default): These are checked only if no user rule matches.
The system uses a First-Match-Wins strategy. The first rule that matches the URL (by domain, suffix, or regex) determines the category and trust tier.
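The ordering above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: rules are plain dicts with the domain / suffix / regex keys used in rule files, matching is done against the hostname, and the function names (`matches`, `ordered_rules`, `classify_host`) are hypothetical rather than part of the package.

```python
import re


def matches(rule: dict, host: str) -> bool:
    """Test one rule against a hostname (assumed matching target)."""
    if "domain" in rule:
        return host == rule["domain"]
    if "suffix" in rule:
        return host.endswith(rule["suffix"])
    if "regex" in rule:
        return re.search(rule["regex"], host) is not None
    return False


def ordered_rules(user_rule_files: list, base_rules: list) -> list:
    """Last-specified user file first, then earlier files, then base rules."""
    ordered = []
    for rules in reversed(user_rule_files):
        ordered.extend(rules)
    ordered.extend(base_rules)
    return ordered


def classify_host(host, user_rule_files, base_rules,
                  default=("private", "TIER_3_GENERAL")):
    """First match wins; fall back to the default category and tier."""
    for rule in ordered_rules(user_rule_files, base_rules):
        if matches(rule, host):
            return rule["category"], rule["trust_tier"]
    return default
```

In this sketch, swapping the order of two user rule files can change the result for a host that both files match, which is why the order of the --rules arguments matters.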
Rule System
Custom rule file example (YAML)
Rule files can specify rules (a list of matchers) and optional default_category / default_trust_tier.
rules:
  - domain: "special.example.com"
    category: "internal"
    trust_tier: "TIER_1_OFFICIAL"
  - suffix: ".gov.uk"
    category: "government"
    trust_tier: "TIER_1_OFFICIAL"
  - regex: ".*-news\\.com$"
    category: "news"
    trust_tier: "TIER_2_RELIABLE"
default_category: "private"
default_trust_tier: "TIER_3_GENERAL"
Built-in presets
- --preset japan
- --preset eu
- --preset global
Each corresponds to a YAML file under urlcheck_smith/data/.
Development
make install
make test
License
MIT