Minimal URL extraction / classification / HTTP check pipeline.
urlcheck-smith
A compact, fast URL analysis library and pipeline:
- Module Package First: Designed as a Python library for easy integration into your own scripts and data pipelines.
- CLI Utilities: Provides powerful command-line tools for extraction, classification, and database management.
- Extract URLs from arbitrary text files
- Classify domains using suffix-based “site runner” rules (government, edu, private, etc.)
- Trust Tier classification (Official, News, General)
- Optional HTTP checks (status, redirect, CAPTCHA/human-check heuristic)
- Output results as CSV or JSONL
- Standalone URL classifier (classify-url)
- Interactive HTTPS URL extractor (extract-https) with CSV export (URL + SHA-256 hash)
- Batch classification mode (classify)
- Database management command (db) to enrich or add custom trusted domains
- Supports custom YAML rules, explain mode, quiet mode
- Classification: Assigns categories (e.g., government, education) based on domain suffix rules from the built-in UC Smith database.
- HTTP Verification: Checks reachability and captures status codes.
- Soft 404 Detection: Identifies pages that return a 200 OK status but contain "Page Not Found" text.
- Trust Tier Analysis: Automatically categorizes URLs into TIER_1_OFFICIAL, TIER_2_RELIABLE, or TIER_3_GENERAL using TrustManager.
- Human-Check Detection: Flags URLs that likely lead to CAPTCHA or bot-detection screens.
- Enrichment: Query the Google Fact Check API to scout for known misinformation flags and update the credibility score.
Features in Detail
Soft 404 Detection
Many websites are configured to return a standard 200 OK status even when a page is missing, often displaying a custom "not found" message to users. urlcheck-smith detects this by scanning the first 2000 characters of the response for common markers like:
- "page not found"
- "error 404"
- "the page you requested cannot be found"
If a marker is found, the soft_404_detected field in the output is set to True, allowing you to filter out these "ghost" pages from your results.
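The heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the function name looks_like_soft_404 and both constants are hypothetical, and case-insensitive substring matching is an assumption.

```python
# Hypothetical sketch of the soft 404 heuristic; names are not part of urlcheck-smith.
SOFT_404_MARKERS = (
    "page not found",
    "error 404",
    "the page you requested cannot be found",
)
SCAN_LIMIT = 2000  # only the first 2000 characters of the body are inspected


def looks_like_soft_404(status_code: int, body: str) -> bool:
    """Return True when a 200 response body contains a 'not found' marker."""
    if status_code != 200:
        return False  # a real error status is not a *soft* 404
    head = body[:SCAN_LIMIT].lower()  # assumption: matching ignores case
    return any(marker in head for marker in SOFT_404_MARKERS)


print(looks_like_soft_404(200, "<h1>Page Not Found</h1>"))
```

Limiting the scan to the start of the body keeps the check cheap, at the cost of missing markers that appear later in long pages.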
Trust Tier Classification
To help prioritize analysis, urlcheck-smith assigns a trust tier to each URL:
- TIER_1_OFFICIAL: Government (.gov, .go.jp, etc.), UN, and official international domains.
- TIER_2_RELIABLE: Verified news organizations (Reuters, AP, BBC, etc.) and educational institutions.
- TIER_3_GENERAL: All other domains.
This is available via the trust_tier field in CSV/JSONL outputs.
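One way to use these fields downstream is to filter a JSONL result file. The sketch below assumes one JSON object per line with the trust_tier and soft_404_detected fields named in this document; the helper keep_trusted and the sample records are made up for illustration.

```python
import io
import json
from typing import Iterable, Iterator


def keep_trusted(
    lines: Iterable[str],
    tiers=("TIER_1_OFFICIAL", "TIER_2_RELIABLE"),
) -> Iterator[dict]:
    """Yield records in the wanted tiers that were not flagged as soft 404s."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        # .get() is used because field presence in every record is an assumption
        if rec.get("trust_tier") in tiers and not rec.get("soft_404_detected", False):
            yield rec


# Made-up sample standing in for a real JSONL output file.
sample = io.StringIO(
    '{"url": "https://www.itu.int/", "trust_tier": "TIER_1_OFFICIAL", "soft_404_detected": false}\n'
    '{"url": "https://example.com/", "trust_tier": "TIER_3_GENERAL", "soft_404_detected": false}\n'
)
trusted = list(keep_trusted(sample))
print([rec["url"] for rec in trusted])
```

With a real file, pass an open file handle instead of the StringIO object.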
Installation (development)
python3 -m venv .venv
. .venv/bin/activate
pip install -e .[dev]
pytest
Usage Guide
urlcheck-smith is primarily a module package, which also provides a set of CLI utilities for common tasks.
0. extract-https — extract unique HTTPS URLs to CSV
Extract unique HTTPS URLs from a text file and export them to CSV (columns: URL, hashed_URL).
This command can be run either as a standalone console script:
extract-https --input sample.txt --output https_urls.csv
Or via the main urlcheck-smith CLI:
urlcheck-smith extract-https --input sample.txt --output https_urls.csv
If --input / --output are omitted, the command will prompt interactively. Leaving the output blank uses a timestamped default filename like https_urls_YYYYMMDD_HHMMSS.csv.
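For readers who want the same result inside a script, the behavior described above can be approximated with the standard library. This is a sketch under assumptions, not the command's source: the regular expression is simplified, the hash is assumed to be the hex SHA-256 digest of the UTF-8 URL, and all function names are hypothetical.

```python
from __future__ import annotations

import csv
import hashlib
import io
import re
from datetime import datetime

# Simplified pattern (an assumption): an https URL runs until whitespace or a closing delimiter.
HTTPS_URL = re.compile(r"https://[^\s<>\"')\]]+")


def extract_https(text: str) -> list[str]:
    """Return unique HTTPS URLs in first-seen order."""
    return list(dict.fromkeys(HTTPS_URL.findall(text)))


def write_csv(urls: list[str], out) -> None:
    """Write URL and hex SHA-256 digest rows under the documented column names."""
    writer = csv.writer(out)
    writer.writerow(["URL", "hashed_URL"])
    for url in urls:
        writer.writerow([url, hashlib.sha256(url.encode("utf-8")).hexdigest()])


def default_output_name(now: datetime | None = None) -> str:
    """Build a name of the form https_urls_YYYYMMDD_HHMMSS.csv."""
    return f"https_urls_{(now or datetime.now()):%Y%m%d_%H%M%S}.csv"


buf = io.StringIO()
write_csv(extract_https("Docs: https://example.com/a and https://example.com/a again"), buf)
print(buf.getvalue())
```

Using dict.fromkeys for de-duplication preserves the order in which URLs first appear, which keeps the CSV stable across runs on the same input.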
1. classify-url — classify a single URL
This is the most straightforward way to use the package, either as a library or via the CLI.
API Example (Library Usage)
If you want to classify a URL from Python, you can use the public API directly.
The example below shows a small helper script, scripts/classify_single_url.py, that demonstrates how to create a URL record and classify it.
from urlcheck_smith import SiteClassifier, UrlRecord


def classify_single_url(
    url: str,
    *,
    rules_path: str | None = None,
    explain: bool = False,
) -> dict:
    # Build the classifier; rules_path=None falls back to the built-in rules.
    classifier = SiteClassifier(
        rules_path=rules_path,
        explain=explain,
        normalize_domain=True,
    )
    # classify() works on a list, so wrap the single record and unwrap the result.
    rec = classifier.classify([UrlRecord(url=url)])[0]
    result = {
        "url": rec.url,
        "base_url": rec.base_url,
        "category": rec.category,
        "trust_tier": rec.trust_tier,
    }
    # Include the matched rule only when an explanation is available.
    if rec.explain:
        result["explain"] = rec.explain
    return result


data = classify_single_url("https://www.itu.int/en/Pages/default.aspx", explain=True)
print(data)
API Workflow Explained
- Importing core components: SiteClassifier (the engine) and UrlRecord (the data structure).
- Initializing: SiteClassifier is instantiated, enabling normalize_domain for consistency.
- Classification: classifier.classify(...) takes a list of UrlRecord objects. We pass a list with one item and take the first element ([0]).
- Extracting Results: Provides the detected category, trust_tier, and optionally the explain rule.
CLI Example (Default JSON)
urlcheck-smith classify-url https://www.itu.int/en/Pages/default.aspx
CLI Example (Explain mode)
urlcheck-smith classify-url https://www.itu.int/en/Pages/default.aspx --explain
Output example:
{
  "url": "https://www.itu.int/en/Pages/default.aspx",
  "base_url": "www.itu.int",
  "category": "international",
  "trust_tier": "TIER_1_OFFICIAL",
  "explain": "Matched pattern 'int' -> category 'international'"
}
CLI Example (Quiet mode)
urlcheck-smith classify-url https://www.itu.int/en/Pages/default.aspx --quiet
CLI Example (Custom rules)
urlcheck-smith classify-url https://policy.example.com/ --rules org_rules.yaml
2. scan — extract → classify → (optional) HTTP check
Extracts URLs from files and performs classification and optional HTTP checks.
CSV output (default)
urlcheck-smith scan sample.txt -o urls.csv
JSONL output
urlcheck-smith scan sample.txt \
--no-http \
--format jsonl \
-o urls.jsonl
Skip HTTP check
urlcheck-smith scan notes.txt --no-http -o urls_wo_status.csv
Custom rules
urlcheck-smith scan urls.txt \
--rules my_rules.yaml \
-o result.csv
Built-in rules
The system comes with a built-in database (ucsmith_db.yaml) containing thousands of government, educational, and news domains. These are used automatically.
3. classify — batch classify (no HTTP check)
Extracts and classifies URLs from input files.
CSV output
urlcheck-smith classify urls.txt -o classified.csv
JSONL output
urlcheck-smith classify urls.txt --format jsonl -o out.jsonl
Quiet mode
urlcheck-smith classify urls.txt --quiet
Explain mode
urlcheck-smith classify urls.txt --explain -o out.jsonl
Configuration
Rule Precedence
When multiple rule sources are used, they are prioritized as follows:
- User defined in database (db add command)
- User rules files (--rules flag)
- Global rules in database (ucsmith_db.yaml)
The system uses a Longest-Suffix-Match strategy. More specific rules (e.g., blog.google.com) will match before more general ones (e.g., google.com).
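The longest-suffix idea can be illustrated with a short sketch. It is not the library's matcher: the function name, the flat dict of rules, and the category labels are hypothetical, and matching on whole dot-separated labels is an assumption.

```python
from __future__ import annotations


def longest_suffix_match(host: str, rules: dict[str, str], default: str = "private") -> str:
    """Return the category of the most specific rule matching the host."""
    labels = host.lower().rstrip(".").split(".")
    # Try the full host first, then drop the leftmost label on each step,
    # so the first hit is always the longest matching suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return rules[candidate]
    return default


# Hypothetical rule table; the category names are made up for the example.
rules = {"google.com": "company", "blog.google.com": "blog", "gov.uk": "government"}
print(longest_suffix_match("blog.google.com", rules))
print(longest_suffix_match("mail.google.com", rules))
```

Because candidates are built from whole labels, a host such as notgoogle.com does not match the google.com rule in this sketch.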
API Key (Optional)
A Google API key is only required for domain enrichment via the db update command. All core features (scanning, classification, HTTP checks) work without it.
This package reads the API key from:
UCSMITH_GOOGLE_API_KEY
Set it as follows:
export UCSMITH_GOOGLE_API_KEY="your-api-key"
If your key is stored under another variable name, map it:
export UCSMITH_GOOGLE_API_KEY="$YOUR_EXISTING_VAR"
- Service: Google Fact Check Tools API
- Usage: Used to scout for known misinformation flags to update domain credibility scores.
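When driving db update from a script, it can help to confirm the variable is set before starting. The helper below is a hypothetical convenience, not part of the package; treating a blank value as missing is an assumption.

```python
from __future__ import annotations

import os
from typing import Mapping


def get_api_key(env: Mapping[str, str] = os.environ) -> str | None:
    """Return the key from UCSMITH_GOOGLE_API_KEY, or None when unset or blank."""
    key = env.get("UCSMITH_GOOGLE_API_KEY", "").strip()
    return key or None


if get_api_key() is None:
    print("UCSMITH_GOOGLE_API_KEY is not set; enrichment will be unavailable.")
```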
4. db — manage the UC Smith database
Manage your local credibility database (ucsmith_db.yaml).
Add a trusted domain
urlcheck-smith db add my-org.com --category organization
Remove a domain
urlcheck-smith db remove my-org.com
Enrich domains via Google Fact Check API
Scouts for known misinformation-related signals and updates the credibility score in the local cache.
Requires UCSMITH_GOOGLE_API_KEY to be set in the environment.
Update a single domain:
urlcheck-smith db update example.com
Update domains from a file:
urlcheck-smith db update --file domains.txt
Update all previously discovered domains:
urlcheck-smith db update --all
Requirements
- Set UCSMITH_GOOGLE_API_KEY in your environment
- An internet connection is required
- The result is stored in the local ucsmith_db.yaml cache
If the API key is missing, the command cannot perform enrichment.
Example: Bulk Enrichment Workflow
- Export the API key:
export UCSMITH_GOOGLE_API_KEY="your-api-key"
- Prepare a list of URLs (one per line) in your current directory:
echo "https://example.com/page1" > domains.txt
echo "https://malicious-site.org/news" >> domains.txt
- Run the update command:
urlcheck-smith db update --file domains.txt
- Check results:
Your local ucsmith_db.yaml database is updated with credibility scores and flag counts. All subsequent scan or classify commands will now use these enriched scores for the domains found in your file.
Rule System
Custom rule file example (YAML)
Rule files can specify rules (a list of matchers) and optional default_category / default_trust_tier. Note: The internal database uses a simplified name field for both domains and suffixes.
rules:
  - domain: "special.example.com"
    category: "internal"
    trust_tier: "TIER_1_OFFICIAL"
  - suffix: "gov.uk"
    category: "government"
    trust_tier: "TIER_1_OFFICIAL"
default_category: "private"
default_trust_tier: "TIER_3_GENERAL"
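Before passing a rule file to --rules, a quick structural check can catch typos. The sketch below works on the already-parsed document (the dict a YAML loader would return for the example above). The function validate_rules is hypothetical, and it assumes each rule carries exactly one of domain or suffix plus a category and a known trust_tier, as in the example; the real loader may be more lenient.

```python
VALID_TIERS = {"TIER_1_OFFICIAL", "TIER_2_RELIABLE", "TIER_3_GENERAL"}


def validate_rules(doc: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for i, rule in enumerate(doc.get("rules", [])):
        matchers = [k for k in ("domain", "suffix") if k in rule]
        if len(matchers) != 1:
            problems.append(f"rule {i}: expected exactly one of 'domain' or 'suffix'")
        if not rule.get("category"):
            problems.append(f"rule {i}: missing 'category'")
        if rule.get("trust_tier") not in VALID_TIERS:
            problems.append(f"rule {i}: unknown trust_tier {rule.get('trust_tier')!r}")
    default_tier = doc.get("default_trust_tier")
    if default_tier is not None and default_tier not in VALID_TIERS:
        problems.append(f"default_trust_tier: unknown value {default_tier!r}")
    return problems


# The example rule file above, as a parsed dict.
doc = {
    "rules": [
        {"domain": "special.example.com", "category": "internal", "trust_tier": "TIER_1_OFFICIAL"},
        {"suffix": "gov.uk", "category": "government", "trust_tier": "TIER_1_OFFICIAL"},
    ],
    "default_category": "private",
    "default_trust_tier": "TIER_3_GENERAL",
}
print(validate_rules(doc))
```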
Development
make install
make test
License
MIT