Minimal URL extraction / classification / HTTP check pipeline.

urlcheck-smith

A compact, fast URL analysis library and pipeline:

Library first: Designed as a Python package for easy integration into your own scripts and data pipelines.
CLI Utilities: Provides powerful command-line tools for extraction, classification, and database management.
Extract URLs from arbitrary text files
Classify domains using suffix-based “site runner” rules (government, edu, private, etc.)
Trust Tier classification (Official, News, General)
Optional HTTP checks (status, redirect, CAPTCHA/human-check heuristic)
Output results as CSV or JSONL
Standalone URL classifier (classify-url)
Batch classification mode (classify)
Database management command (db) to enrich or add custom trusted domains
Supports custom YAML rules, explain mode, quiet mode
Classification: Assigns categories (e.g., government, education) based on domain suffix rules from the built-in UC Smith database.
HTTP Verification: Checks reachability and captures status codes.
Soft 404 Detection: Identifies pages that return a 200 OK status but contain "Page Not Found" text.
Trust Tier Analysis: Automatically categorizes URLs into TIER_1_OFFICIAL, TIER_2_RELIABLE, or TIER_3_GENERAL using TrustManager.
Human-Check Detection: Flags URLs that likely lead to CAPTCHA or bot-detection screens.
Enrichment: Query the Google Fact Check API to scout for known misinformation flags and update the credibility score.

Features in Detail

Soft 404 Detection

Many websites are configured to return a standard 200 OK status even when a page is missing, often displaying a custom "not found" message to users. urlcheck-smith detects this by scanning the first 2000 characters of the response for common markers like:

"page not found"
"error 404"
"the page you requested cannot be found"

If a marker is found, the soft_404_detected field in the output is set to True, allowing you to filter out these "ghost" pages from your results.
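
The check described above can be approximated in a few lines. This is a minimal sketch, not the library's actual implementation: the function name `looks_like_soft_404`, the `SOFT_404_MARKERS` tuple, and the case-insensitive comparison are assumptions. Only the 2000-character window and the three marker phrases come from the description above.

```python
# Hypothetical helper; urlcheck-smith's internal implementation may differ.
SOFT_404_MARKERS = (
    "page not found",
    "error 404",
    "the page you requested cannot be found",
)

SCAN_WINDOW = 2000  # number of leading characters to inspect


def looks_like_soft_404(status_code: int, body: str) -> bool:
    """Return True when a 200 response carries a "not found" marker near the top."""
    if status_code != 200:
        return False  # real error statuses are not "soft" 404s
    head = body[:SCAN_WINDOW].lower()  # assumption: matching is case-insensitive
    return any(marker in head for marker in SOFT_404_MARKERS)
```

Limiting the scan to the start of the body keeps the check cheap, at the cost of missing markers that appear further down the page.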

Trust Tier Classification

To help prioritize analysis, urlcheck-smith assigns a trust tier to each URL:

TIER_1_OFFICIAL: Government (.gov, .go.jp, etc.), UN, and official international domains.
TIER_2_RELIABLE: Verified news organizations (Reuters, AP, BBC, etc.) and educational institutions.
TIER_3_GENERAL: All other domains.

This is available via the trust_tier field in CSV/JSONL outputs.

Installation (development)

python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
pytest

Usage Guide

urlcheck-smith is primarily a Python library; it also provides a set of CLI utilities for common tasks.

1. `classify-url` — classify a single URL

This is the most straightforward way to use the package, either as a library or via the CLI.

API Example (Library Usage)

If you want to classify a URL from Python, you can use the public API directly. The example below shows a small helper script, scripts/classify_single_url.py, that demonstrates how to create a URL record and classify it.

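from urlcheck_smith import SiteClassifier, UrlRecord


def classify_single_url(
        url: str,
        *,
        rules_path: str | None = None,
        explain: bool = False,
) -> dict:
    """Classify one URL and return its fields as a plain dict."""
    # Configure the classifier; normalize_domain=True keeps domain handling consistent.
    classifier = SiteClassifier(
        rules_path=rules_path,
        explain=explain,
        normalize_domain=True,
    )

    # classify() takes a list of records, so wrap the URL and take the first result.
    rec = classifier.classify([UrlRecord(url=url)])[0]

    result = {
        "url": rec.url,
        "base_url": rec.base_url,
        "category": rec.category,
        "trust_tier": rec.trust_tier,
    }

    # Include the matching rule only when explain mode produced one.
    if rec.explain:
        result["explain"] = rec.explain

    return result


data = classify_single_url("https://www.itu.int/en/Pages/default.aspx", explain=True)
print(data)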

API Workflow Explained

Importing core components: SiteClassifier is the classification engine and UrlRecord is the data structure passed to it.
Initializing: SiteClassifier is instantiated with normalize_domain enabled so that domains are handled consistently.
Classification: classifier.classify(...) takes a list of UrlRecord objects. The example passes a one-item list and takes the first element ([0]) of the result.
Extracting results: The returned record provides the detected category, the trust_tier, and, in explain mode, the rule that matched.

CLI Example (Default JSON)

urlcheck-smith classify-url https://www.itu.int/en/Pages/default.aspx

CLI Example (Explain mode)

urlcheck-smith classify-url https://www.itu.int/en/Pages/default.aspx --explain

Output example:

{
  "url": "https://www.itu.int/en/Pages/default.aspx",
  "base_url": "www.itu.int",
  "category": "international",
  "trust_tier": "TIER_1_OFFICIAL",
  "explain": "Matched pattern 'int' -> category 'international'"
}

CLI Example (Quiet mode)

urlcheck-smith classify-url https://www.itu.int/en/Pages/default.aspx --quiet

CLI Example (Custom rules)

urlcheck-smith classify-url https://policy.example.com/ --rules org_rules.yaml

2. `scan` — extract → classify → (optional) HTTP check

Extracts URLs from files and performs classification and optional HTTP checks.
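
To illustrate the extraction step, here is a minimal sketch of pulling URLs out of free text. It is not the library's extractor: the `extract_urls` function, the regular expression, and the trailing-punctuation cleanup are all assumptions chosen for brevity.

```python
import re

# Assumption: a simple pattern for http(s) URLs; the library's own extractor may be stricter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_urls(text: str) -> list[str]:
    """Return unique URLs in order of first appearance."""
    seen: set[str] = set()
    urls: list[str] = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that was glued to the end of the URL.
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Stripping trailing punctuation is a heuristic: it fixes URLs that end a sentence but would also trim a URL that legitimately ends in one of those characters.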

CSV output (default)

urlcheck-smith scan sample.txt -o urls.csv

JSONL output

urlcheck-smith scan sample.txt \
  --no-http \
  --format jsonl \
  -o urls.jsonl
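
JSONL output is convenient to post-process from Python. The sketch below filters records by the trust_tier and soft_404_detected fields described earlier. The helper names `iter_records` and `keep_trusted` and the sample records are illustrative assumptions, and the code assumes soft_404_detected may be absent from a record (for example when HTTP checks are skipped).

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL: one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def keep_trusted(
    records: Iterable[dict],
    tiers: tuple = ("TIER_1_OFFICIAL", "TIER_2_RELIABLE"),
) -> list[dict]:
    """Keep records in the given tiers that were not flagged as soft 404s."""
    return [
        rec for rec in records
        if rec.get("trust_tier") in tiers
        and not rec.get("soft_404_detected", False)  # treat a missing flag as "not detected"
    ]


# Inline sample standing in for urls.jsonl; the records are illustrative, not real output.
sample = [
    '{"url": "https://www.itu.int/", "trust_tier": "TIER_1_OFFICIAL"}',
    '{"url": "https://example.com/", "trust_tier": "TIER_3_GENERAL"}',
    '{"url": "https://example.org/gone", "trust_tier": "TIER_2_RELIABLE", "soft_404_detected": true}',
]
for rec in keep_trusted(iter_records(sample)):
    print(rec["url"])
```

To process a real file, pass an open file handle to `iter_records` instead of the sample list, since a text file iterates line by line.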

Skip HTTP check

urlcheck-smith scan notes.txt --no-http -o urls_wo_status.csv

Custom rules

urlcheck-smith scan urls.txt \
  --rules my_rules.yaml \
  -o result.csv

Built-in rules

The system comes with a built-in database (ucsmith_db.yaml) containing thousands of government, educational, and news domains. These are used automatically.

3. `classify` — batch classify (no HTTP check)

Extracts and classifies URLs from input files.

CSV output

urlcheck-smith classify urls.txt -o classified.csv

JSONL output

urlcheck-smith classify urls.txt --format jsonl -o out.jsonl

Quiet mode

urlcheck-smith classify urls.txt --quiet

Explain mode

urlcheck-smith classify urls.txt --explain -o out.jsonl

Configuration

Rule Precedence

When multiple rule sources are used, they are applied in the following order, highest priority first:

User-defined entries in the database (added with the db add command)
User rules files (--rules flag)
Global rules in the database (ucsmith_db.yaml)

The system uses a longest-suffix-match strategy: more specific rules (e.g., blog.google.com) match before more general ones (e.g., google.com).
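
The longest-suffix-match idea can be sketched as follows. This is an illustration under assumptions, not the library's code: the function `longest_suffix_match`, the flat `rules` dict, and the category names in it are hypothetical.

```python
from typing import Optional


def longest_suffix_match(host: str, rules: dict[str, str]) -> Optional[tuple[str, str]]:
    """Return (pattern, category) for the most specific rule matching host, or None."""
    labels = host.lower().rstrip(".").split(".")
    # Try the full host first, then drop the leftmost label each time:
    # blog.google.com -> google.com -> com
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in rules:
            return candidate, rules[candidate]
    return None


# Hypothetical rule table; the categories are placeholders.
rules = {
    "google.com": "private",
    "blog.google.com": "blog",
    "gov.uk": "government",
}

print(longest_suffix_match("blog.google.com", rules))
print(longest_suffix_match("mail.google.com", rules))
```

Because candidates are tried from longest to shortest, the first hit is always the most specific rule, so no sorting of the rule table is needed.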

API Key (Optional)

A Google API key is only required for domain enrichment via the db update command. All core features (scanning, classification, HTTP checks) work without it.

This package reads the API key from:

UCSMITH_GOOGLE_API_KEY

Set it as follows:

export UCSMITH_GOOGLE_API_KEY="your-api-key"

If your key is stored under another variable name, map it:

export UCSMITH_GOOGLE_API_KEY="$YOUR_EXISTING_VAR"

Service: Google Fact Check Tools API
Usage: Used to scout for known misinformation flags to update domain credibility scores.
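
When driving enrichment from a script, it can help to check for the key before starting. The following is a small sketch; `get_api_key` is a hypothetical helper, not part of urlcheck-smith, and only the variable name UCSMITH_GOOGLE_API_KEY comes from this document.

```python
import os
from typing import Mapping, Optional


def get_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the stripped key, or None when it is unset or blank."""
    key = env.get("UCSMITH_GOOGLE_API_KEY", "").strip()
    return key or None


if get_api_key() is None:
    print("UCSMITH_GOOGLE_API_KEY is not set; enrichment (db update) will be unavailable.")
```

Accepting the environment as a parameter keeps the helper easy to test without touching the real process environment.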

4. `db` — manage the UC Smith database

Manage your local credibility database (ucsmith_db.yaml).

Add a trusted domain

urlcheck-smith db add my-org.com --category organization

Remove a domain

urlcheck-smith db remove my-org.com

Enrich domains via Google Fact Check API

Scouts for known misinformation-related signals and updates the credibility score in the local cache.

Requires UCSMITH_GOOGLE_API_KEY to be set in the environment.

Update a single domain:

urlcheck-smith db update example.com

Update domains from a file:

urlcheck-smith db update --file domains.txt

Update all previously discovered domains:

urlcheck-smith db update --all

Requirements

Set UCSMITH_GOOGLE_API_KEY in your environment
An internet connection is required
The result is stored in the local ucsmith_db.yaml cache

If the API key is missing, the command cannot perform enrichment.

Example: Bulk Enrichment Workflow

Export the API key:

export UCSMITH_GOOGLE_API_KEY="your-api-key"

Prepare a list of URLs (one per line) in your current directory:

echo "https://example.com/page1" > domains.txt
echo "https://malicious-site.org/news" >> domains.txt

Run the update command:

urlcheck-smith db update --file domains.txt

Check results: Your local ucsmith_db.yaml database is updated with credibility scores and flag counts. All subsequent scan or classify commands will use these enriched scores for the domains found in your file.

Rule System

Custom rule file example (YAML)

Rule files can specify rules (a list of matchers) and optional default_category / default_trust_tier. Note: The internal database uses a simplified name field for both domains and suffixes.

rules:
  - domain: "special.example.com"
    category: "internal"
    trust_tier: "TIER_1_OFFICIAL"
  - suffix: "gov.uk"
    category: "government"
    trust_tier: "TIER_1_OFFICIAL"

default_category: "private"
default_trust_tier: "TIER_3_GENERAL"
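
Before passing a rules file to --rules, a quick sanity check can catch typos. The sketch below works on the plain dict that a YAML parser (for example PyYAML's `yaml.safe_load`) would produce for the file above. The function `validate_rule_file` and the constraints it enforces (exactly one of domain or suffix per rule, a required category) are assumptions for illustration; the library may accept other shapes. The tier names are the three listed in this document.

```python
VALID_TIERS = {"TIER_1_OFFICIAL", "TIER_2_RELIABLE", "TIER_3_GENERAL"}


def validate_rule_file(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for index, rule in enumerate(config.get("rules", [])):
        # Assumption: each rule names exactly one matcher.
        matchers = [key for key in ("domain", "suffix") if key in rule]
        if len(matchers) != 1:
            problems.append(f"rule {index}: expected exactly one of 'domain' or 'suffix'")
        if "category" not in rule:
            problems.append(f"rule {index}: missing 'category'")
        tier = rule.get("trust_tier")
        if tier is not None and tier not in VALID_TIERS:
            problems.append(f"rule {index}: unknown trust_tier {tier!r}")
    default_tier = config.get("default_trust_tier")
    if default_tier is not None and default_tier not in VALID_TIERS:
        problems.append(f"unknown default_trust_tier {default_tier!r}")
    return problems


# The example rule file above, written as the dict a YAML parser would return.
rule_file = {
    "rules": [
        {"domain": "special.example.com", "category": "internal", "trust_tier": "TIER_1_OFFICIAL"},
        {"suffix": "gov.uk", "category": "government", "trust_tier": "TIER_1_OFFICIAL"},
    ],
    "default_category": "private",
    "default_trust_tier": "TIER_3_GENERAL",
}
print(validate_rule_file(rule_file))
```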

Development

make install
make test

License

MIT

Release history

0.8.0: Apr 29, 2026
0.6.0: Apr 9, 2026 (this version)
0.5.0: Apr 7, 2026
0.3.1: Apr 5, 2026
0.3.0: Mar 13, 2026
0.2.1: Jan 22, 2026
0.2.0: Jan 22, 2026
0.1.0: Dec 11, 2025

Download files

Source Distribution

urlcheck_smith-0.6.0.tar.gz (34.2 kB)

Uploaded Apr 9, 2026 Source

Built Distribution

urlcheck_smith-0.6.0-py3-none-any.whl (30.0 kB)

Uploaded Apr 9, 2026 Python 3

File details

Details for the file urlcheck_smith-0.6.0.tar.gz.

File metadata

Download URL: urlcheck_smith-0.6.0.tar.gz
Upload date: Apr 9, 2026
Size: 34.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for urlcheck_smith-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`0a7cf6d6ec79b007dfbd4c4dcd28f14c86f708007a8be526415cc3ec25209f7a`
MD5	`5f7e81669b4c38ce6e083c6e2d3c32b1`
BLAKE2b-256	`c4e4807eadca0c1a956d43e7b2489707596a9832816045d5ad26401ce1b9cf72`

File details

Details for the file urlcheck_smith-0.6.0-py3-none-any.whl.

File metadata

Download URL: urlcheck_smith-0.6.0-py3-none-any.whl
Upload date: Apr 9, 2026
Size: 30.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for urlcheck_smith-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6c0bffa5f6ec99805d553148414873fa7ecd0ca95b8335b0691721167cbb749b`
MD5	`c332437b59cb5a4d6a3863a3b8ae2c48`
BLAKE2b-256	`e287b23775f89b5fa0d7e36e7b45372ebe6c05e8eb5f7cf60a590d0e9bbd1f4c`
