Minimal URL extraction / classification / HTTP check pipeline.

urlcheck-smith

A compact, fast URL analysis pipeline:

Extract URLs from arbitrary text files
Classify domains using suffix-based “site runner” rules (government, edu, private, etc.)
Trust Tier classification (Official, News, General)
Optional HTTP checks (status, redirect, CAPTCHA/human-check heuristic)
Output results as CSV or JSONL
Standalone URL classifier (classify-url)
Batch classification mode (classify)
Supports rule presets (Japan/EU/global), custom YAML rules, explain mode, quiet mode
Classification: Assigns categories (e.g., government, education) based on domain suffix rules.
HTTP Verification: Checks reachability and captures status codes.
Soft 404 Detection: Identifies pages that return a 200 OK status but contain "Page Not Found" text.
Trust Tier Analysis: Automatically categorizes URLs into TIER_1_OFFICIAL, TIER_2_NEWS, or TIER_3_GENERAL using TrustManager.
Human-Check Detection: Flags URLs that likely lead to CAPTCHA or bot-detection screens.
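
The extraction step listed above can be pictured with a short sketch. This is not the package's own code: the regular expression and the function name `extract_urls` are assumptions chosen for illustration, and the real extractor may handle more edge cases.

```python
import re

# Hypothetical pattern: an http(s) scheme followed by characters that are
# not whitespace, quotes, angle brackets, or closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> list[str]:
    """Return the unique URLs found in text, in order of first appearance."""
    seen: set[str] = set()
    urls: list[str] = []
    for match in URL_PATTERN.finditer(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Deduplicating while preserving order keeps the later classification and HTTP steps from repeating work on the same URL.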

Features in Detail

Soft 404 Detection

Many websites are configured to return a standard 200 OK status even when a page is missing, often displaying a custom "not found" message to users. urlcheck-smith detects this by scanning the first 2000 characters of the response for common markers like:

"page not found"
"error 404"
"the page you requested cannot be found"

If a marker is found, the soft_404_detected field in the output is set to True, allowing you to filter out these "ghost" pages from your results.
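
The check described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name `looks_like_soft_404` is hypothetical, and the case-insensitive comparison is an assumption.

```python
# Markers taken from the list above; the real tool may know more.
SOFT_404_MARKERS = (
    "page not found",
    "error 404",
    "the page you requested cannot be found",
)


def looks_like_soft_404(status_code: int, body: str, scan_limit: int = 2000) -> bool:
    """Return True when a 200 response looks like a "not found" page."""
    if status_code != 200:
        return False  # a real error status is not a *soft* 404
    # Only the start of the body is scanned (assumed case-insensitive).
    head = body[:scan_limit].lower()
    return any(marker in head for marker in SOFT_404_MARKERS)
```

Limiting the scan to the first 2000 characters keeps the check cheap on large pages, at the cost of missing markers that appear further down.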

Trust Tier Classification

To help prioritize analysis, urlcheck-smith assigns a trust tier to each URL:

TIER_1_OFFICIAL: Government (.gov, .go.jp, etc.), UN, and official EU domains.
TIER_2_NEWS: Major news organizations (Reuters, AP, BBC, etc.).
TIER_3_GENERAL: All other domains.

This is available via the trust_tier field in CSV/JSONL outputs.
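
A rough sketch of how such a tiering could work is shown below. It does not reproduce the package's TrustManager: the function name `trust_tier` and the specific domain lists are assumptions for illustration, and the built-in lists are likely longer.

```python
from urllib.parse import urlsplit

# Illustrative lists only (assumed, not taken from the package).
OFFICIAL_DOMAINS = ("gov", "go.jp", "un.org", "europa.eu")
NEWS_DOMAINS = ("reuters.com", "apnews.com", "bbc.com", "bbc.co.uk")


def _is_within(host: str, domain: str) -> bool:
    """True if host equals domain or is a subdomain of it."""
    return host == domain or host.endswith("." + domain)


def trust_tier(url: str) -> str:
    """Map a URL to one of the three tier labels described above."""
    host = (urlsplit(url).hostname or "").lower()
    if any(_is_within(host, d) for d in OFFICIAL_DOMAINS):
        return "TIER_1_OFFICIAL"
    if any(_is_within(host, d) for d in NEWS_DOMAINS):
        return "TIER_2_NEWS"
    return "TIER_3_GENERAL"
```

Matching on a leading dot boundary avoids treating a host such as `notreuters.com` as a subdomain of `reuters.com`.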

Installation (development)

python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
pytest

Commands Overview

1. `scan` — extract → classify → (optional) HTTP check

CSV output (default)

urlcheck-smith scan sample.txt -o urls.csv

JSONL output

urlcheck-smith scan sample.txt \
  --no-http \
  --format jsonl \
  -o urls.jsonl
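
Because JSONL stores one JSON object per line, the output is easy to post-process. The sketch below tallies records by trust tier; it assumes each record carries a `trust_tier` field as described earlier, and the helper name `count_trust_tiers` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_trust_tiers(lines: Iterable[str]) -> Counter:
    """Count records per trust tier from an iterable of JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Records without the field are grouped under a placeholder label.
        counts[record.get("trust_tier", "UNKNOWN")] += 1
    return counts
```

An open file object can be passed directly, for example `count_trust_tiers(open("urls.jsonl", encoding="utf-8"))`, since iterating a text file yields its lines.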

Skip HTTP check

urlcheck-smith scan notes.txt --no-http -o urls_wo_status.csv

Custom rules

urlcheck-smith scan urls.txt \
  --rules my_rules.yaml \
  -o result.csv

Built-in rule presets

urlcheck-smith scan urls.txt --preset japan -o out.csv
urlcheck-smith scan urls.txt --preset eu -o out.csv
urlcheck-smith scan urls.txt --preset global -o out.csv

2. `classify-url` — classify a single URL

Default (JSON)

urlcheck-smith classify-url https://www.soumu.go.jp/

Explain mode

urlcheck-smith classify-url https://www.soumu.go.jp/ --explain

Output example:

{
  "url": "https://www.soumu.go.jp/",
  "base_url": "www.soumu.go.jp",
  "category": "government",
  "trust_tier": "TIER_1_OFFICIAL",
  "explain": {
    "matched_suffix": ".go.jp",
    "category": "government"
  }
}

Quiet mode (machine-friendly)

urlcheck-smith classify-url https://www.soumu.go.jp/ --quiet

Presets & custom rules

urlcheck-smith classify-url https://www.gov.uk/ --preset eu
urlcheck-smith classify-url https://policy.example.com/ --rules org_rules.yaml

3. `classify` — batch classify (no HTTP check)

Input file should contain one URL per line.

CSV output

urlcheck-smith classify urls.txt -o classified.csv

JSONL output

urlcheck-smith classify urls.txt --format jsonl -o out.jsonl

Quiet mode

urlcheck-smith classify urls.txt --quiet

Explain mode

urlcheck-smith classify urls.txt --explain -o out.jsonl

Rule Precedence & Discrepancies

When multiple rule sources are used, they are prioritized as follows:

User rules (--rules): Rules from these files are checked first. If multiple files are provided, the one specified last has the highest priority.
Base rules (--preset or default): These are checked only if no user rule matches.

The system uses a First-Match-Wins strategy. The first rule that matches the URL (by domain, suffix, or regex) determines the category and trust tier.
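
The ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the helper names are hypothetical, and only suffix rules are handled to keep the example short.

```python
from typing import Optional

Rule = dict  # e.g. {"suffix": ".go.jp", "category": "government"}


def ordered_rules(user_rule_files: list[list[Rule]], base_rules: list[Rule]) -> list[Rule]:
    """Flatten rule sources into a single list in priority order."""
    ordered: list[Rule] = []
    # The file given last on the command line is checked first.
    for rules in reversed(user_rule_files):
        ordered.extend(rules)
    # Preset or default rules are consulted only after all user rules.
    ordered.extend(base_rules)
    return ordered


def first_match(host: str, rules: list[Rule]) -> Optional[Rule]:
    """Return the first rule whose suffix matches host, or None."""
    for rule in rules:
        if host.endswith(rule["suffix"]):
            return rule
    return None
```

With this ordering, a broad rule placed early can shadow a more specific rule that appears later, which is worth keeping in mind when combining several rule files.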

Rule System

Custom rule file example (YAML)

Rule files can specify rules (a list of matchers) and optional default_category / default_trust_tier.

rules:
  - domain: "special.example.com"
    category: "internal"
    trust_tier: "TIER_1_OFFICIAL"
  - suffix: ".gov.uk"
    category: "government"
    trust_tier: "TIER_1_OFFICIAL"
  - regex: ".*-news\\.com$"
    category: "news"
    trust_tier: "TIER_2_NEWS"

default_category: "private"
default_trust_tier: "TIER_3_GENERAL"
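
To make the three matcher types concrete, here is a small sketch that mirrors the example rules as Python dictionaries. How the tool actually applies each matcher is not documented here, so the details are assumptions: `domain` is treated as an exact host match, `suffix` as a string suffix of the host, and `regex` as a search against the host. The function names are hypothetical, and only `category` is resolved; `trust_tier` would follow the same path.

```python
import re

# The example rules from the YAML file, reduced to the category field.
RULES = [
    {"domain": "special.example.com", "category": "internal"},
    {"suffix": ".gov.uk", "category": "government"},
    {"regex": r".*-news\.com$", "category": "news"},
]
DEFAULT_CATEGORY = "private"


def rule_matches(rule: dict, host: str) -> bool:
    """Check one rule against a host name (assumed matching semantics)."""
    if "domain" in rule:
        return host == rule["domain"]
    if "suffix" in rule:
        return host.endswith(rule["suffix"])
    if "regex" in rule:
        return re.search(rule["regex"], host) is not None
    return False


def categorize(host: str) -> str:
    """Return the category of the first matching rule, else the default."""
    for rule in RULES:
        if rule_matches(rule, host):
            return rule["category"]
    return DEFAULT_CATEGORY
```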

Built-in presets

--preset japan
--preset eu
--preset global

Each corresponds to a YAML file under urlcheck_smith/data/.

Development

make install
make test

License

MIT

Release history

0.8.0 (Apr 29, 2026)
0.6.0 (Apr 9, 2026)
0.5.0 (Apr 7, 2026)
0.3.1 (Apr 5, 2026), this version
0.3.0 (Mar 13, 2026)
0.2.1 (Jan 22, 2026)
0.2.0 (Jan 22, 2026)
0.1.0 (Dec 11, 2025)
