
Minimal URL extraction / classification / HTTP check pipeline.


urlcheck-smith


A compact, fast URL analysis pipeline:

  • Extract URLs from arbitrary text files
  • Classify domains by site operator type (government, education, private, etc.) using suffix-based rules
  • Optional HTTP checks (status, redirect, CAPTCHA/human-check heuristic)
  • Output results as CSV or JSONL
  • Standalone URL classifier (classify-url)
  • Batch classification mode (classify)
  • Supports rule presets (Japan/EU/global), custom YAML rules, explain mode, quiet mode
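
The extraction step in the list above can be illustrated with a minimal sketch. It is not the package's actual implementation: the regular expression, the function name extract_urls, and the trailing-punctuation cleanup are all assumptions made for this example.

```python
import re

# Hypothetical pattern (an assumption, not urlcheck-smith's own):
# an http(s) URL runs until whitespace or a common closing delimiter.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> list[str]:
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_RE.finditer(text):
        # Drop sentence punctuation that often trails a URL in prose.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = (
    "See https://www.soumu.go.jp/ and (https://example.com/page). "
    "Also https://www.soumu.go.jp/ again."
)
print(extract_urls(sample))
```

Deduplicating while preserving order keeps the output stable across runs, which helps when comparing result files.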

Installation (development)

python3 -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
pytest

Commands Overview


1. scan — extract → classify → (optional) HTTP check

CSV output (default)

urlcheck-smith scan sample.txt -o urls.csv

JSONL output

urlcheck-smith scan sample.txt \
  --no-http \
  --format jsonl \
  -o urls.jsonl
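
A JSONL file holds one JSON object per line, so it can be processed as a stream. The sketch below tallies categories from such a file. The field names url and category are assumed to match the classify-url output, and the sample records are hypothetical stand-ins for a real urls.jsonl.

```python
import io
import json
from collections import Counter


def count_categories(lines):
    """Count records per category from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # "category" is an assumed field name; fall back if it is absent.
        counts[record.get("category", "unknown")] += 1
    return counts


# Hypothetical sample standing in for open("urls.jsonl", encoding="utf-8").
sample = io.StringIO(
    '{"url": "https://www.soumu.go.jp/", "category": "government"}\n'
    '{"url": "https://example.org/", "category": "private"}\n'
    '{"url": "https://www.mext.go.jp/", "category": "government"}\n'
)
counts = count_categories(sample)
print(counts["government"], counts["private"])
```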

Skip HTTP check

urlcheck-smith scan notes.txt --no-http -o urls_wo_status.csv

Custom rules

urlcheck-smith scan urls.txt \
  --rules my_rules.yaml \
  -o result.csv

Built-in rule presets

urlcheck-smith scan urls.txt --preset japan -o out.csv
urlcheck-smith scan urls.txt --preset eu -o out.csv
urlcheck-smith scan urls.txt --preset global -o out.csv

2. classify-url — classify a single URL

Default (JSON)

urlcheck-smith classify-url https://www.soumu.go.jp/

Explain mode

urlcheck-smith classify-url https://www.soumu.go.jp/ --explain

Output example:

{
  "url": "https://www.soumu.go.jp/",
  "base_url": "www.soumu.go.jp",
  "category": "government",
  "explain": {
    "matched_suffix": ".go.jp",
    "category": "government"
  }
}
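
Because the output is plain JSON, a script can consume it with the standard library. The sketch below parses the example output shown above. In practice the string would come from the command's standard output; here it is pasted in directly as an assumption for illustration.

```python
import json

# The explain-mode output from the example above, pasted in as a string.
raw = """
{
  "url": "https://www.soumu.go.jp/",
  "base_url": "www.soumu.go.jp",
  "category": "government",
  "explain": {
    "matched_suffix": ".go.jp",
    "category": "government"
  }
}
"""

result = json.loads(raw)

# The top-level category and the suffix that produced it.
print(result["category"])
print(result["explain"]["matched_suffix"])
```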

Quiet mode (machine-friendly)

urlcheck-smith classify-url https://www.soumu.go.jp/ --quiet

Presets & custom rules

urlcheck-smith classify-url https://www.gov.uk/ --preset eu
urlcheck-smith classify-url https://policy.example.com/ --rules org_rules.yaml

3. classify — batch classify (no HTTP check)

Input file should contain one URL per line.

CSV output

urlcheck-smith classify urls.txt -o classified.csv

JSONL output

urlcheck-smith classify urls.txt --format jsonl -o out.jsonl

Quiet mode

urlcheck-smith classify urls.txt --quiet

Explain mode

urlcheck-smith classify urls.txt --explain --format jsonl -o out.jsonl

Rule System

Custom rule file example

suffix_rules:
  - suffix: ".go.jp"
    category: government
  - suffix: ".example.com"
    category: internal

default_category: private
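
To show how a rule file like this might be applied, here is a minimal sketch of suffix matching. It mirrors the YAML above as a plain dict so that it runs without a YAML parser. The function name classify_host and the longest-suffix-wins tie-break are assumptions for this example, not confirmed behavior of urlcheck-smith.

```python
from urllib.parse import urlsplit

# The rule file above, written as a dict (YAML loading is left out).
RULES = {
    "suffix_rules": [
        {"suffix": ".go.jp", "category": "government"},
        {"suffix": ".example.com", "category": "internal"},
    ],
    "default_category": "private",
}


def classify_host(url, rules=RULES):
    """Return the category of the longest suffix rule matching the URL's host."""
    host = (urlsplit(url).hostname or "").lower()
    best = None
    for rule in rules["suffix_rules"]:
        if host.endswith(rule["suffix"]):
            # Assumption: when several rules match, the longest suffix wins.
            if best is None or len(rule["suffix"]) > len(best["suffix"]):
                best = rule
    return best["category"] if best else rules["default_category"]


print(classify_host("https://www.soumu.go.jp/"))
print(classify_host("https://policy.example.com/"))
print(classify_host("https://www.python.org/"))
```

Under this sketch, a suffix written with a leading dot matches subdomains only: the bare host example.com does not end with ".example.com" and so falls through to the default category.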

Built-in presets

  • --preset japan
  • --preset eu
  • --preset global

Each corresponds to a YAML file under urlcheck_smith/data/.


Development

make install
make test

License

MIT

