File semantic integrity validator — fills the gap ZFS checksums miss

SemanticDog

Your NAS keeps your files safe from hardware failure. SemanticDog checks they're still actually openable.

ZFS and RAID verify that bits on disk match what was written. That's not the same as verifying a JPEG can be decoded, a RAW file parsed, or a PDF opened. Bit-rot, partial writes, and failed copies can produce files that pass every checksum but are silently broken at the application layer — you won't find out until you need them.
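
To make the distinction concrete, here is a minimal sketch of an application-layer check, written for illustration only. It is not SemanticDog's validator, and the function names are hypothetical. It walks the chunks of a PNG, verifies each chunk's CRC, and requires a closing IEND chunk. A truncated copy fails this check even though a storage checksum computed over the truncated bytes would keep matching on every read.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_minimal_png() -> bytes:
    """Build a valid 1x1 8-bit grayscale PNG entirely in memory."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # one filter byte + one pixel
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", idat)
        + _chunk(b"IEND", b"")
    )


def png_is_structurally_valid(blob: bytes) -> bool:
    """Return True if every chunk is complete, CRC-clean, and IEND is present."""
    if not blob.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    while pos + 12 <= len(blob):  # 12 bytes is the smallest possible chunk
        (length,) = struct.unpack(">I", blob[pos:pos + 4])
        kind = blob[pos + 4:pos + 8]
        end = pos + 8 + length + 4
        if end > len(blob):
            return False  # chunk claims more bytes than the file holds
        data = blob[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", blob[end - 4:end])
        if stored_crc != zlib.crc32(kind + data) & 0xFFFFFFFF:
            return False  # chunk contents were altered
        if kind == b"IEND":
            return True
        pos = end
    return False  # ran out of bytes before reaching IEND
```

For example, png_is_structurally_valid(make_minimal_png()) is true, while the same bytes with the last 20 removed are rejected because the IDAT chunk runs past the end of the file.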

SemanticDog scans your library on a schedule, tells you which files are corrupt, and alerts you before you need them.

Works with AI agents. SemanticDog exposes an MCP server — Claude and other agents can query scan results, trigger scans, and reason about your library health directly.


Install

# with pip
pip install semanticdog

# with uv
uv tool install semanticdog

Then verify your system has the tools it needs:

sdog check-deps

The only hard requirement is Python 3.12+. Two optional extras unlock more formats: install ffmpeg for video validation and pillow-heif for HEIC. Everything else is bundled.


First scan

sdog scan /mnt/photos

Results go into a local SQLite database. Progress is printed to stderr every 5 seconds:

Discovered 15234 files.
Scan ID: abc123-...  (resume with: sdog scan --resume abc123-...)
  [1500/15234]  9.8%  ok:1498  corrupt:2  unreadable:0  43.1 f/s  ETA: ~3.2 min

When it finishes:

sdog show-stats      # health dashboard — "is everything OK?" aggregate view
sdog report          # drill-down — lists individual corrupt files with error details

Exit code is 0 if everything is clean and 2 if corrupt or unreadable files were found, so it works naturally in scripts and CI.
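
As a rough sketch of how a script might act on those exit codes, the wrapper below runs sdog scan and turns the result into a readable message. The code values come from the CLI exit codes reference further down; the helper names describe_exit and run_scan are hypothetical, and sdog is assumed to be on PATH.

```python
import subprocess

# Exit codes documented for `sdog scan`.
EXIT_MEANINGS = {
    0: "all files OK",
    1: "config, database, or scan error",
    2: "corrupt or unreadable files found",
    130: "interrupted (Ctrl+C)",
}


def describe_exit(code: int) -> str:
    """Translate a `sdog scan` exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_scan(path: str) -> int:
    """Run `sdog scan PATH` and return its exit code without raising on failure."""
    completed = subprocess.run(["sdog", "scan", path], check=False)
    return completed.returncode
```

A CI job could call run_scan("/mnt/photos"), log describe_exit(code), and fail the build on any non-zero result.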

Resuming an interrupted scan

If a scan is interrupted (Ctrl+C, SIGTERM, crash), it stays resumable:

sdog scan --resume abc123-...

Use sdog list-scans to find incomplete scan IDs.


What the results mean

| Status | What it means | What to do |
|---|---|---|
| ok | File opened and parsed successfully | Nothing |
| corrupt | File is structurally broken | Restore from backup |
| unreadable | Couldn't open the file at all | Check mount / permissions; usually not the file's fault |
| unsupported | Library version doesn't recognise this format variant | Update libraries; not flagged as corrupt |
| error | Validator crashed or timed out | Check sdog report --format json for details |

unreadable usually means a mount problem, not corruption. If you suddenly see many unreadable files, check your NAS connectivity before investigating individual files.


Supported formats

Photos: JPEG · PNG · TIFF · HEIC · WebP
RAW: CR2 · CR3 · NEF · ARW · ORF · RW2 · PEF · DNG · RAF · NRW
Documents: PDF · DOCX · XLSX · PPTX · DOC · XLS · PPT
Video: MP4 · MOV · MTS · M4V · MKV
Audio: MP3 · FLAC · WAV · AAC


Scheduled scanning

Run it from cron, for example every night at 02:00:

0 2 * * * sdog scan --config /data/config/config.yaml >> /data/logs/sdog.log 2>&1

On subsequent runs, only changed files are re-validated. A 100k-photo library might take an hour on first scan and two minutes after that.
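
The README does not spell out how a changed file is detected. One plausible reading, sketched below purely as an assumption, is that a file is re-validated when its mtime or size differs from the stored record, or when the last check is older than force_recheck_days. FileRecord and needs_validation are hypothetical names; the fields mirror the mtime, size, and checked_at columns of the files table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FileRecord:
    """The subset of a stored `files` row needed for the decision."""
    mtime: float
    size: int
    checked_at: datetime


def needs_validation(
    mtime: float,
    size: int,
    record: Optional[FileRecord],
    now: datetime,
    force_recheck_days: int = 90,
) -> bool:
    """Decide whether a file should be validated again on this run."""
    if record is None:
        return True  # never seen before
    if record.mtime != mtime or record.size != size:
        return True  # file changed on disk since the last check
    # Unchanged, but re-check periodically (assumed meaning of force_recheck_days).
    return now - record.checked_at >= timedelta(days=force_recheck_days)
```

Under this scheme most files in a stable library are skipped after a stat call, which would be consistent with the large gap between first-scan and follow-up scan times.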


Notifications

Get alerted when corrupt files are found.

Email:

notify_email: you@example.com
smtp_host: smtp.example.com
smtp_user: sdog@example.com
smtp_pass: ""   # use SDOG_SMTP_PASS env var

Webhook (Gotify, Ntfy, Pushover, Slack):

webhook_url: https://gotify.example.com/message?token=abc

Alerts only fire on the first detection — no repeat notifications for the same broken file.
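
The notified_at column in the files table (NULL means not yet notified) suggests how first-detection-only alerting can work. The sketch below shows the idea against an in-memory SQLite database. It is an illustration under that assumption, not SemanticDog's notifier, and the helper names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Iterable, List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the documented `files` columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE files (
               path TEXT PRIMARY KEY, mtime REAL, size INTEGER,
               status TEXT, error TEXT, suggested_action TEXT,
               checked_at TEXT, scan_id TEXT, notified_at TEXT)"""
    )
    return conn


def pending_alerts(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Corrupt files that have never been included in a notification."""
    return conn.execute(
        "SELECT path, error FROM files "
        "WHERE status = 'corrupt' AND notified_at IS NULL ORDER BY path"
    ).fetchall()


def mark_notified(conn: sqlite3.Connection, paths: Iterable[str]) -> None:
    """Stamp files as notified so later runs do not alert on them again."""
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "UPDATE files SET notified_at = ? WHERE path = ?",
        [(now, path) for path in paths],
    )
    conn.commit()
```

A notifier built this way sends pending_alerts(), then calls mark_notified() only after the message was delivered, so a failed send is retried on the next run.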


AI agent integration (MCP)

SemanticDog has a built-in MCP server. Connect Claude or any MCP-compatible agent to query scan results and trigger scans conversationally.

Enable in config:

mcp_enabled: true
mcp_allow_write: true   # lets agents trigger scans and reset records

Then start the server with the auth token set in the environment:

SDOG_MCP_AUTH_TOKEN=your-secret uvicorn semanticdog.server:app --port 9090

Add to Claude Code (~/.claude/settings.json):

{
  "mcpServers": {
    "semanticdog": {
      "type": "sse",
      "url": "http://localhost:9090/mcp/sse",
      "headers": { "Authorization": "Bearer your-secret" }
    }
  }
}

Once connected, you can ask Claude things like "which photos are corrupt?" or "scan my 2024 folder and summarize the results".


Configuration

Config is loaded automatically from the first location found:

  1. ./config.yaml (current directory)
  2. ~/.config/semanticdog/config.yaml
  3. /data/config/config.yaml (Docker/NAS default)

Override with --config /path/to/config.yaml on any command.
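
The lookup order can be pictured with a short sketch like the one below. It assumes that an explicit --config path always wins and that the first existing file from the list is used otherwise; find_config is a hypothetical helper, not the package's load_config().

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Search order from the list above.
DEFAULT_LOCATIONS = (
    Path("config.yaml"),
    Path(os.path.expanduser("~/.config/semanticdog/config.yaml")),
    Path("/data/config/config.yaml"),
)


def find_config(
    explicit: Optional[str] = None,
    candidates: Iterable[Path] = DEFAULT_LOCATIONS,
) -> Optional[Path]:
    """Return the config file to load, or None if nothing was found."""
    if explicit is not None:
        return Path(explicit)  # --config bypasses the search entirely
    for candidate in candidates:
        if candidate.is_file():
            return candidate  # first match wins; later locations are ignored
    return None
```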

Example config.yaml:

paths:
  - /mnt/photos
  - /mnt/documents

db_path: ~/.local/share/semanticdog/state.db   # or /data/state/state.db in Docker

workers: 4        # parallel validators
raw_workers: 2    # RAW uses more memory — keep lower than workers

schedule: "0 2 * * *"

Every option has a matching SDOG_* environment variable. Env vars always override the YAML file. Full reference is in config.example.yaml.
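
A minimal sketch of that precedence rule follows. It covers only a handful of keys, takes the already-parsed YAML as a plain dict, and guesses at the value parsing: colon-separated lists for SDOG_PATHS as noted in the key table, plus a hypothetical set of accepted boolean spellings. apply_env_overrides is an illustrative name, not part of the package.

```python
from typing import Any, Callable, Dict, List, Mapping, Tuple


def _colon_list(raw: str) -> List[str]:
    """Split a colon-separated value, dropping empty segments."""
    return [part for part in raw.split(":") if part]


def _flag(raw: str) -> bool:
    """Parse a boolean; the accepted spellings here are an assumption."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Config key -> (environment variable, parser for its string value).
ENV_OVERRIDES: Dict[str, Tuple[str, Callable[[str], Any]]] = {
    "paths": ("SDOG_PATHS", _colon_list),
    "db_path": ("SDOG_DB_PATH", str),
    "workers": ("SDOG_WORKERS", int),
    "raw_workers": ("SDOG_RAW_WORKERS", int),
    "mcp_enabled": ("SDOG_MCP_ENABLED", _flag),
}


def apply_env_overrides(
    yaml_values: Mapping[str, Any], environ: Mapping[str, str]
) -> Dict[str, Any]:
    """Return a merged config in which set env vars replace YAML values."""
    merged = dict(yaml_values)
    for key, (variable, parse) in ENV_OVERRIDES.items():
        if variable in environ:
            merged[key] = parse(environ[variable])
    return merged
```

Passing os.environ as the second argument gives the real behaviour; passing a plain dict keeps the function easy to test.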


HTTP API and Prometheus

Start the server:

uvicorn semanticdog.server:app --port 9090

Endpoints:

  • GET /health: returns {"status":"ok"}
  • GET /metrics: Prometheus scrape endpoint
  • POST /trigger: kick off a scan remotely; an optional JSON body such as {"scope": "/mnt/photos/2024"} restricts the scan to that path
  • GET /status: current state and file counts as JSON

Troubleshooting

New camera RAW files show unsupported
LibRaw adds new cameras gradually. unsupported is not corruption — the file is fine, just unrecognised. Fix: pip install -U rawpy.

Many unreadable files suddenly
Almost always a mount going offline or a permission change. SemanticDog flags this as a suspected mount failure in the notification if more than half the scan is unreadable.
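
The more-than-half rule is simple enough to sketch. The function below is an illustration of the heuristic as described here, taking per-status counts such as the by_status mapping returned by GET /status; the name suspect_mount_failure and the threshold parameter are hypothetical.

```python
from typing import Mapping


def suspect_mount_failure(by_status: Mapping[str, int], threshold: float = 0.5) -> bool:
    """Return True when strictly more than `threshold` of files are unreadable."""
    total = sum(by_status.values())
    if total == 0:
        return False  # nothing scanned, nothing to conclude
    return by_status.get("unreadable", 0) / total > threshold
```

When this returns True, checking the mount and permissions first is usually more productive than inspecting individual files.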

HEIC not validating
Needs pillow-heif: pip install pillow-heif. Run sdog check-deps to see everything that's missing at once.

Video not validating
Needs ffmpeg: apt install ffmpeg / brew install ffmpeg.

Moved your library to a new path

sdog db-export -o backup.json
sdog db-import -i backup.json --path-map /old/path:/new/path
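
What --path-map does to each stored path is presumably a prefix rewrite. The sketch below shows one way such a rewrite can be done safely, matching only at directory boundaries so that /old/path does not also rewrite /old/pathology. It assumes POSIX-style paths and the OLD:NEW format shown above; parse_path_map and remap_path are hypothetical names, and the real command may behave differently.

```python
from typing import Tuple


def parse_path_map(spec: str) -> Tuple[str, str]:
    """Split an OLD:NEW spec into its two prefixes."""
    old, sep, new = spec.partition(":")
    if not sep or not old or not new:
        raise ValueError(f"expected OLD:NEW, got {spec!r}")
    return old.rstrip("/"), new.rstrip("/")


def remap_path(path: str, old: str, new: str) -> str:
    """Replace the `old` prefix with `new`, only at a directory boundary."""
    if path == old or path.startswith(old + "/"):
        return new + path[len(old):]
    return path  # outside the mapped tree: leave untouched
```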

AI Agent Reference — structured data for agents and tooling

Project identity

name:       semanticdog
binary:     sdog
module:     semanticdog
python:     >=3.12
entrypoint: semanticdog/cli.py

Repository layout

semanticdog/
  cli.py            CLI — all commands (typer)
  config.py         Config dataclass + load_config() + env override
  db.py             Database — SQLite WAL, all queries
  scanner.py        Scanner + walk_paths() + _validate_file() pebble worker
  server.py         FastAPI — /health /metrics /status /trigger + build_app()
  notify.py         ScanSummary, Notifier, SmtpNotifier, WebhookNotifier
  mcp_server.py     MCP SSE transport
  exceptions.py     ConfigError, DatabaseError, LockError
  validators/
    __init__.py     registry: register(), get_validator(), all_extensions()
    base.py         BaseValidator, ValidationResult, DependencyReport
    images.py       JpegValidator PngValidator TiffValidator HeicValidator WebpValidator
    raw.py          RawValidator
    documents.py    PdfValidator OoxmlValidator OleValidator
    media.py        VideoValidator AudioValidator
tests/
  fixtures/generators.py   make_minimal_jpeg, make_corrupt_jpeg, make_minimal_png, ...
  test_e2e.py               37 end-to-end tests (no mocks, real files)
  test_server.py / test_scanner.py / test_db.py / test_notify.py / test_*.py

CLI exit codes

sdog scan: 0 = all OK · 1 = config/DB/scan error · 2 = corrupt or unreadable files found · 130 = interrupted (Ctrl+C)
sdog check-deps: 0 = all hard deps present · 1 = hard dep missing

HTTP API

GET  /health      → 200 {"status":"ok"}
GET  /status      → 200 {status, files_indexed, by_status, last_scan}
GET  /metrics     → 200 Prometheus text
POST /trigger     → 200 {status:"complete", scan_id}
                    400 scope outside configured roots
                    409 scan already running
                    429 cooldown {retry_after_s}
                    503 not configured
GET  /mcp/sse     → SSE stream (requires mcp_enabled=true + mcp_auth_token)
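
A small client for POST /trigger might look like the sketch below, using only the standard library. The status codes are the ones listed above. Whether the endpoint needs authentication is not stated here, so none is sent, and the helper names are hypothetical. Since a 200 response reports status "complete", the call may block for the length of the scan, so no timeout is set by default.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Dict, Optional

# Non-200 responses documented for POST /trigger.
TRIGGER_ERRORS = {
    400: "scope outside configured roots",
    409: "scan already running",
    429: "cooldown in effect",
    503: "not configured",
}


def build_trigger_request(base_url: str, scope: Optional[str] = None) -> urllib.request.Request:
    """Build the POST /trigger request, with an optional scope in the JSON body."""
    url = base_url.rstrip("/") + "/trigger"
    if scope is None:
        return urllib.request.Request(url, method="POST")
    body = json.dumps({"scope": scope}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def trigger_scan(
    base_url: str, scope: Optional[str] = None, timeout: Optional[float] = None
) -> Dict[str, Any]:
    """Trigger a scan and return the decoded JSON response."""
    request = build_trigger_request(base_url, scope)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        reason = TRIGGER_ERRORS.get(exc.code, "unexpected response")
        raise RuntimeError(f"POST /trigger returned {exc.code}: {reason}") from exc
```

For instance, trigger_scan("http://localhost:9090", scope="/mnt/photos/2024") would request a scan of that folder on a locally running server.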

Config keys → env vars

| Key | Env var | Default |
|---|---|---|
| paths | SDOG_PATHS (colon-sep) | [] |
| exclude | SDOG_EXCLUDE (colon-sep) | ["**/@eaDir/**", ...] |
| db_path | SDOG_DB_PATH | /data/state/state.db |
| workers | SDOG_WORKERS | 4 |
| raw_workers | SDOG_RAW_WORKERS | 2 |
| raw_decode_depth | SDOG_RAW_DECODE_DEPTH | structure |
| validation_timeout_s | SDOG_VALIDATION_TIMEOUT_S | 120 |
| force_recheck_days | SDOG_FORCE_RECHECK_DAYS | 90 |
| http_port | SDOG_HTTP_PORT | 9090 |
| notify_email | SDOG_NOTIFY_EMAIL | "" |
| smtp_pass | SDOG_SMTP_PASS | "" |
| webhook_url | SDOG_WEBHOOK_URL | "" |
| mcp_enabled | SDOG_MCP_ENABLED | false |
| mcp_auth_token | SDOG_MCP_AUTH_TOKEN | "" |
| mcp_allow_write | SDOG_MCP_ALLOW_WRITE | false |
| mcp_rate_limit_s | SDOG_MCP_RATE_LIMIT_S | 60 |

Database schema

files (
  path TEXT PRIMARY KEY,
  mtime REAL, size INTEGER,
  status TEXT,           -- ok|corrupt|unreadable|unsupported|error
  error TEXT, suggested_action TEXT,
  checked_at TEXT,       -- ISO 8601
  scan_id TEXT,
  notified_at TEXT       -- NULL = not yet notified
)
scans (
  id TEXT PRIMARY KEY,
  started_at TEXT, finished_at TEXT,  -- finished_at NULL = incomplete/resumable
  total INTEGER, corrupt INTEGER, unreadable INTEGER,
  scope TEXT,            -- NULL = all paths
  files_per_sec REAL
)
scan_queue (
  scan_id TEXT, path TEXT,
  done INTEGER DEFAULT 0  -- 0 = pending, 1 = complete
)
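
Given this schema, resumable scans and their remaining work can be found with two queries. The sketch below runs them against an in-memory database. The CREATE TABLE statements are reconstructed from the column lists above and may differ from the real DDL; incomplete_scans and pending_paths are hypothetical helpers rather than methods of the Database class.

```python
import sqlite3
from typing import List

# Tables reconstructed from the documented columns (constraints are assumptions).
DEMO_SCHEMA = """
CREATE TABLE scans (
    id TEXT PRIMARY KEY,
    started_at TEXT, finished_at TEXT,
    total INTEGER, corrupt INTEGER, unreadable INTEGER,
    scope TEXT, files_per_sec REAL
);
CREATE TABLE scan_queue (
    scan_id TEXT, path TEXT,
    done INTEGER DEFAULT 0
);
"""


def make_demo_db() -> sqlite3.Connection:
    """Create an empty in-memory database with the two scan tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DEMO_SCHEMA)
    return conn


def incomplete_scans(conn: sqlite3.Connection) -> List[str]:
    """IDs of scans that never finished (finished_at is NULL)."""
    rows = conn.execute(
        "SELECT id FROM scans WHERE finished_at IS NULL ORDER BY started_at"
    ).fetchall()
    return [row[0] for row in rows]


def pending_paths(conn: sqlite3.Connection, scan_id: str) -> List[str]:
    """Paths still queued (done = 0) for the given scan."""
    rows = conn.execute(
        "SELECT path FROM scan_queue WHERE scan_id = ? AND done = 0 ORDER BY path",
        (scan_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

A resume then only has to validate pending_paths(conn, scan_id) instead of walking the library again.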

Key internal APIs

from semanticdog.config import load_config
from semanticdog.db import Database
from semanticdog.scanner import Scanner

cfg   = load_config("config.yaml")                        # YAML + env override
db    = Database(cfg.db_path)
stats = Scanner(cfg, db).scan()                           # all paths → ScanStats
stats = Scanner(cfg, db).scan(["/sub"])                   # scoped scan
stats = Scanner(cfg, db).scan(resume_scan_id="abc123-…") # resume interrupted scan

# stats.scan_id — ID of the scan just run (for resume)

db.get_corrupt_files(since="2025-01-01", ext="cr2", path_prefix="/mnt")
db.get_stats()         # {"total": N, "total_size_bytes": N, "by_status": {...}}
db.get_format_counts() # [(ext, count), ...] sorted by count
db.get_stale_count(days=90)
db.get_top_errors(limit=5)
db.get_scan(scan_id)   # single scan record dict
db.list_scans(limit=10)
db.export_json()
db.import_json(records, force=False, path_map={"/old": "/new"})

Running tests

uv run pytest                       # 427 tests
uv run pytest tests/test_e2e.py -v  # E2E only (real files, no mocks)

Known limitations

  • RAW unsupported ≠ corrupt — LibRaw doesn't cover all camera models
  • HEIC: primary frame only; burst/live photo secondary frames skipped
  • Sidecars (.XMP, .AAE): validated independently, no pair correlation
  • verify-hashes command: not yet implemented

Download files

Source Distribution

semanticdog-0.2.0.tar.gz (144.1 kB)

Uploaded Source

Built Distribution

semanticdog-0.2.0-py3-none-any.whl (37.3 kB)

Uploaded Python 3

File details

Details for the file semanticdog-0.2.0.tar.gz.

File metadata

  • Download URL: semanticdog-0.2.0.tar.gz
  • Upload date:
  • Size: 144.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for semanticdog-0.2.0.tar.gz
Algorithm Hash digest
SHA256 346e1cb1b876603c6bf077c0902445d1d345f4dfa3d4afbe8edcf3a3cea4c600
MD5 5cfe6f1e76fe5105c3db65a5710ef1f7
BLAKE2b-256 e35c06bceb7b8509243096f95e2fd13b934f72aa628cdf6c3a1498f83107216c

Provenance

The following attestation bundles were made for semanticdog-0.2.0.tar.gz:

Publisher: release.yml on kytmanov/semantic-dog

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file semanticdog-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: semanticdog-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 37.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for semanticdog-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 32d458f85c3f454392943ff8f67b66d4a1fcfaebe5dcecddcd2d4bd742fe1ed6
MD5 c33eea617f8aa196194e1d194723eba6
BLAKE2b-256 57195a44dd86cbfd99fde86a07c27eb71285abc23ce9edfb8370ca05a8cc3177

Provenance

The following attestation bundles were made for semanticdog-0.2.0-py3-none-any.whl:

Publisher: release.yml on kytmanov/semantic-dog

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
