File semantic integrity validator — fills the gap ZFS checksums miss
SemanticDog
Your NAS keeps your files safe from hardware failure. SemanticDog checks they're still actually openable.
ZFS and RAID verify that bits on disk match what was written. That's not the same as verifying a JPEG can be decoded, a RAW file parsed, or a PDF opened. Bit-rot, partial writes, and failed copies can produce files that pass every checksum but are silently broken at the application layer — you won't find out until you need them.
SemanticDog scans your library on a schedule, tells you which files are corrupt, and alerts you before you need them.
Works with AI agents. SemanticDog exposes an MCP server — Claude and other agents can query scan results, trigger scans, and reason about your library health directly.
Install
# with pip
pip install semanticdog
# with uv
uv tool install semanticdog
Then verify your system has the tools it needs:
sdog check-deps
The only hard requirement is Python 3.12+. Install ffmpeg for video, pillow-heif for HEIC — everything else is bundled.
First scan
sdog scan /mnt/photos
Results go into a local SQLite database. Progress is printed to stderr every 5 seconds:
Discovered 15234 files.
Scan ID: abc123-... (resume with: sdog scan --resume abc123-...)
[1500/15234] 9.8% ok:1498 corrupt:2 unreadable:0 43.1 f/s ETA: ~3.2 min
When it finishes:
sdog show-stats # health dashboard — "is everything OK?" aggregate view
sdog report # drill-down — lists individual corrupt files with error details
Exit code is 0 if everything is clean, 2 if issues were found — works naturally in scripts and CI.
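A minimal sketch of how a wrapper script might act on these exit codes. The mapping follows the values listed for sdog scan under "CLI exit codes" (0, 1, 2, 130); `describe_exit_code` and `run_scan` are hypothetical helper names, not part of SemanticDog.

```python
import subprocess

# Exit codes documented for `sdog scan`; anything else is reported as unknown.
EXIT_MEANINGS = {
    0: "clean",
    1: "config/DB/scan error",
    2: "corrupt or unreadable files found",
    130: "interrupted",
}


def describe_exit_code(code: int) -> str:
    """Translate an `sdog scan` exit code into a short label."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def run_scan(path: str) -> str:
    """Run `sdog scan PATH` and return a label for its exit code."""
    # check=False: a non-zero exit is an expected outcome, not an exception.
    completed = subprocess.run(["sdog", "scan", path], check=False)
    return describe_exit_code(completed.returncode)
```

A backup job could call `run_scan("/mnt/photos")` and skip pruning old snapshots unless the result is "clean".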
Resuming an interrupted scan
If a scan is interrupted (Ctrl+C, SIGTERM, crash), it stays resumable:
sdog scan --resume abc123-...
Use sdog list-scans to find incomplete scan IDs.
What the results mean
| Status | What it means | What to do |
|---|---|---|
| ok | File opened and parsed successfully | Nothing |
| corrupt | File is structurally broken | Restore from backup |
| unreadable | Couldn't open the file at all | Check mount / permissions; usually not the file's fault |
| unsupported | Library version doesn't recognise this format variant | Update libraries; not flagged as corrupt |
| error | Validator crashed or timed out | Check sdog report --format json for details |
unreadable usually means a mount problem, not corruption. If you suddenly see many unreadable files, check your NAS connectivity before investigating individual files.
Supported formats
Photos: JPEG · PNG · TIFF · HEIC · WebP
RAW: CR2 · CR3 · NEF · ARW · ORF · RW2 · PEF · DNG · RAF · NRW
Documents: PDF · DOCX · XLSX · PPTX · DOC · XLS · PPT
Video: MP4 · MOV · MTS · M4V · MKV
Audio: MP3 · FLAC · WAV · AAC
Scheduled scanning
0 2 * * * sdog scan --config /data/config/config.yaml >> /data/logs/sdog.log 2>&1
On subsequent runs, only changed files are re-validated. A 100k-photo library might take an hour on first scan and two minutes after that.
Notifications
Get alerted when corrupt files are found.
Email:
notify_email: you@example.com
smtp_host: smtp.example.com
smtp_user: sdog@example.com
smtp_pass: "" # use SDOG_SMTP_PASS env var
Webhook (Gotify, Ntfy, Pushover, Slack):
webhook_url: https://gotify.example.com/message?token=abc
Alerts only fire on the first detection — no repeat notifications for the same broken file.
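The first-detection behaviour can be illustrated with the notified_at column from the database schema (NULL means not yet notified). This is a sketch, not the project's implementation: `take_unnotified_corrupt` is a hypothetical function, and it assumes a files table with at least path, status, and notified_at.

```python
import sqlite3
from datetime import datetime, timezone


def take_unnotified_corrupt(conn: sqlite3.Connection) -> list[str]:
    """Return corrupt paths that have not been alerted on, then mark them."""
    rows = conn.execute(
        "SELECT path FROM files "
        "WHERE status = 'corrupt' AND notified_at IS NULL ORDER BY path"
    ).fetchall()
    paths = [row[0] for row in rows]
    # Stamp the rows so the next scan does not alert on the same files again.
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "UPDATE files SET notified_at = ? WHERE path = ?",
        [(now, path) for path in paths],
    )
    conn.commit()
    return paths
```

Calling the function twice in a row returns the broken files once and an empty list the second time, which is the "no repeat notifications" property.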
AI agent integration (MCP)
SemanticDog has a built-in MCP server. Connect Claude or any MCP-compatible agent to query scan results and trigger scans conversationally.
Enable in config:
mcp_enabled: true
mcp_allow_write: true # lets agents trigger scans and reset records
Then start the server with the auth token set in the environment:
SDOG_MCP_AUTH_TOKEN=your-secret uvicorn semanticdog.server:app --port 9090
Add to Claude Code (~/.claude/settings.json):
{
"mcpServers": {
"semanticdog": {
"type": "sse",
"url": "http://localhost:9090/mcp/sse",
"headers": { "Authorization": "Bearer your-secret" }
}
}
}
Once connected, you can ask Claude things like "which photos are corrupt?" or "scan my 2024 folder and summarize the results".
Configuration
Config is loaded automatically from the first location found:
- ./config.yaml (current directory)
- ~/.config/semanticdog/config.yaml
- /data/config/config.yaml (Docker/NAS default)
Override with --config /path/to/config.yaml on any command.
paths:
- /mnt/photos
- /mnt/documents
db_path: ~/.local/share/semanticdog/state.db # or /data/state/state.db in Docker
workers: 4 # parallel validators
raw_workers: 2 # RAW uses more memory — keep lower than workers
schedule: "0 2 * * *"
Every option has a matching SDOG_* environment variable. Env vars always override the YAML file. Full reference is in config.example.yaml.
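A rough sketch of how an env-over-YAML override could behave. It assumes the SDOG_ prefix plus the upper-cased key, colon-separated values for list options (as noted for SDOG_PATHS), and a simple truthy-string rule for booleans. `apply_env_overrides` is a hypothetical function and the boolean parsing is an assumption; the real load_config() may coerce values differently.

```python
import os
from typing import Any, Mapping


def apply_env_overrides(
    config: dict[str, Any],
    env: Mapping[str, str] = os.environ,
) -> dict[str, Any]:
    """Return a copy of config with SDOG_* variables taking precedence."""
    merged = dict(config)
    for key, current in config.items():
        raw = env.get(f"SDOG_{key.upper()}")
        if raw is None:
            continue  # no override for this key; keep the YAML value
        # bool must be tested before int, because bool is a subclass of int.
        if isinstance(current, bool):
            merged[key] = raw.strip().lower() in {"1", "true", "yes", "on"}
        elif isinstance(current, int):
            merged[key] = int(raw)
        elif isinstance(current, list):
            merged[key] = [part for part in raw.split(":") if part]
        else:
            merged[key] = raw
    return merged
```

The type of the existing value drives the coercion, so the sketch only overrides keys that already have a default.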
HTTP API and Prometheus
uvicorn semanticdog.server:app --port 9090
- GET /metrics: Prometheus scrape endpoint
- POST /trigger: kick off a scan remotely (also accepts {"scope": "/mnt/photos/2024"})
- GET /status: current state and file counts as JSON
Troubleshooting
New camera RAW files show unsupported
LibRaw adds new cameras gradually. unsupported is not corruption — the file is fine, just unrecognised. Fix: pip install -U rawpy.
Many unreadable files suddenly
Almost always a mount going offline or a permission change. SemanticDog flags this as a suspected mount failure in the notification if more than half the scan is unreadable.
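The "more than half unreadable" rule can be written down directly. This is a sketch of the heuristic as described, not the project's code; `suspected_mount_failure` is a hypothetical name, and the input is assumed to be a status-to-count mapping for a single scan.

```python
def suspected_mount_failure(by_status: dict[str, int]) -> bool:
    """True when more than half of the scanned files were unreadable."""
    total = sum(by_status.values())
    if total == 0:
        return False  # nothing scanned, nothing to conclude
    return by_status.get("unreadable", 0) > total / 2
```

An exact 50/50 split does not trip the check, since the wording is "more than half".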
HEIC not validating
Needs pillow-heif: pip install pillow-heif. Run sdog check-deps to see everything that's missing at once.
Video not validating
Needs ffmpeg: apt install ffmpeg / brew install ffmpeg.
Moved your library to a new path
sdog db-export -o backup.json
sdog db-import -i backup.json --path-map /old/path:/new/path
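What a path map does to each stored path can be sketched as a prefix rewrite. The function `remap_path` is hypothetical, and matching on whole path components (so /old/path does not match /old/pathology) is an assumption about sensible behaviour rather than a documented guarantee of --path-map.

```python
from pathlib import PurePosixPath


def remap_path(path: str, path_map: dict[str, str]) -> str:
    """Rewrite path if it lives under one of the old prefixes in path_map."""
    p = PurePosixPath(path)
    # Try the longest old prefix first so nested mappings resolve predictably.
    for old in sorted(path_map, key=len, reverse=True):
        old_p = PurePosixPath(old)
        if p == old_p or old_p in p.parents:
            return str(PurePosixPath(path_map[old]) / p.relative_to(old_p))
    return path  # no mapping applies; leave the path untouched
```

For example, `remap_path("/old/path/2024/a.jpg", {"/old/path": "/new/path"})` yields `/new/path/2024/a.jpg`.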
AI Agent Reference — structured data for agents and tooling
Project identity
name: semanticdog
binary: sdog
module: semanticdog
python: >=3.12
entrypoint: semanticdog/cli.py
Repository layout
semanticdog/
cli.py CLI — all commands (typer)
config.py Config dataclass + load_config() + env override
db.py Database — SQLite WAL, all queries
scanner.py Scanner + walk_paths() + _validate_file() pebble worker
server.py FastAPI — /health /metrics /status /trigger + build_app()
notify.py ScanSummary, Notifier, SmtpNotifier, WebhookNotifier
mcp_server.py MCP SSE transport
exceptions.py ConfigError, DatabaseError, LockError
validators/
__init__.py registry: register(), get_validator(), all_extensions()
base.py BaseValidator, ValidationResult, DependencyReport
images.py JpegValidator PngValidator TiffValidator HeicValidator WebpValidator
raw.py RawValidator
documents.py PdfValidator OoxmlValidator OleValidator
media.py VideoValidator AudioValidator
tests/
fixtures/generators.py make_minimal_jpeg, make_corrupt_jpeg, make_minimal_png, ...
test_e2e.py 37 end-to-end tests (no mocks, real files)
test_server.py / test_scanner.py / test_db.py / test_notify.py / test_*.py
CLI exit codes
sdog scan: 0 = all OK · 1 = config/DB/scan error · 2 = corrupt or unreadable files found · 130 = interrupted (Ctrl+C)
sdog check-deps: 0 = all hard deps present · 1 = hard dep missing
HTTP API
GET /health → 200 {"status":"ok"}
GET /status → 200 {status, files_indexed, by_status, last_scan}
GET /metrics → 200 Prometheus text
POST /trigger → 200 {status:"complete", scan_id}
400 scope outside configured roots
409 scan already running
429 cooldown {retry_after_s}
503 not configured
GET /mcp/sse → SSE stream (requires mcp_enabled=true + mcp_auth_token)
Config keys → env vars
| Key | Env var | Default |
|---|---|---|
| paths | SDOG_PATHS (colon-sep) | [] |
| exclude | SDOG_EXCLUDE (colon-sep) | ["**/@eaDir/**", ...] |
| db_path | SDOG_DB_PATH | /data/state/state.db |
| workers | SDOG_WORKERS | 4 |
| raw_workers | SDOG_RAW_WORKERS | 2 |
| raw_decode_depth | SDOG_RAW_DECODE_DEPTH | structure |
| validation_timeout_s | SDOG_VALIDATION_TIMEOUT_S | 120 |
| force_recheck_days | SDOG_FORCE_RECHECK_DAYS | 90 |
| http_port | SDOG_HTTP_PORT | 9090 |
| notify_email | SDOG_NOTIFY_EMAIL | "" |
| smtp_pass | SDOG_SMTP_PASS | "" |
| webhook_url | SDOG_WEBHOOK_URL | "" |
| mcp_enabled | SDOG_MCP_ENABLED | false |
| mcp_auth_token | SDOG_MCP_AUTH_TOKEN | "" |
| mcp_allow_write | SDOG_MCP_ALLOW_WRITE | false |
| mcp_rate_limit_s | SDOG_MCP_RATE_LIMIT_S | 60 |
Database schema
files (
path TEXT PRIMARY KEY,
mtime REAL, size INTEGER,
status TEXT, -- ok|corrupt|unreadable|unsupported|error
error TEXT, suggested_action TEXT,
checked_at TEXT, -- ISO 8601
scan_id TEXT,
notified_at TEXT -- NULL = not yet notified
)
scans (
id TEXT PRIMARY KEY,
started_at TEXT, finished_at TEXT, -- finished_at NULL = incomplete/resumable
total INTEGER, corrupt INTEGER, unreadable INTEGER,
scope TEXT, -- NULL = all paths
files_per_sec REAL
)
scan_queue (
scan_id TEXT, path TEXT,
done INTEGER DEFAULT 0 -- 0 = pending, 1 = complete
)
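Because the state lives in plain SQLite, the schema above can also be queried directly. The two helpers below are a sketch with hypothetical names; they assume only the table and column names shown in the schema, including the convention that finished_at is NULL for an incomplete scan.

```python
import sqlite3


def status_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Count files per status (ok, corrupt, unreadable, ...)."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM files GROUP BY status"
    ).fetchall()
    return dict(rows)


def incomplete_scan_ids(conn: sqlite3.Connection) -> list[str]:
    """IDs of scans that never finished and can therefore be resumed."""
    rows = conn.execute(
        "SELECT id FROM scans WHERE finished_at IS NULL ORDER BY started_at"
    ).fetchall()
    return [row[0] for row in rows]
```

When pointing this at a live database, opening it read-only, for example `sqlite3.connect("file:/data/state/state.db?mode=ro", uri=True)`, avoids interfering with a running scan.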
Key internal APIs
from semanticdog.config import load_config
from semanticdog.db import Database
from semanticdog.scanner import Scanner
cfg = load_config("config.yaml") # YAML + env override
db = Database(cfg.db_path)
stats = Scanner(cfg, db).scan() # all paths → ScanStats
stats = Scanner(cfg, db).scan(["/sub"]) # scoped scan
stats = Scanner(cfg, db).scan(resume_scan_id="abc123-…") # resume interrupted scan
# stats.scan_id — ID of the scan just run (for resume)
db.get_corrupt_files(since="2025-01-01", ext="cr2", path_prefix="/mnt")
db.get_stats() # {"total": N, "total_size_bytes": N, "by_status": {...}}
db.get_format_counts() # [(ext, count), ...] sorted by count
db.get_stale_count(days=90)
db.get_top_errors(limit=5)
db.get_scan(scan_id) # single scan record dict
db.list_scans(limit=10)
db.export_json()
db.import_json(records, force=False, path_map={"/old": "/new"})
Running tests
uv run pytest # 427 tests
uv run pytest tests/test_e2e.py -v # E2E only (real files, no mocks)
Known limitations
- RAW unsupported ≠ corrupt: LibRaw doesn't cover all camera models
- HEIC: primary frame only; burst/live photo secondary frames skipped
- Sidecars (.XMP, .AAE): validated independently, no pair correlation
- verify-hashes command: not yet implemented