Verify subtitle files match video audio content
submatch
Verify that a subtitle file matches the audio content of a video.
Subtitle download tools (like subliminal) sometimes return correctly-timed but wrong-content subtitles — a different episode, a different release, or the wrong language track. submatch catches this by transcribing short audio segments with Whisper and comparing against the subtitle text using token F1 scoring.
submatch video.mkv subtitle.en.srt
PASS ✓ 0.61 (thr 0.35 · base · 5 segs)
lang audio=en · sub=en
sync no drift ✓
#1 00:04:12 0.68 ██████░░
#2 00:18:44 0.55 ████░░░░
Install
pip install submatch
ffmpeg is bundled automatically. Whisper model weights download on first run.
Usage
Single pair:
submatch video.mkv subtitle.en.srt
submatch video.mkv subtitle.pt.srt --model small --threshold 0.4 --verbose
submatch video.mkv subtitle.en.srt --no-sync --json report.json
Auto-discover — pass what you have:
submatch video.mkv # find all subtitles alongside the video
submatch subtitle.en.srt # find the video alongside the subtitle
submatch v1.mkv v2.mkv # each video finds its own subtitles
submatch s1.srt s2.srt # each subtitle finds its own video
submatch video.mkv s1.srt s2.srt # explicit subtitles for one video
Batch mode — directory of paired files:
submatch /media/movies/ # recursive by default; pairs each video with its subtitles
submatch /media/movies/ --compact # one line per pair
submatch /media/movies/ --json results.json # machine-readable JSON array
submatch /media/movies/ --no-recursive # flat directory only
Batch mode — one video against a subtitle directory:
submatch movie.mkv subs/ # scores every subtitle in subs/ against movie.mkv
Embedded subtitles — score subtitle tracks in the video container:
submatch --embedded movie.mkv
submatch --embedded /path/to/library/
Watch mode — monitor a directory for new pairs:
submatch --watch /media/movies/
submatch --watch /media/movies/ --sub-lang en --delete-failures
submatch --watch /media/movies/ --poll # for network mounts (NFS, SMB)
submatch --watch /media/movies/ --poll --interval 30
Filtering — process only specific subtitles:
submatch /media/shows/ --sub-lang pt # matches pt.srt, pt-BR.srt, pt-PT.srt
submatch /media/shows/ --sub-lang en --sub-lang pt-BR # multiple codes
submatch movie.mkv subs/ --filter "*.en.*" # glob on subtitle filename
submatch /media/shows/ --sub-lang pt --filter "*.srt" # both must pass
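The two filters above can be pictured as a pair of independent checks that must both pass. The sketch below is only an illustration of that rule, not submatch's actual code: the helper names `filename_lang` and `keep_subtitle` are hypothetical, the `<title>.<lang>.<ext>` naming layout is an assumption, and untagged files are simply rejected here, whereas the real tool infers their language from the text.

```python
from fnmatch import fnmatch
from pathlib import PurePath
from typing import Iterable, Optional


def filename_lang(name: str) -> Optional[str]:
    """Return the language tag from names like 'movie.pt-BR.srt', or None if untagged."""
    parts = PurePath(name).name.split(".")
    # Assumed layout: <title>.<lang>.<ext>, so the tag is the second-to-last field.
    return parts[-2] if len(parts) >= 3 else None


def keep_subtitle(name: str, sub_langs: Iterable[str] = (), glob: Optional[str] = None) -> bool:
    """Apply the --sub-lang prefix test and the --filter glob test; both must pass."""
    sub_langs = list(sub_langs)
    if sub_langs:
        lang = filename_lang(name)
        # Simplification: untagged files are dropped instead of being language-detected.
        if lang is None:
            return False
        if not any(lang.lower().startswith(code.lower()) for code in sub_langs):
            return False
    if glob is not None and not fnmatch(PurePath(name).name, glob):
        return False
    return True
```

With this sketch, `keep_subtitle("movie.pt-BR.srt", ["pt"])` is accepted because the tag starts with `pt`, while `keep_subtitle("movie.pt.ass", ["pt"], "*.srt")` is rejected by the glob even though the language matches.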
Cross-language matching
When a segment's detected audio language differs from the subtitle language, submatch automatically switches that segment from token F1 scoring to multilingual semantic similarity using paraphrase-multilingual-MiniLM-L12-v2. This happens per-segment, so dubbed or mixed-language files are handled correctly even if not every segment is cross-language. Cross-language segments use a default threshold of 0.20 (instead of the 0.35 default for same-language segments) because semantic similarity scores across language pairs are inherently lower even for correct matches.
Use --cross-threshold to tune the pass/fail cutoff for translated subtitles independently:
submatch movie.mkv movie.pt.srt --cross-threshold 0.5
The model is downloaded on first use (~90 MB) and cached by sentence-transformers.
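As a rough illustration of the per-segment switch described above, the following sketch picks a scorer and threshold from the two language codes. The function names are hypothetical, and comparing only the primary subtag (so that `pt` and `pt-BR` count as the same language) is an assumption rather than documented behaviour.

```python
SAME_LANG_THRESHOLD = 0.35   # documented --threshold default
CROSS_LANG_THRESHOLD = 0.20  # documented --cross-threshold default


def primary_subtag(tag: str) -> str:
    """Reduce 'pt-BR' or 'pt_BR' to 'pt' (assumed comparison rule)."""
    return tag.replace("_", "-").split("-")[0].lower()


def pick_scorer(audio_lang: str, sub_lang: str,
                threshold: float = SAME_LANG_THRESHOLD,
                cross_threshold: float = CROSS_LANG_THRESHOLD):
    """Return (scorer_name, cutoff) for one segment."""
    if primary_subtag(audio_lang) != primary_subtag(sub_lang):
        # Languages differ: score with multilingual sentence embeddings.
        return "semantic", cross_threshold
    # Same language: score with token F1.
    return "token_f1", threshold
```

Because the decision is made per segment, a dubbed file can mix both scorers within a single run.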
Supported subtitle formats
SRT, WebVTT, ASS/SSA (and any other format supported by pysubs2).
Image-based subtitles (VOBSUB .sub / PGS .sup) are supported via OCR. pytesseract is included with pip install submatch, but the Tesseract engine itself must be installed separately — see the Tesseract installation guide for your platform.
Language is detected automatically from the filename or video metadata; Tesseract's OSD is used as a fallback. Only the time windows that Whisper transcribes are OCR'd — not the full subtitle stream.
Language support
✓ = confirmed by integration tests · ~ = supported by underlying tools, not yet integration-tested
| Language | Audio | Subtitle |
|---|---|---|
| Arabic | ✓ | ✓ |
| Basque | ~ | ~ |
| Bulgarian | ✓ | ✓ |
| Catalan | ✓ | ✓ |
| Chinese (Simplified) | ✓ | ✓ |
| Croatian | ✓ | ✓ |
| Czech | ✓ | ✓ |
| Danish | ✓ | ✓ |
| Dutch | ✓ | ✓ |
| English | ✓ | ✓ |
| Estonian | ✓ | ✓ |
| Filipino | ~ | ~ |
| Finnish | ✓ | ✓ |
| French | ✓ | ✓ |
| Galician | ✓ | ✓ |
| German | ✓ | ✓ |
| Greek | ✓ | ✓ |
| Hebrew | ✓ | ✓ |
| Hindi | ✓ | ✓ |
| Hungarian | ✓ | ✓ |
| Indonesian | ✓ | ✓ |
| Italian | ✓ | ✓ |
| Japanese | ✓ | ✓ |
| Kannada | ✓ | ✓ |
| Korean | ✓ | ✓ |
| Latvian | ✓ | ✓ |
| Lithuanian | ✓ | ✓ |
| Malay | ✓ | ✓ |
| Malayalam | ~ | ✓ |
| Neapolitan | ✓ | ✓ |
| Norwegian | ✓ | ✓ |
| Polish | ✓ | ✓ |
| Portuguese | ✓ | ✓ |
| Portuguese (Brazil) | ✓ | ✓ |
| Romanian | ✓ | ✓ |
| Russian | ✓ | ✓ |
| Slovak | ✓ | ✓ |
| Slovenian | ✓ | ✓ |
| Spanish | ✓ | ✓ |
| Swedish | ✓ | ✓ |
| Tamil | ~ | ~ |
| Telugu | ~ | ~ |
| Thai | ✓ | ✓ |
| Turkish | ✓ | ✓ |
| Ukrainian | ✓ | ✓ |
| Vietnamese | ✓ | ✓ |
Audio: Whisper can transcribe the spoken language. Chinese (Simplified) is tested via Shanghainese and Guiyangese speakers; standard Mandarin is expected to work. Basque and Filipino consistently score below the cross-language threshold with the base model across all tested content. Tamil and Telugu score below threshold in most content but pass in some; use --model small or larger for more reliable results with these languages.
Subtitle: submatch can score a subtitle in that language using token F1 (same-language) or multilingual sentence embeddings (cross-language, via paraphrase-multilingual-MiniLM-L12-v2).
Options
| Flag | Default | Description |
|---|---|---|
| --model | base | Whisper model: tiny, base, small, medium, large |
| --threshold | 0.35 | Pass/fail confidence cutoff (0–1) |
| --cross-threshold | 0.20 | Pass/fail threshold for cross-language pairs |
| --segments | auto | Number of audio segments to sample |
| --audio-track | 0 | Audio track to use: integer index (0-based) or comma-separated language preference list (jp,en,pt). Default: track 0. |
| --embedded | off | Score embedded subtitle tracks in the video container instead of external files |
| --language | auto | Expected audio language (e.g. en, pt) |
| --drift-threshold | 2.0 | Seconds of timing offset before flagging as drift |
| --no-sync | off | Skip ffsubsync timing drift check |
| --keep-synced | off | Save timing-corrected subtitle to disk |
| --no-recursive | off | Do not recurse into subdirectories when expanding directories (default: recursive) |
| --sub-lang CODE | off | Keep only subtitles whose filename language code starts with CODE (repeatable; infers from text for untagged external files; always includes untagged embedded tracks) |
| --filter GLOB | off | Keep only subtitles whose filename matches the glob (e.g. *.en.*) |
| --json FILE | off | Write JSON report to FILE |
| --csv FILE | off | Write CSV report to FILE |
| --html FILE | off | Write self-contained HTML report to FILE |
| --compact | off | One-line-per-pair summary in batch mode |
| --verbose | off | Show subtitle and transcription text per segment |
| --device | auto | Whisper inference device: cpu, mps (Apple Silicon), cuda (NVIDIA), auto (CUDA > CPU; use --device mps explicitly on Apple Silicon) |
| --workers | auto | Parallel pairs in batch mode; auto selects up to 4 |
| --delete-failures | off | Delete subtitle files that fail the match check |
| --resync | off | On DRIFT (drift detected), copy synced subtitle over original and re-score |
| --pass-unsure | off | Exit 0 for UNSURE results (not enough transcription data) |
| --no-cache | off | Disable transcription cache and use subtitle-driven segment selection |
| --clear-cache | off | Delete all cached transcriptions and exit |
| --timing | off | Print per-phase timing breakdown (single-pair mode only) |
| --watch | off | Monitor a directory for new video/subtitle pairs and score them as they appear |
| --poll | off | Use polling instead of native filesystem events (required for network mounts) |
| --interval N | 10 | Seconds between directory scans in --poll mode |
Segment count auto-selection: < 30 min → 5, 30–90 min → 8, > 90 min → 12.
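The auto-selection rule can be written as a small function. This is a sketch of the rule as stated; the function name is hypothetical, and treating exactly 30 and exactly 90 minutes as part of the middle band is an assumption about the boundary handling.

```python
def auto_segment_count(duration_minutes: float) -> int:
    """Map video duration to the number of sampled segments."""
    if duration_minutes < 30:
        return 5
    if duration_minutes <= 90:  # assumed: both 30 and 90 fall in the middle band
        return 8
    return 12
```

Under this reading, a 22-minute episode is sampled 5 times and a 135-minute film 12 times.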
Breaking change: --json now requires a filename. Bare --json is a parse error. Update scripts from --json to --json output.json. The same applies to --csv and --html.
Configuration
submatch reads defaults from two TOML config files, merged in order:
- ~/.config/submatch/config.toml: personal defaults applied everywhere
- ./submatch.toml: directory-level defaults (overrides user config)
CLI flags always override both.
Example ~/.config/submatch/config.toml:
model = "small"
threshold = 0.40
language = "en"
workers = 2
Configurable flags: model, threshold, segments, language, no_sync, keep_synced, no_recursive, sub_lang, filter, device, workers, delete_failures, cross_threshold, resync, pass_unsure, drift_threshold, audio_track, cache_ttl_days, cache_max_mb, cache_dir
Note: Boolean flags set to true in config (e.g. no_sync = true) cannot be overridden back to false via the CLI. Remove the line from your config instead.
Warning: delete_failures = true will silently delete subtitle files on every run. Use with care.
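The precedence order in this section (user config, then directory config, then CLI flags) amounts to a layered dictionary merge. The sketch below assumes the two TOML files have already been parsed into dicts (for example with the standard-library `tomllib` on Python 3.11+); `merge_layers` is a hypothetical name, and modelling an absent CLI flag as `None` is an assumption chosen to mirror the note about boolean flags.

```python
def merge_layers(user_cfg: dict, dir_cfg: dict, cli_args: dict) -> dict:
    """Merge settings so that later layers win: user < directory < CLI."""
    merged = {**user_cfg, **dir_cfg}
    # Only flags actually passed on the command line override the config.
    # An absent boolean flag is "unset" (None), not False, which is why a
    # config value of true cannot be switched back off from the CLI.
    merged.update({key: value for key, value in cli_args.items() if value is not None})
    return merged


settings = merge_layers(
    {"model": "small", "threshold": 0.40, "no_sync": True},  # ~/.config/submatch/config.toml
    {"threshold": 0.5},                                      # ./submatch.toml
    {"model": "base", "no_sync": None},                      # CLI: --model base, no --no-sync
)
```

Here `settings` keeps `no_sync` as true from the user config, takes `threshold` from the directory config, and takes `model` from the CLI.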
Telemetry
submatch reports crashes and unexpected pipeline errors to Sentry to help improve
reliability. No file paths or personal data are transmitted — all path strings are
replaced with <path> before sending.
To opt out, set an environment variable:
export SUBMATCH_NO_TELEMETRY=1
Or add to ~/.config/submatch/config.toml or ./submatch.toml:
telemetry = false
How it works
- Sync: runs ffs (ffsubsync) to correct timing drift; flags offsets > 2 s
- Sample: divides the video into N zones (skipping first/last 5%). By default, uses ffmpeg silencedetect to find speech-rich windows (audio-driven mode); picks the best 30-second window per zone based on speech coverage, with a quality gate that retries if Whisper confidence is low. Use --no-cache for the original subtitle-driven path (highest word-count window per zone).
- Cache: validated transcriptions are stored in ~/.cache/submatch/ keyed by video path, mtime, model, and segment count. Subsequent runs on the same video skip audio extraction and transcription entirely. Cache is evicted after 30 days or when it exceeds 200 MB (LRU). Use --no-cache to bypass or --clear-cache to wipe.
- Transcribe: extracts each window as a 16 kHz mono WAV and transcribes with Whisper
- Score: normalises both texts (lowercase, strip punctuation, remove fillers), computes token F1 per segment, returns a weighted average
- Report: prints confidence, language signals, and drift; exits 0/1/2
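To make the Sample and Transcribe steps more concrete, here is a minimal sketch of how the zones and an extraction command could be derived. Both function names are hypothetical, equal-width zones are an assumption, and the ffmpeg argument list is only one plausible way to cut a 16 kHz mono WAV; submatch may invoke ffmpeg differently.

```python
from typing import List, Tuple


def zone_bounds(duration_s: float, n_zones: int, skip_fraction: float = 0.05) -> List[Tuple[float, float]]:
    """Split the video into n equal zones, ignoring the first and last 5%."""
    start = duration_s * skip_fraction
    end = duration_s * (1 - skip_fraction)
    width = (end - start) / n_zones
    return [(start + i * width, start + (i + 1) * width) for i in range(n_zones)]


def extract_cmd(video: str, start_s: float, out_wav: str, length_s: float = 30.0) -> List[str]:
    """Build an ffmpeg argument list for one mono, 16 kHz window."""
    return [
        "ffmpeg",
        "-ss", f"{start_s:.3f}",   # seek to the window start
        "-t", f"{length_s:.3f}",   # window length in seconds
        "-i", video,
        "-ac", "1",                # mono
        "-ar", "16000",            # 16 kHz sample rate
        out_wav,
    ]
```

For a 1000-second video and 5 zones, the zones cover 50 s to 950 s in 180-second slices; the 30-second window is then chosen somewhere inside each zone.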
The default threshold of 0.35 is intentionally low — subtitle text often paraphrases rather than quoting verbatim.
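For readers unfamiliar with token F1, the sketch below shows the general technique on a single segment. It is not submatch's implementation: the filler list is a hypothetical stand-in, the punctuation rule is a simple regex, and the weighted average across segments is left out because the weights are not specified here.

```python
import re
from collections import Counter
from typing import List

FILLERS = {"uh", "um", "er", "hmm"}  # hypothetical filler list, for illustration only


def normalise(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and drop filler tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [token for token in text.split() if token not in FILLERS]


def token_f1(subtitle_text: str, transcript_text: str) -> float:
    """Harmonic mean of token precision and recall, using bag-of-words overlap."""
    ref = Counter(normalise(subtitle_text))
    hyp = Counter(normalise(transcript_text))
    overlap = sum((ref & hyp).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because word order is ignored and only shared tokens count, a paraphrased subtitle still earns partial credit, which is consistent with the low default cutoff.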
States and exit codes
Each pair is assigned one of four states:
| State | Meaning | Exit code |
|---|---|---|
| PASS | Content matches, no timing drift | 0 |
| DRIFT | Content matches, but timing drift detected | 1 (use --resync to fix in place) |
| FAIL | Content does not match | 1 |
| UNSURE | Not enough transcription data to decide | 1 (use --pass-unsure to exit 0) |
| — | Error (missing dependency, unreadable file, no audio track) | 2 |
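For scripting around submatch, the table above reduces to a small lookup. The sketch covers a single pair only; how exit codes are aggregated across a batch is not described here, and the function name and the `"ERROR"` label are hypothetical.

```python
EXIT_CODES = {"PASS": 0, "DRIFT": 1, "FAIL": 1, "UNSURE": 1}


def exit_code(state: str, pass_unsure: bool = False) -> int:
    """Translate a single-pair state into the documented process exit code."""
    if state == "UNSURE" and pass_unsure:
        return 0  # --pass-unsure downgrades UNSURE to success
    # Anything that is not one of the four states is treated as an error.
    return EXIT_CODES.get(state, 2)
```

A wrapper script can therefore treat exit code 1 as "subtitle needs attention" and exit code 2 as "the check itself could not run".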
Acknowledgements
submatch is a complement to the existing subtitle ecosystem, not a replacement for it. It wouldn't exist without:
- openai/whisper — the speech recognition engine that powers transcription
- smacke/ffsubsync — timing drift correction used before scoring
- tkarabela/pysubs2 — multi-format subtitle parsing (SRT, VTT, ASS/SSA)
- UKPLab/sentence-transformers — multilingual embeddings for cross-language scoring
- Diaoul/subliminal and morpheus65535/bazarr — the subtitle download tools that
submatch is designed to work alongside
Limitations
- Runs Whisper locally — no API key needed. Model weights download on first run.
- Cross-language scoring uses multilingual sentence embeddings and is less precise than same-language token F1 — consider lowering
--cross-threshold if you get too many false negatives.