Verify subtitle files match video audio content

These details have not been verified by PyPI

Project description

submatch

Verify that a subtitle file matches the audio content of a video.

Subtitle download tools (like subliminal) sometimes return correctly-timed but wrong-content subtitles — a different episode, a different release, or the wrong language track. submatch catches this by transcribing short audio segments with Whisper and comparing against the subtitle text using token F1 scoring.

submatch video.mkv subtitle.en.srt

PASS ✓  0.61  (thr 0.35 · base · 5 segs)
lang  audio=en  ·  sub=en
sync  no drift  ✓
  #1  00:04:12  0.68  ██████░░
  #2  00:18:44  0.55  ████░░░░

Install

pip install submatch

ffmpeg is bundled automatically. Whisper model weights download on first run.

Usage

Single pair:

submatch video.mkv subtitle.en.srt
submatch video.mkv subtitle.pt.srt --model small --threshold 0.4 --verbose
submatch video.mkv subtitle.en.srt --no-sync --json

Auto-discover — pass what you have:

submatch video.mkv              # find all subtitles alongside the video
submatch subtitle.en.srt        # find the video alongside the subtitle
submatch v1.mkv v2.mkv          # each video finds its own subtitles
submatch s1.srt s2.srt          # each subtitle finds its own video
submatch video.mkv s1.srt s2.srt  # explicit subtitles for one video

Batch mode — directory of paired files:

submatch /media/movies/            # recursive by default; pairs each video with its subtitles
submatch /media/movies/ --compact  # one line per pair
submatch /media/movies/ --json     # machine-readable JSON array
submatch /media/movies/ --no-recursive  # flat directory only

Batch mode — one video against a subtitle directory:

submatch movie.mkv subs/           # scores every subtitle in subs/ against movie.mkv

Filtering — process only specific subtitles:

submatch /media/natal/ --sub-lang pt          # matches pt.srt, pt-BR.srt, pt-PT.srt
submatch /media/natal/ --sub-lang en --sub-lang pt-BR   # multiple codes
submatch movie.mkv subs/ --filter "*.en.*"    # glob on subtitle filename
submatch /media/natal/ --sub-lang pt --filter "*.srt"   # both must pass
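The two filters can be pictured with a small sketch. It assumes the language tag is the dot-separated token just before the extension (as in movie.pt-BR.srt), that --sub-lang is a case-insensitive prefix test, and that --filter is a shell-style glob on the filename. The helper names are hypothetical and this is not submatch's actual code; in particular, it does not reproduce the text-based language inference used for untagged files.

```python
from fnmatch import fnmatch
from pathlib import PurePath


def filename_lang(path):
    """Return the tag before the extension (movie.pt-BR.srt -> 'pt-BR'), or None."""
    parts = PurePath(path).name.split(".")
    # Assumption: name.<lang>.<ext>; anything shorter is treated as untagged.
    return parts[-2] if len(parts) >= 3 else None


def keep_subtitle(path, sub_langs=(), glob=None):
    """Keep a subtitle only if every filter that was supplied passes."""
    name = PurePath(path).name
    if sub_langs:
        lang = filename_lang(path)
        # Untagged files are simply rejected here (the real tool infers from text).
        if lang is None:
            return False
        if not any(lang.lower().startswith(code.lower()) for code in sub_langs):
            return False
    if glob is not None and not fnmatch(name, glob):
        return False
    return True
```

For example, keep_subtitle("movie.pt-BR.srt", ["pt"]) keeps the file, while keep_subtitle("movie.pt.ass", ["pt"], "*.srt") rejects it because the glob does not match.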

Cross-language matching

When the subtitle language differs from the audio language (e.g. English audio with Portuguese subtitles), submatch automatically switches from token F1 scoring to multilingual semantic similarity using paraphrase-multilingual-MiniLM-L12-v2. The score is normalized so the same --threshold applies to both same-language and cross-language pairs.

Use --cross-threshold to set a separate pass/fail cutoff for cross-language pairs; it defaults to the value of --threshold:

submatch movie.mkv movie.pt.srt --cross-threshold 0.5

The model is downloaded on first use (~90 MB) and cached by sentence-transformers.
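As a rough illustration of semantic similarity scoring, the sketch below compares two embedding vectors by cosine similarity and maps the result into the 0 to 1 range. How submatch actually normalizes its score is not documented here; clipping negative values to zero is an assumption, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_language_score(subtitle_vec, transcript_vec):
    """Map cosine similarity into [0, 1] so it can be compared with a threshold."""
    # Assumption: negative similarities are clipped to 0 rather than rescaled.
    return max(0.0, min(1.0, cosine(subtitle_vec, transcript_vec)))
```

In practice the vectors would come from an embedding model, for example the encode method of a sentence-transformers SentenceTransformer loaded with paraphrase-multilingual-MiniLM-L12-v2.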

Supported subtitle formats

SRT, WebVTT, ASS/SSA (and any other format supported by pysubs2).

Options

Flag	Default	Description
`--model`	`base`	Whisper model: `tiny`, `base`, `small`, `medium`, `large`
`--threshold`	`0.35`	Pass/fail confidence cutoff (0–1)
`--cross-threshold`	same as `--threshold`	Pass/fail threshold for cross-language pairs
`--segments`	auto	Number of audio segments to sample
`--language`	auto	Expected audio language (e.g. `en`, `pt`)
`--drift-threshold`	`2.0`	Seconds of timing offset before flagging as drift
`--no-sync`	off	Skip ffsubsync timing drift check
`--keep-synced`	off	Save timing-corrected subtitle to disk
`--no-recursive`	off	Do not recurse into subdirectories when expanding directories (default: recursive)
`--sub-lang CODE`	off	Keep only subtitles whose filename language code starts with CODE (repeatable; infers from text for untagged files)
`--filter GLOB`	off	Keep only subtitles whose filename matches the glob (e.g. `*.en.*`)
`--json`	off	Machine-readable JSON output
`--compact`	off	One-line-per-pair summary in batch mode
`--verbose`	off	Show subtitle and transcription text per segment
`--device`	`auto`	Whisper inference device: `cpu`, `mps` (Apple Silicon), `cuda` (NVIDIA), `auto` (CUDA > MPS > CPU)
`--workers`	`auto`	Parallel pairs in batch mode; auto selects up to 4
`--delete-failures`	off	Delete subtitle files that fail the match check
`--resync`	off	On DRIFT (drift detected), copy synced subtitle over original and re-score
`--pass-unsure`	off	Exit 0 for UNSURE results (not enough transcription data)

Segment count auto-selection: < 30 min → 5, 30–90 min → 8, > 90 min → 12.
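The auto-selection rule can be written as a small function. This is a sketch of the thresholds stated above, not the package's code; the function name is hypothetical, and treating exactly 30 and exactly 90 minutes as part of the middle band follows the "30–90 min" wording.

```python
def auto_segments(duration_seconds):
    """Pick a segment count from the video duration, per the documented bands."""
    minutes = duration_seconds / 60
    if minutes < 30:
        return 5
    if minutes <= 90:
        return 8
    return 12
```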

How it works

Sync — runs ffs (ffsubsync) to correct timing drift; flags offsets > 2 s
Sample — divides the video into N zones (skipping first/last 5%), picks the 30-second window with the most subtitle words per zone
Transcribe — extracts each window as a 16 kHz mono WAV and transcribes with Whisper
Score — normalises both texts (lowercase, strip punctuation, remove fillers), computes token F1 per segment, returns a weighted average
Report — prints confidence, language signals, and drift; exits 0/1/2
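The Sample step can be sketched as follows. This is one plausible reading of the description, not the actual implementation: it assumes cues are (start, end, text) tuples in seconds, that candidate windows begin at cue start times, and that a cue counts toward a window when it starts inside it. The function name and parameters are hypothetical.

```python
def pick_windows(cues, duration, n_zones, window=30.0, margin=0.05):
    """Return one window start per zone, or None for zones with no subtitle words."""
    lo, hi = duration * margin, duration * (1 - margin)  # skip first/last 5%
    zone_len = (hi - lo) / n_zones
    picks = []
    for z in range(n_zones):
        z_start = lo + z * zone_len
        z_end = z_start + zone_len
        best_start, best_words = None, 0
        for start, _end, _text in cues:
            if not (z_start <= start < z_end):
                continue
            # Clamp so the window does not run past the end of the zone.
            w_start = min(start, max(z_start, z_end - window))
            words = sum(
                len(text.split())
                for s, _e, text in cues
                if w_start <= s < w_start + window
            )
            if words > best_words:
                best_start, best_words = w_start, words
        picks.append(best_start)
    return picks
```

Choosing the densest window per zone favours dialogue-heavy stretches, which gives the transcription step more text to compare.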

The default threshold of 0.35 is intentionally low — subtitle text often paraphrases rather than quoting verbatim.
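A minimal sketch of token F1 scoring shows why paraphrased subtitles score well below 1.0. The filler list, the punctuation handling, and the choice of weights for the average are assumptions made for illustration; submatch's own normalisation and weighting may differ.

```python
import re
from collections import Counter

FILLERS = {"uh", "um", "er", "ah", "hmm"}  # hypothetical filler list


def normalise(text):
    """Lowercase, replace punctuation with spaces, and drop filler words."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in FILLERS]


def token_f1(subtitle_text, transcript_text):
    """Harmonic mean of token precision and recall between the two texts."""
    sub, hyp = normalise(subtitle_text), normalise(transcript_text)
    if not sub or not hyp:
        return 0.0
    overlap = sum((Counter(sub) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(sub)
    return 2 * precision * recall / (precision + recall)


def weighted_average(scores, weights):
    """Combine per-segment scores; the weights (e.g. token counts) are an assumption."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total if total else 0.0
```

With this sketch, token_f1("I am not going there", "um I'm not going") shares only three tokens between the two texts, so the pair scores about 0.67 even though the meaning is the same.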

States and exit codes

Each pair is assigned one of four states:

State	Meaning	Exit code
`PASS`	Content matches, no timing drift	`0`
`DRIFT`	Content matches, but timing drift detected	`1` (use `--resync` to fix in place)
`FAIL`	Content does not match	`1`
`UNSURE`	Not enough transcription data to decide	`1` (use `--pass-unsure` to exit `0`)
—	Error (missing dependency, unreadable file, no audio track)	`2`
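A script that wraps submatch can branch on the exit code. The sketch below only relies on the exit codes in the table and the single-pair command shown under Usage; the function names and the wording of the messages are hypothetical.

```python
import subprocess

MEANING = {
    0: "pass",
    1: "needs attention (DRIFT, FAIL, or UNSURE)",
    2: "error",
}


def interpret(returncode):
    """Translate a submatch exit code into a short description."""
    return MEANING.get(returncode, "unknown")


def check(video, subtitle, extra_args=()):
    """Run submatch on one pair and describe the outcome (requires submatch on PATH)."""
    proc = subprocess.run(["submatch", video, subtitle, *extra_args])
    return interpret(proc.returncode)
```

Note that exit code 0 can also mean UNSURE when --pass-unsure is given, so a caller that needs to tell the states apart should read the printed state as well.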

Acknowledgements

submatch is a complement to the existing subtitle ecosystem, not a replacement for it. It wouldn't exist without:

openai/whisper — the speech recognition engine that powers transcription
smacke/ffsubsync — timing drift correction used before scoring
tkarabela/pysubs2 — multi-format subtitle parsing (SRT, VTT, ASS/SSA)
UKPLab/sentence-transformers — multilingual embeddings for cross-language scoring
Diaoul/subliminal and morpheus65535/bazarr — the subtitle download tools that submatch is designed to work alongside

Limitations

Requires a local Whisper install (pip install openai-whisper). No API key needed.
Cross-language scoring uses multilingual sentence embeddings and is less precise than same-language token F1 — consider lowering --cross-threshold if you get too many false negatives.

Release history

0.7.0 (Jun 4, 2026)
0.6.0 (Jun 3, 2026)
0.5.0 (Jun 1, 2026)
0.4.0 (May 31, 2026)
0.3.0 (May 29, 2026)
0.2.0 (May 29, 2026), this version
0.1.0 (May 28, 2026)

Download files

Source Distribution

submatch-0.2.0.tar.gz (40.9 kB)

Uploaded May 29, 2026 Source

Built Distribution

submatch-0.2.0-py3-none-any.whl (24.5 kB)

Uploaded May 29, 2026 Python 3

File details

Details for the file submatch-0.2.0.tar.gz.

File metadata

Download URL: submatch-0.2.0.tar.gz
Upload date: May 29, 2026
Size: 40.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for submatch-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`122a32acccf95ae837e07a8eb443af028ca7ffd69235800c910e04570eeb7811`
MD5	`3f314b34524befffff76765a7232ff1f`
BLAKE2b-256	`b1f61a925c56d75d98a674b965f30b2ba6bab5e417597b58ec37981ac1799f02`

File details

Details for the file submatch-0.2.0-py3-none-any.whl.

File metadata

Download URL: submatch-0.2.0-py3-none-any.whl
Upload date: May 29, 2026
Size: 24.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for submatch-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`695dca14d48995f9ebd65740505c97616f056506a97c7f66a15e9d746a44f626`
MD5	`e8e26748953956037ec111cb7d3f9aa6`
BLAKE2b-256	`d6edc509f60dd5bff88ae117fd369919be27d44c0982492d5045313d23e7a4a7`
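
A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a generic sketch using hashlib, with hypothetical function names; it is not part of submatch.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """True if the file's digest equals the published one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published("submatch-0.2.0.tar.gz", "<digest from the table>") should return True for an intact download of that file.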
