Verify subtitle files match video audio content

submatch

Verify that a subtitle file matches the audio content of a video.

Subtitle download tools (like subliminal) sometimes return correctly-timed but wrong-content subtitles — a different episode, a different release, or the wrong language track. submatch catches this by transcribing short audio segments with Whisper and comparing against the subtitle text using token F1 scoring.

submatch video.mkv subtitle.en.srt

PASS ✓  0.61  (thr 0.35 · base · 5 segs)
lang  audio=en  ·  sub=en
sync  no drift  ✓
  #1  00:04:12  0.68  ██████░░
  #2  00:18:44  0.55  ████░░░░

Install

pip install -e .

System dependencies (must be on PATH):

# macOS
brew install ffmpeg
pip install ffsubsync   # provides the 'ffs' command

Whisper model weights download automatically on first run.

Usage

Single file:

submatch video.mkv subtitle.en.srt
submatch video.mkv subtitle.pt.srt --model small --threshold 0.4 --verbose
submatch video.mkv subtitle.en.srt --no-sync --json

Batch mode — directory of paired files:

submatch /media/movies/            # pairs each video with its same-stem subtitle
submatch /media/movies/ --compact  # one line per pair
submatch /media/movies/ --json     # machine-readable JSON array

Batch mode — one video against a subtitle directory:

submatch movie.mkv subs/           # scores every subtitle in subs/ against movie.mkv

Recursive — walk nested directory trees:

submatch /media/series/ --recursive          # Plex/Kodi library layout
submatch movie.mkv subs/ -r                  # recurse into subs/ subdirectories

Filtering — process only specific subtitles:

submatch /media/natal/ --sub-lang pt          # matches pt.srt, pt-BR.srt, pt-PT.srt
submatch /media/natal/ --sub-lang en --sub-lang pt-BR   # multiple codes
submatch movie.mkv subs/ --filter "*.en.*"    # glob on subtitle filename
submatch /media/natal/ --sub-lang pt --filter "*.srt"   # both must pass
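
The filtering rules above can be sketched in a few lines of Python. This is an illustration only, not submatch's actual code: `filename_lang` and `keep_subtitle` are hypothetical names, and the sketch assumes the language tag is the filename component just before the extension. It also rejects untagged files, whereas submatch infers their language from the text.

```python
from fnmatch import fnmatch
from pathlib import Path


def filename_lang(path):
    """Return the language tag from a name like 'movie.pt-BR.srt', or None."""
    parts = Path(path).name.split(".")
    # Assumption: the tag is the component just before the extension.
    return parts[-2] if len(parts) >= 3 else None


def keep_subtitle(path, sub_langs=(), glob=None):
    """Prefix-match the language tag and glob-match the filename; both must pass."""
    if sub_langs:
        tag = filename_lang(path)
        if tag is None:
            return False  # the real tool infers the language from text instead
        if not any(tag.lower().startswith(code.lower()) for code in sub_langs):
            return False
    if glob is not None and not fnmatch(Path(path).name, glob):
        return False
    return True
```

With these definitions, `keep_subtitle("natal.pt-BR.srt", sub_langs=["pt"])` is kept, while `keep_subtitle("natal.pt.ass", sub_langs=["pt"], glob="*.srt")` is dropped because the glob does not match.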

Cross-language matching

When the subtitle language differs from the audio language (e.g. English audio with Portuguese subtitles), submatch automatically switches from token F1 scoring to multilingual semantic similarity using paraphrase-multilingual-MiniLM-L12-v2. The score is normalized so the same --threshold applies to both same-language and cross-language pairs.

Use --cross-threshold to tune the pass/fail cutoff for translated subtitles independently:

submatch movie.mkv movie.pt.srt --cross-threshold 0.5

The model is downloaded on first use (~90 MB) and cached by sentence-transformers.
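
As a rough sketch of how the two cutoffs interact: `--cross-threshold` falls back to `--threshold` when it is not given, and only applies when the audio and subtitle languages differ. The function name and the comparison on the primary language subtag (`pt` for `pt-BR`) are assumptions made for illustration, not confirmed behaviour.

```python
def effective_threshold(audio_lang, sub_lang, threshold=0.35, cross_threshold=None):
    """Pick the pass/fail cutoff for a pair, given the detected languages."""
    # Assumption: languages are compared on their primary subtag only.
    same_language = audio_lang.split("-")[0].lower() == sub_lang.split("-")[0].lower()
    if same_language or cross_threshold is None:
        return threshold
    return cross_threshold
```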

Supported subtitle formats

SRT, WebVTT, ASS/SSA (and any other format supported by pysubs2).

Options

| Flag | Default | Description |
|---|---|---|
| --model | base | Whisper model: tiny, base, small, medium, large |
| --threshold | 0.35 | Pass/fail confidence cutoff (0–1) |
| --cross-threshold | same as --threshold | Pass/fail threshold for cross-language pairs |
| --segments | auto | Number of audio segments to sample |
| --language | auto | Expected audio language (e.g. en, pt) |
| --no-sync | off | Skip ffsubsync timing drift check |
| --keep-synced | off | Save timing-corrected subtitle to disk |
| --recursive, -r | off | Walk nested directories in batch mode |
| --sub-lang CODE | off | Keep only subtitles whose filename language code starts with CODE (repeatable; infers from text for untagged files) |
| --filter GLOB | off | Keep only subtitles whose filename matches the glob (e.g. *.en.*) |
| --json | off | Machine-readable JSON output |
| --compact | off | One-line-per-pair summary in batch mode |
| --verbose | off | Show subtitle and transcription text per segment |
| --device | auto | Whisper inference device: cpu, mps (Apple Silicon), cuda (NVIDIA), auto |
| --workers | auto | Parallel pairs in batch mode; auto selects 1 for GPU, up to 4 for CPU |
| --delete-failures | off | Delete subtitle files that fail the match check |
| --resync | off | On WARN (drift detected), copy synced subtitle over original and re-score |
| --pass-unsure | off | Exit 0 for UNSURE results (not enough transcription data) |

Segment count auto-selection: < 30 min → 5, 30–90 min → 8, > 90 min → 12.
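
The auto-selection rule can be written as a small function. This is a minimal sketch; the name `auto_segments` is hypothetical, and it assumes the 30 and 90 minute boundaries both fall in the middle bucket, as the ranges above suggest.

```python
def auto_segments(duration_seconds):
    """Map video duration to the number of audio segments to sample."""
    minutes = duration_seconds / 60
    if minutes < 30:
        return 5
    if minutes <= 90:
        return 8
    return 12
```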

How it works

  1. Sync — runs ffs (ffsubsync) to correct timing drift; flags offsets > 2 s
  2. Sample — divides the video into N zones (skipping first/last 5%), picks the 30-second window with the most subtitle words per zone
  3. Transcribe — extracts each window as a 16 kHz mono WAV and transcribes with Whisper
  4. Score — normalises both texts (lowercase, strip punctuation, remove fillers), computes token F1 per segment, returns a weighted average
  5. Report — prints confidence, language signals, and drift; exits 0/1/2
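
The sampling step (step 2) might look roughly like the sketch below. It is a simplified illustration, not the tool's implementation. Subtitle cues are assumed to be `(start_seconds, end_seconds, text)` tuples, `pick_windows` is a hypothetical name, and candidate windows are assumed to start at cue boundaries.

```python
def pick_windows(cues, duration, n_zones, window=30.0, margin=0.05):
    """Pick, per zone, the window containing the most subtitle words.

    Returns a list of (start, end, word_count) tuples, one per zone that
    contains at least one usable window.
    """
    lo, hi = duration * margin, duration * (1 - margin)  # skip first/last 5%
    zone_len = (hi - lo) / n_zones
    windows = []
    for z in range(n_zones):
        z_start = lo + z * zone_len
        z_end = z_start + zone_len
        best_start, best_words = None, 0
        # Candidates start at each cue that begins inside the zone and
        # whose full window still fits inside it.
        for start, _end, _text in cues:
            if not (z_start <= start and start + window <= z_end):
                continue
            words = sum(
                len(text.split())
                for s, e, text in cues
                if start <= s and e <= start + window
            )
            if words > best_words:
                best_start, best_words = start, words
        if best_start is not None:
            windows.append((best_start, best_start + window, best_words))
    return windows
```

Each selected window would then be extracted as 16 kHz mono audio and passed to Whisper, as described in step 3.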

The default threshold of 0.35 is intentionally low — subtitle text often paraphrases rather than quoting verbatim.
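
To make the scoring step concrete, here is a minimal sketch of token F1 over normalised text. The filler list, the function names, and the choice of weights for the average are assumptions; the source only says that fillers are removed and that the per-segment scores are combined as a weighted average.

```python
import re
from collections import Counter

FILLERS = {"uh", "um", "er", "ah", "hmm"}  # hypothetical filler list


def normalise(text):
    """Lowercase, strip punctuation, and drop filler words."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [tok for tok in text.split() if tok not in FILLERS]


def token_f1(subtitle_text, transcript_text):
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    ref = Counter(normalise(subtitle_text))
    hyp = Counter(normalise(transcript_text))
    overlap = sum((ref & hyp).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def weighted_average(scores, weights):
    """Combine per-segment scores; weights could be e.g. subtitle word counts."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total if total else 0.0
```

Because the score only counts shared tokens, a subtitle that paraphrases the dialogue loses points on both precision and recall, which is one reason a low cutoff such as 0.35 is reasonable.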

States and exit codes

Each pair is assigned one of four states:

| State | Meaning | Exit code |
|---|---|---|
| PASS | Content matches, no timing drift | 0 |
| WARN | Content matches, but timing drift detected | 1 (use --resync to fix in place) |
| FAIL | Content does not match | 1 |
| UNSURE | Not enough transcription data to decide | 1 (use --pass-unsure to exit 0) |
| - | Error (missing dependency, unreadable file, no audio track) | 2 |
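
For scripting around these exit codes, the mapping can be sketched as follows. The names are hypothetical and the sketch covers a single pair only; how batch mode combines per-pair states into one exit code is not specified here.

```python
EXIT_CODES = {"PASS": 0, "WARN": 1, "FAIL": 1, "UNSURE": 1}
ERROR_EXIT_CODE = 2  # missing dependency, unreadable file, no audio track


def exit_code(state, pass_unsure=False):
    """Return the process exit code for a single pair's state."""
    if state == "UNSURE" and pass_unsure:
        return 0  # mirrors the documented effect of --pass-unsure
    return EXIT_CODES[state]
```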

Acknowledgements

submatch is a complement to the existing subtitle ecosystem, not a replacement for it. It wouldn't exist without:

Limitations

  • Requires a local Whisper install (pip install openai-whisper). No API key needed.
  • Cross-language scoring uses multilingual sentence embeddings and is less precise than same-language token F1 — consider lowering --cross-threshold if you get too many false negatives.
