A local-first subtitle sanitizer, readability optimizer, and NLE import formatter

subtitle-cleaner

A local-first Python CLI toolkit for sanitizing, repairing, and optimizing AI-generated subtitle files.

AI transcription tools (Whisper, CapCut, Premiere's auto-transcribe) save time up front — but the cleanup afterward is brutal. Stuttered words, broken import syntax, filler noise, and ugly line breaks pile up fast.

subtitle-cleaner runs entirely on your machine with no cloud dependencies, no privacy risk, and no subscription.


✨ Features

Feature | Flag | Description
Duplicate word removal | (default) | Case-insensitively removes stutters ("We're we're" → "We're")
Filler word stripping | (default) | Removes um, uh, like without breaking timings
HTML/font tag stripping | (default) | Cleans embedded <font>, <i>, <b> tags from SRT/VTT
Arrow & timestamp repair | (default) | Normalizes malformed arrows (–>, ->) to the standard -->
Index re-sequencing | (default) | Renumbers out-of-order subtitle block indices sequentially, starting from 1
Timestamp normalization | (default) | Pads and corrects malformed HH:MM:SS,mmm timestamps
Repair preview (diff) | --preview / -p | Non-destructive diff of proposed changes before saving
Semantic line breaking | --segment / -s | Splits long lines at grammatical boundaries (not character counts)
Mobile formatting | --mobile / -m | 30-character line width for 9:16 vertical video (TikTok/Shorts/Reels)
Custom vocabulary map | --vocab vocab.json | JSON-based find-and-replace for brand names, jargon, speaker names
NLE optimization | --nle premiere|resolve | Premiere Pro (37-char limit, 2-line max) and Resolve (UTF-8 BOM) modes
Batch processing | -i /folder/ -o /out/ | Recursively processes entire directories of .srt/.vtt files
Format conversion | --format srt|vtt|ass / -f srt|vtt|ass | Converts between SRT, VTT, and ASS formats (including timestamps and styles)
Word-level splitting | --word-split / -w | Splits blocks into single-word timed subtitle blocks (duration distributed proportionally)
Karaoke style export | --karaoke / -k | Exports to ASS format with {\k} karaoke highlighting (centisecond timings)
YouTube caption sync | youtube_sync.py | Downloads, cleans, and saves captions from any YouTube URL
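
To illustrate what case-insensitive duplicate word removal can look like, here is a minimal sketch built on a regex backreference. The name remove_stutters is hypothetical and this is not the project's actual implementation; note that a rule this simple also collapses legitimate repeats such as "that that".

```python
import re

# Hypothetical helper for illustration only, not subtitle-cleaner's real code.
# Captures a word (optionally with an apostrophe suffix such as "we're") and
# matches one or more immediate repeats of it, ignoring case.
_STUTTER = re.compile(r"\b(\w+(?:'\w+)?)(?:\s+\1\b)+", re.IGNORECASE)


def remove_stutters(text: str) -> str:
    """Collapse immediately repeated words, keeping the first occurrence."""
    return _STUTTER.sub(r"\1", text)


if __name__ == "__main__":
    print(remove_stutters("Yeah, we're we're going to build this."))
```

Because only the subtitle text is rewritten, the block's timing line is left untouched.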

🚀 Installation

# Install directly from GitHub
pip install git+https://github.com/r1ngotchi/subtitle-cleaner.git

# Or clone and install in editable mode for development
git clone https://github.com/r1ngotchi/subtitle-cleaner
cd subtitle-cleaner
pip install -e .

💻 Quick Start

Once installed, the toolkit provides several CLI commands. Common invocations of subtitle-cleaner and subtitle-youtube-sync:

# Preview what would change (non-destructive)
subtitle-cleaner -i messy.srt --preview

# Clean and save output
subtitle-cleaner -i messy.srt -o clean.srt

# Clean for Premiere Pro import (37-char line limit, 2-line max)
subtitle-cleaner -i messy.srt -o clean.srt --nle premiere

# Clean for DaVinci Resolve (UTF-8 BOM encoding)
subtitle-cleaner -i messy.srt -o clean.srt --nle resolve

# Semantic line breaking + mobile formatting
subtitle-cleaner -i messy.srt -o clean.srt --segment --mobile

# Apply custom vocabulary corrections
subtitle-cleaner -i messy.srt -o clean.srt --vocab my_vocab.json

# Batch process an entire folder
subtitle-cleaner -i ./subtitles/ -o ./cleaned/

# Convert SRT to VTT (strips indices, converts comma millisecond separators to dots)
subtitle-cleaner -i input.srt -o output.vtt -f vtt

# Convert VTT to SRT (restores sequential indices, converts dot millisecond separators to commas)
subtitle-cleaner -i input.vtt -o output.srt -f srt

# Split subtitles into single-word blocks (e.g. for vertical video captions)
subtitle-cleaner -i input.srt -o output.srt -w

# Convert to ASS format with karaoke highlighting (for animated captions)
subtitle-cleaner -i input.srt -o output.ass -k

# Download and clean YouTube captions directly
subtitle-youtube-sync https://www.youtube.com/watch?v=VIDEO_ID -o output.vtt
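
As a rough sketch of how word-level splitting could distribute a block's duration proportionally, the example below weights each word by its character count. That weighting, the function name split_words, and the millisecond tuples are all assumptions made for illustration; the actual -w behavior may differ.

```python
def split_words(text: str, start_ms: int, end_ms: int) -> list[tuple[int, int, str]]:
    """Split one subtitle block into (start_ms, end_ms, word) tuples.

    Hypothetical sketch: each word gets a share of the block's duration
    proportional to its length in characters.
    """
    words = text.split()
    if not words:
        return []

    total_chars = sum(len(word) for word in words)
    duration = end_ms - start_ms

    blocks = []
    cursor = start_ms
    consumed = 0
    for word in words:
        consumed += len(word)
        # Derive each boundary from the cumulative share so integer rounding
        # never accumulates and the last word ends exactly at end_ms.
        word_end = start_ms + duration * consumed // total_chars
        blocks.append((cursor, word_end, word))
        cursor = word_end
    return blocks


if __name__ == "__main__":
    for block in split_words("Yeah we're going", 1000, 4000):
        print(block)
```

Computing boundaries from cumulative totals keeps the generated blocks contiguous, so no gaps or overlaps are introduced between words.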

📋 Before / After Example

Input (messy.srt):

1
00:00:01,000 -> 00:00:04,000
Yeah, we're we're going to like, um, build this.

1
00:00:04,100 ---> 00:00:06,000
Like, uh, <font color="#ff0000">absolutely.</font>

Output after python cleaner.py -i messy.srt:

1
00:00:01,000 --> 00:00:04,000
Yeah, we're going to build this.

2
00:00:04,100 --> 00:00:06,000
Absolutely.
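
The arrow repair and index re-sequencing visible in this example can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's code: repair_srt is a hypothetical name, blocks are assumed to be separated by blank lines, and filler or tag stripping is left out.

```python
import re

# Two timestamps joined by a malformed arrow: one or more hyphens, en dashes,
# or em dashes followed by ">", or a single unicode right arrow.
_TIMING = re.compile(
    r"^(\d{1,2}:\d{2}:\d{2}[,.]\d{1,3})\s*(?:[-\u2013\u2014]+>|\u2192)\s*"
    r"(\d{1,2}:\d{2}:\d{2}[,.]\d{1,3})\s*$"
)


def repair_srt(text: str) -> str:
    """Normalize arrows to ' --> ' and renumber blocks from 1 (hypothetical sketch)."""
    blocks = re.split(r"\n\s*\n", text.strip())
    repaired = []
    for number, block in enumerate(blocks, start=1):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = _TIMING.match(line.strip())
            if match:
                timing = f"{match.group(1)} --> {match.group(2)}"
                # Discard whatever preceded the timing line (the old index)
                # and write a fresh sequential one.
                repaired.append("\n".join([str(number), timing, *lines[i + 1:]]))
                break
        else:
            # No recognizable timing line: keep the block unchanged.
            repaired.append(block)
    return "\n\n".join(repaired) + "\n"
```

The sketch leaves the timestamps themselves as they are; padding and correcting malformed values would be a separate step.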

📦 Custom Vocabulary Map (--vocab)

Create a vocab.json file mapping AI mistranscriptions to their correct forms:

{
  "open eye": "OpenAI",
  "adobe premiere pro": "Adobe Premiere Pro",
  "da vinci resolve": "DaVinci Resolve"
}

Then run: python cleaner.py -i messy.srt -o clean.srt --vocab vocab.json
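
For intuition, a vocabulary map like this could be applied with a whole-phrase, case-insensitive replacement. The sketch below is an assumption about the matching rules, not a description of the tool's internals, and apply_vocab is a hypothetical name.

```python
import json
import re


def apply_vocab(text: str, vocab: dict[str, str]) -> str:
    """Replace each vocabulary key with its corrected form (hypothetical sketch).

    Assumes case-insensitive, whole-phrase matching and keys that begin and
    end with a letter or digit.
    """
    # Longest keys first, so a longer phrase wins over a shorter one it contains.
    for wrong in sorted(vocab, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash interpretation in the value.
        text = pattern.sub(lambda _match, right=vocab[wrong]: right, text)
    return text


if __name__ == "__main__":
    vocab = json.loads('{"open eye": "OpenAI", "da vinci resolve": "DaVinci Resolve"}')
    print(apply_vocab("We used Open Eye and Da Vinci Resolve.", vocab))
```

Replacements run one key at a time, so a corrected form that happens to contain another key would be rewritten again; a single combined pattern would avoid that at the cost of a little more code.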


🔬 Diagnostics

Run the linter to get a full report of issues before cleaning:

subtitle-diagnostics messy.srt

The linter checks for:

  • Reading speed — flags blocks with dangerously high CPS (characters per second)
  • NLE compatibility — warns about 3+ line blocks and lines >37 chars (Premiere crash risk)
  • Timing overlaps — detects blocks where end time > next block's start time
  • Whitespace corruption — tabs, trailing spaces, CRLF issues
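
Three of these checks are simple enough to sketch. The Block dataclass and the function names below are hypothetical and are not the modules in detectors/; only the 37-character and 2-line limits come from this README, and no CPS threshold is assumed.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical in-memory subtitle block with millisecond timings."""
    start_ms: int
    end_ms: int
    text: str


def chars_per_second(block: Block) -> float:
    """Reading speed: visible characters divided by on-screen seconds."""
    duration_s = (block.end_ms - block.start_ms) / 1000
    visible = len(block.text.replace("\n", ""))
    return float("inf") if duration_s <= 0 else visible / duration_s


def find_overlaps(blocks: list[Block]) -> list[int]:
    """Indices of blocks whose end time is later than the next block's start."""
    return [
        i for i in range(len(blocks) - 1)
        if blocks[i].end_ms > blocks[i + 1].start_ms
    ]


def nle_warnings(block: Block, max_chars: int = 37, max_lines: int = 2) -> list[str]:
    """Flag blocks with too many lines or lines that are too long."""
    lines = block.text.splitlines()
    warnings = []
    if len(lines) > max_lines:
        warnings.append(f"{len(lines)} lines (max {max_lines})")
    for n, line in enumerate(lines, start=1):
        if len(line) > max_chars:
            warnings.append(f"line {n} is {len(line)} chars (max {max_chars})")
    return warnings
```

A linter built this way would compare chars_per_second against whatever reading-speed limit suits the target audience and report each finding with the block's index.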

⚙️ YouTube Caption Sync

subtitle-youtube-sync https://www.youtube.com/watch?v=VIDEO_ID -o captions.vtt

Downloads auto-generated or manually uploaded captions, cleans them, and saves a polished file ready for upload or NLE import.


🧪 Tests

Unit Tests

Verify individual module features and parser stability:

python -m unittest test_cleaner.py
# Expected: 19 tests, all passing

Regression Tests

Test the cleaner automatically against all collected real-world corruptions in the dataset:

subtitle-regression
# Expected: Runs and passes all dataset test cases with 0 critical errors remaining

📁 Project Structure

subtitle-cleaner/
├── cleaner.py          # Core CLI tool
├── diagnostics.py      # Linting & validation engine
├── youtube_sync.py     # YouTube caption downloader-cleaner
├── detectors/          # Modular lint checkers
│   ├── reading_speed.py
│   ├── nle_compatibility.py
│   └── ...
├── sample_input.srt    # Test fixture
├── sample_vocab.json   # Example vocabulary map
└── test_cleaner.py     # Unit + integration tests

🤝 Contributing

Found a subtitle file that breaks the parser? Submit an issue and attach the file — we're building a structured corruption dataset to improve repair reliability.

Pull requests welcome. See CONTRIBUTING.md for guidelines.


☕ Support & Funding

If subtitle-cleaner saves you hours of manual editing, consider supporting the project.


📄 License

MIT — free to use, modify, and distribute.
