A local-first subtitle sanitizer, readability optimizer, and NLE import formatter
subtitle-cleaner
A local-first Python CLI toolkit for sanitizing, repairing, and optimizing AI-generated subtitle files.
AI transcription tools (Whisper, CapCut, Premiere's auto-transcribe) save time up front — but the cleanup afterward is brutal. Stuttered words, broken import syntax, filler noise, and ugly line breaks pile up fast.
subtitle-cleaner runs entirely on your machine with no cloud dependencies, no privacy risk, and no subscription.
✨ Features
| Feature | Flag | Description |
|---|---|---|
| Duplicate word removal | (default) | Case-insensitively removes stutters ("We're we're" → "We're") |
| Filler word stripping | (default) | Removes um, uh, like without breaking timings |
| HTML/font tag stripping | (default) | Cleans embedded <font>, <i>, <b> tags from SRT/VTT |
| Arrow & timestamp repair | (default) | Normalizes unicode arrows (–>, ->) to standard --> |
| Index re-sequencing | (default) | Fixes out-of-order subtitle block indices starting from 1 |
| Timestamp normalization | (default) | Pads and corrects malformed HH:MM:SS,mmm timestamps |
| Repair preview (diff) | --preview / -p | Non-destructive diff of proposed changes before saving |
| Semantic line breaking | --segment / -s | Splits long lines at grammatical boundaries (not character counts) |
| Mobile formatting | --mobile / -m | 30-character line width for 9:16 vertical video (TikTok/Shorts/Reels) |
| Custom vocabulary map | --vocab vocab.json | JSON-based find-and-replace for brand names, jargon, speaker names |
| NLE optimization | --nle premiere\|resolve | Premiere Pro (37-char limit, 2-line max) and Resolve (UTF-8 BOM) modes |
| Batch processing | -i /folder/ -o /out/ | Recursively processes entire directories of .srt/.vtt files |
| Format conversion | --format srt\|vtt\|ass / -f srt\|vtt\|ass | Convert between SRT, VTT, and ASS formats (converts timestamps & styles) |
| Word-level splitting | --word-split / -w | Splits blocks into single-word timed subtitle blocks (proportionally distributed duration) |
| Karaoke style export | --karaoke / -k | Exports to ASS format with highlighting {\k} centisecond timings |
| YouTube caption sync | youtube_sync.py | Downloads, cleans, and saves captions from any YouTube URL |
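To make the default text repairs in the table more concrete, here is a minimal sketch of stutter removal, filler stripping, and tag stripping on a single subtitle line. It is not the tool's actual implementation: the function name clean_text, the regular expressions, the comma heuristic for "like", and the final re-capitalization are all assumptions made for illustration.

```python
import re

# Assumed patterns; the tool's real filler list and tag handling may differ.
_TAG = re.compile(r"</?(?:font|i|b)\b[^>]*>", re.IGNORECASE)
# A word followed by one or more case-insensitive repeats of itself.
_STUTTER = re.compile(r"(?<![\w'])([\w']+)(?:\s+\1(?![\w']))+", re.IGNORECASE)
# "like" is only treated as filler when a comma follows it (a crude heuristic).
_FILLER = re.compile(r"\b(?:um|uh|like(?=,))\b,?\s*", re.IGNORECASE)


def clean_text(text: str) -> str:
    """Strip tags, collapse stutters, drop fillers, then tidy spacing."""
    text = _TAG.sub("", text)
    text = _STUTTER.sub(r"\1", text)   # keep the first occurrence only
    text = _FILLER.sub("", text)
    text = re.sub(r"[ \t]{2,}", " ", text).strip()
    # Re-capitalize in case a leading filler was removed (an assumption).
    return text[:1].upper() + text[1:]


print(clean_text("Yeah, we're we're going to like, um, build this."))
print(clean_text('Like, uh, <font color="#ff0000">absolutely.</font>'))
```

Only the text of a block is touched, so timings stay as they were. A regex-only approach like this will misfire on some legitimate repetitions ("that that") and cannot tell every filler "like" from the verb, which is why the sketch restricts it to the comma case.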
🚀 Installation
# Install directly from GitHub
pip install git+https://github.com/r1ngotchi/subtitle-cleaner.git
# Or clone and install in editable mode for development
git clone https://github.com/r1ngotchi/subtitle-cleaner
cd subtitle-cleaner
pip install -e .
💻 Quick Start
Once installed, the toolkit provides several CLI commands. The examples below cover subtitle-cleaner and subtitle-youtube-sync:
# Preview what would change (non-destructive)
subtitle-cleaner -i messy.srt --preview
# Clean and save output
subtitle-cleaner -i messy.srt -o clean.srt
# Clean for Premiere Pro import (37-char line limit, 2-line max)
subtitle-cleaner -i messy.srt -o clean.srt --nle premiere
# Clean for DaVinci Resolve (UTF-8 BOM encoding)
subtitle-cleaner -i messy.srt -o clean.srt --nle resolve
# Semantic line breaking + mobile formatting
subtitle-cleaner -i messy.srt -o clean.srt --segment --mobile
# Apply custom vocabulary corrections
subtitle-cleaner -i messy.srt -o clean.srt --vocab my_vocab.json
# Batch process an entire folder
subtitle-cleaner -i ./subtitles/ -o ./cleaned/
# Convert SRT to VTT (strips indices, normalizes dots)
subtitle-cleaner -i input.srt -o output.vtt -f vtt
# Convert VTT to SRT (restores sequential indices, normalizes commas)
subtitle-cleaner -i input.vtt -o output.srt -f srt
# Split subtitles into single-word blocks (e.g. for vertical video captions)
subtitle-cleaner -i input.srt -o output.srt -w
# Convert to ASS format with karaoke highlighting (for animated captions)
subtitle-cleaner -i input.srt -o output.ass -k
# Download and clean YouTube captions directly
subtitle-youtube-sync https://www.youtube.com/watch?v=VIDEO_ID -o output.vtt
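The --word-split and --karaoke options both depend on giving each word its own slice of the block's duration. The sketch below shows one plausible way to do that, assuming the split is proportional to word length in characters (the README only says "proportionally distributed", so the weighting is a guess). The function names split_words and karaoke_text are hypothetical. The {\k} tag itself is standard ASS syntax and takes a duration in centiseconds.

```python
def split_words(start_ms: int, end_ms: int, text: str) -> list[tuple[int, int, str]]:
    """Return (start_ms, end_ms, word) triples covering the whole block."""
    words = text.split()
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []
    duration = end_ms - start_ms
    blocks, cursor, consumed = [], start_ms, 0
    for word in words:
        consumed += len(word)
        # Cumulative share of characters, so the last word ends exactly at end_ms.
        word_end = start_ms + duration * consumed // total_chars
        blocks.append((cursor, word_end, word))
        cursor = word_end
    return blocks


def karaoke_text(blocks: list[tuple[int, int, str]]) -> str:
    """Build an ASS dialogue text with one {\\k<centiseconds>} tag per word."""
    parts = []
    for start, end, word in blocks:
        centiseconds = (end - start) // 10
        parts.append("{\\k" + str(centiseconds) + "}" + word)
    return " ".join(parts)


blocks = split_words(0, 1000, "ab cdef gh")
print(blocks)                # [(0, 250, 'ab'), (250, 750, 'cdef'), (750, 1000, 'gh')]
print(karaoke_text(blocks))  # {\k25}ab {\k50}cdef {\k25}gh
```

Working from cumulative character counts rather than per-word durations keeps rounding errors from accumulating, so the word blocks never drift past the original end time.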
📋 Before / After Example
Input (messy.srt):
1
00:00:01,000 -> 00:00:04,000
Yeah, we're we're going to like, um, build this.

1
00:00:04,100 ---> 00:00:06,000
Like, uh, <font color="#ff0000">absolutely.</font>
Output after python cleaner.py -i messy.srt:
1
00:00:01,000 --> 00:00:04,000
Yeah, we're going to build this.

2
00:00:04,100 --> 00:00:06,000
Absolutely.
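The structural repairs in this example (malformed arrows, duplicate indices) can be sketched independently of the text cleanup. The code below is an illustration under stated assumptions, not the tool's parser: it assumes blocks are separated by blank lines, and the names fix_timestamp and repair_srt are hypothetical. Text lines are passed through unchanged.

```python
import re

_TS = r"([\d:.,]+)"
# Accepts "->", "-->", "--->" and en/em-dash variants between two timestamps.
_TIMING = re.compile(r"^\s*" + _TS + r"\s*[-–—]+>\s*" + _TS + r"\s*$")


def fix_timestamp(ts: str) -> str:
    """Pad a loose timestamp to HH:MM:SS,mmm; return it unchanged if unparseable."""
    m = re.fullmatch(r"(?:(\d+):)?(\d+):(\d+)(?:[.,](\d+))?", ts)
    if not m:
        return ts
    hours, minutes, seconds, frac = m.groups()
    millis = (frac or "").ljust(3, "0")[:3]  # treat digits as a decimal fraction
    return f"{int(hours or 0):02d}:{int(minutes):02d}:{int(seconds):02d},{millis}"


def repair_srt(raw: str) -> str:
    """Normalize timing lines and renumber blocks sequentially from 1."""
    out = []
    for chunk in re.split(r"\n\s*\n", raw.strip()):
        lines = chunk.splitlines()
        for i, line in enumerate(lines):
            m = _TIMING.match(line)
            if m:
                timing = f"{fix_timestamp(m.group(1))} --> {fix_timestamp(m.group(2))}"
                # Discard whatever index preceded the timing line and assign a new one.
                out.append("\n".join([str(len(out) + 1), timing, *lines[i + 1:]]))
                break
    return "\n\n".join(out) + "\n"


messy = "1\n00:00:01,000 -> 00:00:04,000\nHello.\n\n1\n00:00:04,100 ---> 00:00:06,000\nWorld.\n"
print(repair_srt(messy))
```

Chunks without a recognizable timing line are dropped in this sketch; a real repair tool would more likely report them than discard them silently.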
📦 Custom Vocabulary Map (--vocab)
Create a vocab.json file mapping AI mistranscriptions to their correct forms:
{
  "open eye": "OpenAI",
  "adobe premiere pro": "Adobe Premiere Pro",
  "da vinci resolve": "DaVinci Resolve"
}
Then run: python cleaner.py -i messy.srt -o clean.srt --vocab vocab.json
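As a rough picture of what a vocabulary map like this does, the sketch below applies each entry as a whole-phrase, case-insensitive replacement. Whether the tool matches case-insensitively or on word boundaries is not stated here, so both behaviors are assumptions, and apply_vocab is a hypothetical name.

```python
import json
import re


def apply_vocab(text: str, vocab: dict[str, str]) -> str:
    """Replace each vocab key with its value, matching whole phrases only."""
    # Longest keys first so a long phrase is not pre-empted by a shorter one.
    for wrong in sorted(vocab, key=len, reverse=True):
        pattern = re.compile(r"(?<!\w)" + re.escape(wrong) + r"(?!\w)", re.IGNORECASE)
        # A function replacement avoids backslash escapes being interpreted.
        text = pattern.sub(lambda _m, right=vocab[wrong]: right, text)
    return text


vocab = json.loads('{"open eye": "OpenAI", "da vinci resolve": "DaVinci Resolve"}')
print(apply_vocab("We cut the Open Eye demo in da vinci resolve.", vocab))
```

The boundary checks keep a key such as "open eye" from matching inside a longer word like "open eyes".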
🔬 Diagnostics
Run the linter to get a full report of issues before cleaning:
subtitle-diagnostics messy.srt
The linter checks for:
- Reading speed — flags blocks with dangerously high CPS (characters per second)
- NLE compatibility — warns about 3+ line blocks and lines >37 chars (Premiere crash risk)
- Timing overlaps — detects blocks where end time > next block's start time
- Whitespace corruption — tabs, trailing spaces, CRLF issues
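Three of the checks listed above reduce to simple arithmetic on each block, which the following sketch illustrates. The Block class, the lint function, the issue codes, and the 20 CPS threshold are assumptions for the example. Only the 37-character and 3+ line limits come from this README.

```python
from dataclasses import dataclass


@dataclass
class Block:
    start_ms: int
    end_ms: int
    text: str  # lines separated by "\n"


def chars_per_second(block: Block) -> float:
    """Characters (excluding line breaks) divided by on-screen duration."""
    duration_s = (block.end_ms - block.start_ms) / 1000
    if duration_s <= 0:
        return float("inf")
    return len(block.text.replace("\n", "")) / duration_s


def lint(blocks: list[Block], max_cps: float = 20.0,
         max_line_len: int = 37, max_lines: int = 2) -> list[tuple[int, str]]:
    """Return (block_number, issue_code) pairs; max_cps is an assumed default."""
    issues = []
    for number, block in enumerate(blocks, start=1):
        lines = block.text.split("\n")
        if chars_per_second(block) > max_cps:
            issues.append((number, "reading-speed"))
        if len(lines) > max_lines:
            issues.append((number, "too-many-lines"))
        if any(len(line) > max_line_len for line in lines):
            issues.append((number, "line-too-long"))
        # blocks[number] is the next block because numbering starts at 1.
        if number < len(blocks) and block.end_ms > blocks[number].start_ms:
            issues.append((number, "overlap"))
    return issues


sample = [Block(0, 1000, "x" * 30), Block(900, 3000, "ok")]
print(lint(sample))  # [(1, 'reading-speed'), (1, 'overlap')]
```

Returning structured pairs instead of formatted strings makes it easy to filter by issue type or feed the results into a report.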
⚙️ YouTube Caption Sync
subtitle-youtube-sync https://www.youtube.com/watch?v=VIDEO_ID -o captions.vtt
Downloads auto-generated or manually uploaded captions, cleans them, and saves a polished file ready for upload or NLE import.
🧪 Tests
Unit Tests
Verify individual module features and parser stability:
python -m unittest test_cleaner.py
# Expected: 19 tests, all passing
Regression Tests
Test the cleaner automatically against all collected real-world corruptions in the dataset:
subtitle-regression
# Expected: Runs and passes all dataset test cases with 0 critical errors remaining
📁 Project Structure
subtitle-cleaner/
├── cleaner.py # Core CLI tool
├── diagnostics.py # Linting & validation engine
├── youtube_sync.py # YouTube caption downloader-cleaner
├── detectors/ # Modular lint checkers
│ ├── reading_speed.py
│ ├── nle_compatibility.py
│ └── ...
├── sample_input.srt # Test fixture
├── sample_vocab.json # Example vocabulary map
└── test_cleaner.py # Unit + integration tests
🤝 Contributing
Found a subtitle file that breaks the parser? Submit an issue and attach the file — we're building a structured corruption dataset to improve repair reliability.
Pull requests welcome. See CONTRIBUTING.md for guidelines.
☕ Support & Funding
If subtitle-cleaner saves you hours of manual editing, consider supporting the project:
- Ko-fi: Support r1ngotchi on Ko-fi
- GitHub Sponsors: Sponsor r1ngotchi
📄 License
MIT — free to use, modify, and distribute.