A local-first subtitle sanitizer, readability optimizer, and NLE import formatter

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

subtitle-cleaner

A local-first Python CLI toolkit for sanitizing, repairing, and optimizing AI-generated subtitle files.

AI transcription tools (Whisper, CapCut, Premiere's auto-transcribe) save time up front — but the cleanup afterward is brutal. Stuttered words, broken import syntax, filler noise, and ugly line breaks pile up fast.

subtitle-cleaner runs entirely on your machine with no cloud dependencies, no privacy risk, and no subscription.

✨ Features

| Feature | Flag | Description |
|---|---|---|
| Duplicate word removal | (default) | Case-insensitively removes stutters (`"We're we're"` → `"We're"`) |
| Filler word stripping | (default) | Removes `um`, `uh`, `like` without breaking timings |
| HTML/font tag stripping | (default) | Cleans embedded `<font>`, `<i>`, `<b>` tags from SRT/VTT |
| Arrow & timestamp repair | (default) | Normalizes malformed arrows (`–>`, `->`) to the standard `-->` |
| Index re-sequencing | (default) | Renumbers out-of-order subtitle block indices, starting from 1 |
| Timestamp normalization | (default) | Pads and corrects malformed `HH:MM:SS,mmm` timestamps |
| Repair preview (diff) | `--preview` / `-p` | Non-destructive diff of proposed changes before saving |
| Semantic line breaking | `--segment` / `-s` | Splits long lines at grammatical boundaries (not character counts) |
| Mobile formatting | `--mobile` / `-m` | 30-character line width for 9:16 vertical video (TikTok/Shorts/Reels) |
| Custom vocabulary map | `--vocab vocab.json` | JSON-based find-and-replace for brand names, jargon, speaker names |
| NLE optimization | `--nle premiere\|resolve` | Premiere Pro (37-char limit, 2-line max) and Resolve (UTF-8 BOM) modes |
| Batch processing | `-i /folder/ -o /out/` | Recursively processes entire directories of `.srt`/`.vtt` files |
| Format conversion | `--format srt\|vtt\|ass` / `-f srt\|vtt\|ass` | Converts between SRT, VTT, and ASS formats (including timestamps and styles) |
| Word-level splitting | `--word-split` / `-w` | Splits blocks into single-word timed subtitle blocks (duration distributed proportionally) |
| Karaoke style export | `--karaoke` / `-k` | Exports to ASS format with `{\k}` highlighting tags timed in centiseconds |
| YouTube caption sync | `youtube_sync.py` | Downloads, cleans, and saves captions from any YouTube URL |
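
To make the "proportionally distributed duration" idea behind word-level splitting concrete, here is a minimal sketch. It is not the project's actual code: the function name `split_block_into_words` is hypothetical, and weighting each word by its character count is an assumption about how the split might be done.

```python
def split_block_into_words(start_ms: int, end_ms: int, text: str) -> list[tuple[int, int, str]]:
    """Split one subtitle block into contiguous single-word blocks.

    Assumption: each word gets a share of the duration proportional to
    its character count. The real tool may weight words differently.
    """
    words = text.split()
    if not words:
        return []
    total_chars = sum(len(word) for word in words)
    duration = end_ms - start_ms
    blocks = []
    cursor = start_ms
    consumed = 0
    for word in words:
        consumed += len(word)
        # Derive every boundary from the running total so rounding
        # errors never accumulate and the last block ends at end_ms.
        word_end = start_ms + round(duration * consumed / total_chars)
        blocks.append((cursor, word_end, word))
        cursor = word_end
    return blocks
```

Each returned tuple is `(start_ms, end_ms, word)`. The blocks are contiguous, so the first one starts at the original start time and the last one ends at the original end time.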

🚀 Installation

# Install directly from GitHub
pip install git+https://github.com/r1ngotchi/subtitle-cleaner.git

# Or clone and install in editable mode for development
git clone https://github.com/r1ngotchi/subtitle-cleaner
cd subtitle-cleaner
pip install -e .

💻 Quick Start

Once installed, the toolkit provides several CLI commands, including `subtitle-cleaner` and `subtitle-youtube-sync`:

# Preview what would change (non-destructive)
subtitle-cleaner -i messy.srt --preview

# Clean and save output
subtitle-cleaner -i messy.srt -o clean.srt

# Clean for Premiere Pro import (37-char line limit, 2-line max)
subtitle-cleaner -i messy.srt -o clean.srt --nle premiere

# Clean for DaVinci Resolve (UTF-8 BOM encoding)
subtitle-cleaner -i messy.srt -o clean.srt --nle resolve

# Semantic line breaking + mobile formatting
subtitle-cleaner -i messy.srt -o clean.srt --segment --mobile

# Apply custom vocabulary corrections
subtitle-cleaner -i messy.srt -o clean.srt --vocab my_vocab.json

# Batch process an entire folder
subtitle-cleaner -i ./subtitles/ -o ./cleaned/

# Convert SRT to VTT (strips indices, normalizes dots)
subtitle-cleaner -i input.srt -o output.vtt -f vtt

# Convert VTT to SRT (restores sequential indices, normalizes commas)
subtitle-cleaner -i input.vtt -o output.srt -f srt

# Split subtitles into single-word blocks (e.g. for vertical video captions)
subtitle-cleaner -i input.srt -o output.srt -w

# Convert to ASS format with karaoke highlighting (for animated captions)
subtitle-cleaner -i input.srt -o output.ass -k

# Download and clean YouTube captions directly
subtitle-youtube-sync https://www.youtube.com/watch?v=VIDEO_ID -o output.vtt
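
The SRT to VTT comments above mention stripping indices and switching to dot separators. A rough sketch of that conversion, assuming already-clean SRT input, could look like the following. The function name `srt_to_vtt` is hypothetical and this is not the tool's own converter.

```python
import re

# Matches the HH:MM:SS,mmm form used by SRT timing lines.
TIMING_RE = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert well-formed SRT text to WebVTT (illustrative only)."""
    out = ["WEBVTT", ""]
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        # Drop the numeric index line; WebVTT does not require one.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # Only the timing line has its comma separators swapped for dots,
        # so commas in the subtitle text are left alone.
        lines[0] = TIMING_RE.sub(r"\1.\2", lines[0])
        out.extend(lines)
        out.append("")
    return "\n".join(out)
```

Going the other way (VTT to SRT) would reverse both steps: re-add sequential indices and swap the dots back to commas.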

📋 Before / After Example

Input (messy.srt):

1
00:00:01,000 -> 00:00:04,000
Yeah, we're we're going to like, um, build this.

1
00:00:04,100 ---> 00:00:06,000
Like, uh, <font color="#ff0000">absolutely.</font>

Output after python cleaner.py -i messy.srt:

1
00:00:01,000 --> 00:00:04,000
Yeah, we're going to build this.

2
00:00:04,100 --> 00:00:06,000
Absolutely.
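
For readers curious how repairs like these can be expressed, here is a minimal regex-based sketch that reproduces the text and arrow fixes in the example above. It is an illustration under assumptions, not the implementation in `cleaner.py`: the names `clean_text` and `fix_arrow` and all of the patterns are hypothetical. Note that a naive filler pattern also removes legitimate uses of "like" (as in "I like this"), which a real cleaner would need to handle.

```python
import re

FILLERS = ("um", "uh", "like")  # filler words listed in the feature table

# Opening or closing <font>, <i>, <b> tags, with any attributes.
TAG_RE = re.compile(r"</?(?:font|i|b)\b[^>]*>", re.IGNORECASE)
# A filler word plus the comma and whitespace that trail it.
FILLER_RE = re.compile(r"\b(?:%s)\b,?\s*" % "|".join(FILLERS), re.IGNORECASE)
# The same word repeated back to back, compared case-insensitively.
STUTTER_RE = re.compile(r"\b([\w']+)(?:\s+\1\b)+", re.IGNORECASE)
# One or more dashes (ASCII hyphen or en dash) followed by ">".
ARROW_RE = re.compile(r"\s*[-\u2013]+>\s*")


def clean_text(line: str) -> str:
    """Strip tags, collapse stutters, and drop filler words from one text line."""
    line = TAG_RE.sub("", line)
    line = STUTTER_RE.sub(r"\1", line)
    line = FILLER_RE.sub("", line)
    line = re.sub(r"\s{2,}", " ", line).strip()
    # Re-capitalize in case a leading filler word was removed.
    return line[:1].upper() + line[1:]


def fix_arrow(timing_line: str) -> str:
    """Normalize malformed timing arrows such as '->' or '--->' to '-->'."""
    return ARROW_RE.sub(" --> ", timing_line)
```

Re-sequencing the block indices (the duplicate `1` in the input) is a separate pass that simply renumbers blocks in order, starting from 1.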

📦 Custom Vocabulary Map (`--vocab`)

Create a vocab.json file mapping AI mistranscriptions to their correct forms:

{
  "open eye": "OpenAI",
  "adobe premiere pro": "Adobe Premiere Pro",
  "da vinci resolve": "DaVinci Resolve"
}

Then run: subtitle-cleaner -i messy.srt -o clean.srt --vocab vocab.json
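
As a rough illustration of what a JSON-based find-and-replace involves, the sketch below applies a vocabulary map to a line of text. The helper names `load_vocab` and `apply_vocab` are hypothetical, and case-insensitive whole-phrase matching is an assumption rather than documented behavior of `--vocab`.

```python
import json
import re


def load_vocab(path: str) -> dict[str, str]:
    """Read a vocab.json file mapping wrong phrases to corrected ones."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def apply_vocab(text: str, vocab: dict[str, str]) -> str:
    """Replace each mistranscribed phrase with its corrected form."""
    # Longest phrases first, so a long key is not pre-empted by a
    # shorter key that overlaps it.
    for wrong in sorted(vocab, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement keeps backslashes in the corrected
        # text from being treated as regex escapes.
        text = pattern.sub(lambda _match, right=vocab[wrong]: right, text)
    return text
```

With the map shown above, `apply_vocab("we cut it in da vinci resolve", vocab)` would yield `"we cut it in DaVinci Resolve"`.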

🔬 Diagnostics

Run the linter to get a full report of issues before cleaning:

subtitle-diagnostics messy.srt

The linter checks for:

- Reading speed: flags blocks with dangerously high CPS (characters per second)
- NLE compatibility: warns about blocks of 3+ lines and lines over 37 characters (Premiere crash risk)
- Timing overlaps: detects blocks whose end time is later than the next block's start time
- Whitespace corruption: flags tabs, trailing spaces, and CRLF issues
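
The first three checks can be sketched in a few lines. This is an illustrative outline, not the code in `diagnostics.py` or `detectors/`: the `Block` class and `lint` function are hypothetical, and the 20 CPS threshold is an assumed value (the 37-character and 2-line limits come from the Premiere notes above).

```python
from dataclasses import dataclass


@dataclass
class Block:
    start_ms: int
    end_ms: int
    text: str


def chars_per_second(block: Block) -> float:
    """Reading speed: visible characters divided by on-screen seconds."""
    duration_s = (block.end_ms - block.start_ms) / 1000
    visible = len(block.text.replace("\n", ""))
    return float("inf") if duration_s <= 0 else visible / duration_s


def lint(blocks: list[Block], max_cps: float = 20.0,
         max_chars: int = 37, max_lines: int = 2) -> list[str]:
    """Return human-readable issues; max_cps is an assumed threshold."""
    issues = []
    for i, block in enumerate(blocks, start=1):
        if chars_per_second(block) > max_cps:
            issues.append(f"block {i}: reading speed above {max_cps} CPS")
        lines = block.text.split("\n")
        if len(lines) > max_lines:
            issues.append(f"block {i}: more than {max_lines} lines")
        if any(len(ln) > max_chars for ln in lines):
            issues.append(f"block {i}: line longer than {max_chars} chars")
        # i is 1-based, so blocks[i] is the block that follows this one.
        if i < len(blocks) and block.end_ms > blocks[i].start_ms:
            issues.append(f"block {i}: overlaps block {i + 1}")
    return issues
```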

⚙️ YouTube Caption Sync

subtitle-youtube-sync https://www.youtube.com/watch?v=VIDEO_ID -o captions.vtt

Downloads auto-generated or manually uploaded captions, cleans them, and saves a polished file ready for upload or NLE import.

🧪 Tests

Unit Tests

Verify individual module features and parser stability:

python -m unittest test_cleaner.py
# Expected: 19 tests, all passing

Regression Tests

Test the cleaner automatically against all collected real-world corruptions in the dataset:

subtitle-regression
# Expected: Runs and passes all dataset test cases with 0 critical errors remaining

📁 Project Structure

subtitle-cleaner/
├── cleaner.py          # Core CLI tool
├── diagnostics.py      # Linting & validation engine
├── youtube_sync.py     # YouTube caption downloader-cleaner
├── detectors/          # Modular lint checkers
│   ├── reading_speed.py
│   ├── nle_compatibility.py
│   └── ...
├── sample_input.srt    # Test fixture
├── sample_vocab.json   # Example vocabulary map
└── test_cleaner.py     # Unit + integration tests

🤝 Contributing

Found a subtitle file that breaks the parser? Submit an issue and attach the file — we're building a structured corruption dataset to improve repair reliability.

Pull requests welcome. See CONTRIBUTING.md for guidelines.

☕ Support & Funding

If subtitle-cleaner saves you hours of manual editing, consider supporting the project:

- Cryptocurrency: donate BTC, ETH, SOL, or Polygon via DONATIONS.md
- GitHub Sponsors: sponsor r1ngotchi

📄 License

MIT — free to use, modify, and distribute.

Release history

This version

0.3.3

Jun 12, 2026

0.3.2

Jun 12, 2026

0.3.1

Jun 12, 2026

0.3.0

Jun 12, 2026

0.2.0

Jun 12, 2026

Download files

Source Distribution

subtitle_cleaner-0.3.3.tar.gz (19.8 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

subtitle_cleaner-0.3.3-py3-none-any.whl (20.9 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file subtitle_cleaner-0.3.3.tar.gz.

File metadata

Download URL: subtitle_cleaner-0.3.3.tar.gz
Upload date: Jun 12, 2026
Size: 19.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for subtitle_cleaner-0.3.3.tar.gz
Algorithm	Hash digest
SHA256	`5d55703de0b9b4d68c159b4cfd563fe14f73d70e2ffe23abc58d368faa0ef3d4`
MD5	`a78a6c39be32ad7da4c83e6e2a6a7452`
BLAKE2b-256	`9de8db3b2f270065ff08fa9f2a55920ed6c04c5e1efc9917644a0befd90dcd29`
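
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. The helper names `sha256_of` and `matches_published_hash` below are hypothetical, and the local file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 16) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with the value listed on the release page."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("subtitle_cleaner-0.3.3.tar.gz", "5d55703de0b9b4d68c159b4cfd563fe14f73d70e2ffe23abc58d368faa0ef3d4")` should return `True` for an intact source download.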

File details

Details for the file subtitle_cleaner-0.3.3-py3-none-any.whl.

File metadata

Download URL: subtitle_cleaner-0.3.3-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 20.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for subtitle_cleaner-0.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`148948a6baec1fcde7abbe55601a6cc084909904ae4cf7b3a4eb798851dc708a`
MD5	`d187c4bf4f198f982f8a970baf9a60f5`
BLAKE2b-256	`50028d9352db6107e6a4b50a70e1402fd5d91b32cbc69c04650d885c49cd6573`
