
Semantic subtitle aligner and merger for bilingual subtitle syncing.


🎬 DuoSubs


Merging subtitles using only the nearest timestamp often leads to incorrect pairings — lines may end up out of sync, duplicated, or mismatched.

This Python tool uses semantic similarity (via Sentence Transformers) to align subtitle lines based on meaning instead of timestamps — making it possible to pair subtitles across different languages.
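As a rough illustration of pairing by meaning, the sketch below matches each primary line to the most similar secondary line using cosine similarity. The three-dimensional vectors are hand-written stand-ins for real Sentence Transformer embeddings, and `cosine_similarity` and `best_match` are hypothetical helper names, not part of the DuoSubs API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query, candidates):
    """Index of the candidate vector most similar to the query vector."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy "embeddings" (assumed values, not real model output).
# The two secondary lines appear in the opposite order to the primary lines.
primary = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
secondary = [[0.1, 0.1, 0.8], [0.8, 0.2, 0.1]]

pairs = [(i, best_match(vec, secondary)) for i, vec in enumerate(primary)]
print(pairs)  # [(0, 1), (1, 0)]
```

Each line is matched independently here, so nothing prevents two primary lines from claiming the same secondary line. An order-preserving alignment is what avoids that.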


✨ Features

  • 📌 Aligns subtitle lines based on meaning, not timing
  • 🌍 Multilingual support, depending on the user-selected Sentence Transformer model
  • 🧩 Easy-to-use API for integration
  • 💻 Command-line interface with customizable options
  • 📄 Flexible format support — works with SRT, VTT, MPL2, TTML, ASS, SSA files

🛠️ Installation

  1. Install the correct version of PyTorch for your system by following the official instructions: https://pytorch.org/get-started/locally
  2. Install the package via pip:
    pip install duosubs
    

🚀 Usage

With the demo files provided, here is the simplest way to get started:

  • via command line

    duosubs -p demo/primary_sub.srt -s demo/secondary_sub.srt
    
  • via Python API

    from duosubs import MergeArgs, run_merge_pipeline
    
    # Store all arguments
    args = MergeArgs(
        primary="demo/primary_sub.srt",
        secondary="demo/secondary_sub.srt"
    )
    
    # Load, merge, and save subtitles.
    run_merge_pipeline(args, print)
    

Both approaches produce primary_sub.zip, with the following structure:

primary_sub.zip
├── primary_sub_combined.ass   # Merged subtitles
├── primary_sub_primary.ass    # Original primary subtitles
└── primary_sub_secondary.ass  # Time-shifted secondary subtitles
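
To inspect the archive without unpacking it by hand, something like the following sketch may help. It uses only the standard-library `zipfile` module. The helper names are hypothetical, and the UTF-8 default is an assumption about how the subtitle files are encoded.

```python
import zipfile


def list_archive_members(zip_path):
    """Return the sorted file names stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


def read_member_text(zip_path, member, encoding="utf-8"):
    """Read one archive member as text (the encoding is an assumption)."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(member).decode(encoding)
```

For example, `list_archive_members("primary_sub.zip")` should list the three .ass files shown above, assuming the merge has already been run in the current directory.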

By default, the Sentence Transformer model used is LaBSE.

If you want to experiment with different models, pick one from 🤗 Hugging Face or check the leaderboard for top-performing models.

For example, if the model chosen is Qwen/Qwen3-Embedding-0.6B, you can run:

  • via command line

    duosubs -p demo/primary_sub.srt -s demo/secondary_sub.srt --model Qwen/Qwen3-Embedding-0.6B
    
  • via Python API

    from duosubs import MergeArgs, run_merge_pipeline
    
    # Store all arguments
    args = MergeArgs(
        primary="demo/primary_sub.srt",
        secondary="demo/secondary_sub.srt",
        model="Qwen/Qwen3-Embedding-0.6B"
    )
    
    # Load, merge, and save subtitles.
    run_merge_pipeline(args, print)
    

⚠️ Warning

  • Some models may require significant RAM or GPU (VRAM) to run, and might not be compatible with all devices — especially larger models.
  • Also, please ensure the selected model supports your desired language for reliable results.

To learn more about this tool, please see the documentation.


📚 Behind the Scenes

  1. Parse subtitles and detect language.
  2. Tokenize subtitle lines.
  3. Extract and filter non-overlapping subtitles. (Optional)
  4. Estimate tokenized subtitle pairings using DTW.
  5. Refine alignment using a sliding window approach.
  6. Combine aligned and non-overlapping subtitles.
  7. Eliminate unnecessary newlines within subtitle lines.
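
Step 4 relies on DTW (dynamic time warping). The sketch below shows textbook DTW over a small cost matrix, where each cost is assumed to be something like 1 minus the cosine similarity between a primary token and a secondary token. It is a generic illustration of the technique, not the DuoSubs implementation, and `dtw_path` is a hypothetical name.

```python
def dtw_path(cost):
    """Return the minimum-cost monotonic alignment path through a cost matrix.

    cost[i][j] is the dissimilarity between primary item i and secondary item j.
    The path starts at (0, 0), ends at the last cell, and never moves backwards.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")

    # acc[i][j] holds the cheapest cumulative cost to align the first i
    # primary items with the first j secondary items.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance both sequences
                acc[i - 1][j],      # advance primary only
                acc[i][j - 1],      # advance secondary only
            )

    # Backtrack from the final cell, always stepping to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda step: step[0])
    path.reverse()
    return path


# Assumed toy costs: low values on the diagonal mean matching items.
toy_cost = [
    [0.1, 0.9, 0.9],
    [0.9, 0.2, 0.8],
    [0.9, 0.8, 0.1],
]
print(dtw_path(toy_cost))  # [(0, 0), (1, 1), (2, 2)]
```

Because the path can only move forward, DTW keeps pairings in order, which independent nearest-match lookups do not guarantee.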

🚫 Known Limitations

  • The accuracy of the merging process depends on the model selected.
  • Some models may produce unreliable results for unsupported or low-resource languages.
  • Some sentence fragments from the secondary subtitles may be misaligned with the primary subtitle lines due to the tokenization algorithm used.
  • Secondary subtitles might contain extra whitespace as a result of token-level merging.
  • The algorithm may not work reliably if the timestamps of some matching lines don’t overlap at all. See special case.

🧩 Special Case

The last known limitation can be worked around if both subtitle files are known to be perfectly semantically aligned, meaning they have:

  • matching dialogue content
  • no extra lines, such as scene annotations or bonus Director’s Cut material.

In that case, enable the --ignore-non-overlap-filter CLI option to skip the overlap check, and the merge should go smoothly.

⚠️ If the subtitle timings are off and the two subtitle files don’t fully match in content, the algorithm likely won’t produce great results. Still, you can try running it with --ignore-non-overlap-filter enabled.
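
To make the idea of an overlap check concrete, here is a minimal sketch of how cues could be split by whether their time ranges overlap any cue in the other file. This is an assumed, simplified model with cues as (start, end) pairs in seconds. The filter DuoSubs actually applies may work differently, and both function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) time ranges share any span of time."""
    return a[0] < b[1] and b[0] < a[1]


def split_by_overlap(primary, secondary):
    """Split primary cues into (overlapping, non_overlapping) lists.

    A primary cue counts as overlapping if it intersects at least one
    secondary cue in time.
    """
    overlapping, non_overlapping = [], []
    for cue in primary:
        if any(overlaps(cue, other) for other in secondary):
            overlapping.append(cue)
        else:
            non_overlapping.append(cue)
    return overlapping, non_overlapping


# Assumed toy timings: the second primary cue has no counterpart in time.
primary_cues = [(0.0, 2.0), (5.0, 7.0)]
secondary_cues = [(1.0, 3.0)]
print(split_by_overlap(primary_cues, secondary_cues))
```

If the two files are shifted far enough apart, a check like this would classify lines with matching dialogue as non-overlapping. That is the situation where skipping the filter helps.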


🙏 Acknowledgements

This project wouldn't be possible without the incredible work of the open-source community. Special thanks to:


🤝 Contributing

Contributions are welcome! If you'd like to submit a pull request, please check out the contributing guidelines.


🔑 License

Apache-2.0 license - see the LICENSE file for details.
