Introduction
Open dubbing is an AI dubbing system that uses machine learning models to automatically translate and synchronize audio dialogue into different languages.
At the moment, it is purely experimental and an excuse to help me better understand STT, TTS, and translation systems combined.
Features
- Built on top of open-source models and able to run locally
- Automatically dubs a video from a source language into a target language
- Supports multiple Text To Speech (TTS) engines
- Voice gender detection to properly assign a synthetic voice
Roadmap
Areas that we would like to explore:
- Automatic detection of the source language of the video (using Whisper)
- Better control of voice used for dubbing
- Support for additional TTS systems
- Optimize it for long videos and less resource usage
- Support for multiple video input formats
Demo
This video intentionally shows the strengths and limitations of the system.
Original English video
https://github.com/user-attachments/assets/54c0d37f-0cc8-4ea2-8f8d-fd2d2f4eeccc
Automatic dubbed video in Catalan
https://github.com/user-attachments/assets/99936655-5851-4d0c-827b-f36f79f56190
Limitations
- This is an experimental project
- Automatic video dubbing involves speech recognition, translation, voice synthesis, and more; errors can be introduced at each of these steps
Supported languages
The supported languages depend on the combination of speech-to-text, translation, and text-to-speech systems used. With Coqui TTS, these are the languages supported (I have only tested a few of them):
Supported source languages: Afrikaans, Amharic, Armenian, Assamese, Bashkir, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Gujarati, Haitian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Lingala, Lithuanian, Luxembourgish, Macedonian, Malayalam, Maltese, Maori, Marathi, Modern Greek (1453-), Norwegian Nynorsk, Occitan (post 1500), Panjabi, Polish, Portuguese, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Vietnamese, Welsh, Yoruba, Yue Chinese
Supported target languages: Achinese, Akan, Amharic, Assamese, Awadhi, Ayacucho Quechua, Balinese, Bambara, Bashkir, Basque, Bemba (Zambia), Bengali, Bulgarian, Burmese, Catalan, Cebuano, Central Aymara, Chhattisgarhi, Crimean Tatar, Dutch, Dyula, Dzongkha, English, Ewe, Faroese, Fijian, Finnish, Fon, French, Ganda, German, Guarani, Gujarati, Haitian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Iloko, Indonesian, Javanese, Kabiyè, Kabyle, Kachin, Kannada, Kazakh, Khmer, Kikuyu, Kinyarwanda, Kirghiz, Korean, Lao, Magahi, Maithili, Malayalam, Marathi, Minangkabau, Modern Greek (1453-), Mossi, North Azerbaijani, Northern Kurdish, Nuer, Nyanja, Odia, Pangasinan, Panjabi, Papiamento, Polish, Portuguese, Romanian, Rundi, Russian, Samoan, Sango, Shan, Shona, Somali, South Azerbaijani, Southwestern Dinka, Spanish, Sundanese, Swahili (individual language), Swedish, Tagalog, Tajik, Tamasheq, Tamil, Tatar, Telugu, Thai, Tibetan, Tigrinya, Tok Pisin, Tsonga, Turkish, Turkmen, Uighur, Ukrainian, Urdu, Vietnamese, Waray (Philippines), Welsh, Yoruba
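The language flags shown in the usage section below take ISO 639-3 codes (e.g. cat for Catalan). As a quick illustration, here is a small hand-picked mapping for a few of the languages listed above; it is not part of the open-dubbing package itself:

```python
# ISO 639-3 codes for a handful of the supported languages above.
# Hand-picked subset for illustration only; NOT shipped with open-dubbing.
ISO_639_3 = {
    "English": "eng",
    "Catalan": "cat",
    "Spanish": "spa",
    "French": "fra",
    "German": "deu",
}

def to_code(language_name: str) -> str:
    """Return the ISO 639-3 code for a language name, e.g. 'Catalan' -> 'cat'."""
    try:
        return ISO_639_3[language_name]
    except KeyError:
        raise ValueError(f"No code known for {language_name!r}") from None
```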
Installation
Install dependencies
Linux:
sudo apt install ffmpeg
Mac OS:
brew install ffmpeg
If you are going to use Coqui-tts you also need to install espeak-ng:
sudo apt install espeak-ng
Mac OS:
brew install espeak-ng
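Before installing the package, you can verify that these system binaries are on your PATH. This is a generic sketch (not part of open-dubbing) using only the standard library:

```python
import shutil

def missing_binaries(names):
    """Return the subset of command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# ffmpeg is always required; espeak-ng only when using Coqui TTS.
required = ["ffmpeg", "espeak-ng"]
missing = missing_binaries(required)
if missing:
    print("Missing system dependencies:", ", ".join(missing))
```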
Install package:
pip install open_dubbing
Accept pyannote license
- Go to pyannote/segmentation-3.0 and accept the user conditions
- Go to pyannote/speaker-diarization-3.1 and accept the user conditions
- Go to hf.co/settings/tokens and create an access token
Usage
Quick start
open-dubbing --input_file video.mp4 --target_language=cat --hugging_face_token=TOKEN
Where TOKEN is the Hugging Face token that allows access to the models.
To get a list of available options:
open-dubbing --help
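For readers curious how a CLI like this handles the quick-start flags, here is a minimal argparse sketch. The flag names come from the example invocation above; the help texts and everything else are assumptions, not the package's actual parser:

```python
import argparse

# Minimal sketch of a parser accepting the quick-start flags shown above.
# This is illustrative only; open-dubbing's real parser may differ.
parser = argparse.ArgumentParser(prog="open-dubbing")
parser.add_argument("--input_file", required=True,
                    help="Video file to dub")
parser.add_argument("--target_language", required=True,
                    help="ISO 639-3 code, e.g. 'cat' for Catalan")
parser.add_argument("--hugging_face_token", required=True,
                    help="Token granting access to the gated pyannote models")

# argparse accepts both '--flag value' and '--flag=value' forms:
args = parser.parse_args(
    ["--input_file", "video.mp4",
     "--target_language=cat",
     "--hugging_face_token=TOKEN"]
)
```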
Libraries used
Core libraries used:
- demucs to separate vocals from the audio
- pyannote-audio to diarize speakers
- faster-whisper for speech to text
- NLLB-200 for machine translation
- TTS (Coqui) for speech synthesis
And very special thanks to the ariel project, from which we leveraged parts of their code base.
License
See license
How it works
The system follows these steps:
- Isolate the speech from background noise, music, and other non-speech elements in the audio.
- Segment the audio into fragments where there is voice and identify the speakers (speaker diarization).
- Identify the gender of the speakers.
- Transcribe the speech into text using OpenAI Whisper.
- Translate the text from source language (e.g. English) to target language (e.g. Catalan).
- Synthesize speech using a text-to-speech system, choosing voices that match the speaker's gender and adjusting the speed.
- The final dubbed video is then assembled, combining the synthetic audio with the original video footage, including any background sounds or music that were isolated earlier.
There are 6 different AI models applied during the dubbing process.
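The transcribe/translate/synthesize core of the steps above can be sketched as a small pipeline. All names here are hypothetical; in the real package these roles are filled by Whisper, NLLB-200, and the chosen TTS engine:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    """One diarized stretch of speech (hypothetical structure)."""
    start: float   # seconds
    end: float
    speaker: str   # diarization label, e.g. "SPEAKER_00"
    gender: str    # detected gender, used to pick the synthetic voice
    text: str = "" # holds the transcription, then the translation

def dub(segments: List[Segment],
        transcribe: Callable[[Segment], str],
        translate: Callable[[str], str],
        synthesize: Callable[[Segment], bytes]) -> List[bytes]:
    """Transcribe, translate, and re-voice each diarized segment."""
    clips = []
    for seg in segments:
        seg.text = translate(transcribe(seg))  # transcription then translation
        clips.append(synthesize(seg))          # gender-matched synthetic voice
    return clips
```

With dummy stage functions plugged in, the pipeline produces one audio clip per segment, ready to be mixed back over the isolated background track.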
Contact
Email address: Jordi Mas: jmas@softcatala.org
Hashes for open_dubbing-0.0.4-py3-none-any.whl:

Algorithm | Hash digest
---|---
SHA256 | 85da3d84b4924f4be0587fff73352a61e8944bca5c707289ec1fc0bd9c5a82b3
MD5 | 6256ca467b56c063d056ec02b6357eff
BLAKE2b-256 | 89aac57ba8cf200419c5e44b33a206d357a891f49e080b91557a495070376625