VoiceTut-TTS — Egyptian Arabic & code-switching text-to-speech (fine-tuned OmniVoice).
🎧 Listen: hear VoiceTut-TTS vs. the base OmniVoice on Egyptian Arabic & code-switching → Audio demos · 🚀 Try it live: HuggingFace Space
VoiceTut-TTS is an Egyptian-Arabic text-to-speech system fine-tuned from OmniVoice on ~380 hours of Egyptian podcast speech. It produces natural Egyptian speech with seamless Arabic ↔ English code-switching, ships 15 built-in studio voices, supports zero-shot voice cloning, and includes a robust Egyptian-Arabic text normalization pipeline plus true streaming for long text.
Why "VoiceTut"? Tut — after the boy-king Tutankhamun (توت عنخ آمون) — anchors the model in Egyptian identity, just as our companion ASR model QwenCleo-ASR is named after Cleopatra. Together they form an Egyptian speech stack: Cleo listens, Tut speaks. 🎙️🗣️
✨ Features
- 🎯 Egyptian-first — fine-tuned specifically on Egyptian Arabic, not generic MSA.
- 🔀 Code-switching — handles real Arabic + English mixed speech (عندي meeting بكرة).
- 🗣️ 15 built-in voices — male & female studio speakers, each with style tags.
- 🧬 Zero-shot cloning — clone any voice from a few seconds of reference audio.
- 🔢 Robust normalization — numbers, dates, times, currencies, phones, emails, URLs, abbreviations + a diacritics override table and a custom lexicon.
- ⚡ True streaming — long text is split into sentences and yielded as audio chunks for low time-to-first-audio.
- 📦 pip-installable — pip install voicetut-tts (+ OmniVoice from GitHub), or clone and run locally.
📊 Performance
Measured on a single NVIDIA T4 (Colab), float16, num_step=32. Reproduce with examples/04_evaluation.ipynb.
| Metric | Value |
|---|---|
| Real-time factor (RTF, mean) | 1.13× |
| RTF (best) | 0.49× |
| Time-to-first-audio (streaming) | 1.68 s |
| Peak VRAM (fp16) | 2.93 GB |
| WER — Egyptian Arabic | 0.40 |
| WER — English | 0.07 |
| Speaker similarity (cloning, cosine) | 0.83 |
| Naturalness (UTMOS, 1–5) | 3.47 |
| Sampling rate | 24 kHz |
These figures are from a T4; expect markedly lower RTF and TTFA on an A100 or H100. Both RTF and TTFA scale with num_step: drop to num_step=16 for faster, slightly lower-quality output.
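For readers who want to sanity-check the table, the two ratio metrics can be computed in a few lines. The sketch below assumes the common definitions: RTF as synthesis wall-clock time divided by the duration of the generated audio (so lower is faster), and WER as word-level edit distance divided by reference length. The function names are illustrative and are not part of voicetut_tts; examples/04_evaluation.ipynb remains the reference for how the reported numbers were produced.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = 24_000) -> float:
    """Synthesis wall-clock time divided by the duration of the audio produced."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("need a positive sample count and sample rate")
    return synthesis_seconds / (num_samples / sample_rate)


def measure_rtf(synthesize: Callable[[str], Sequence[float]], text: str,
                sample_rate: int = 24_000) -> float:
    """Time one call to a synthesis callback (hypothetical) and return its RTF."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition an RTF of 1.0 means two seconds of audio took two seconds to generate, and a WER of 0.40 corresponds to two word errors per five reference words.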
🗣️ Sample Outputs
🎧 Listen to all built-in voices and VoiceTut vs. base OmniVoice comparisons: pure Egyptian, code-switching, English, normalization, per-speaker, customer-service, and zero-shot cloning samples. Or try them live in the HuggingFace Space.
📦 Installation
Option A — PyPI
# 1. PyTorch matching your CUDA (see https://pytorch.org)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# 2. The OmniVoice backbone (not on PyPI, so install it from GitHub)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# 3. VoiceTut-TTS
pip install voicetut-tts
# optional: web UI deps
pip install "voicetut-tts[web]"
Option B — from source
git clone https://github.com/MohammedAly22/VoiceTuT-TTS.git
cd VoiceTuT-TTS
conda create -n voicetut python=3.10 -y && conda activate voicetut
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/k2-fsa/OmniVoice.git # backbone
pip install -e ".[web,dev]"
🚀 Usage
Python API
from voicetut_tts import VoiceTutTTS
tts = VoiceTutTTS.from_pretrained("mohammedaly22/VoiceTut-TTS")
# 1) Built-in speaker
tts.synthesize("ازيك عامل ايه النهاردة؟", speaker="Mohamed", output="out.wav")
# 2) Zero-shot voice cloning
tts.synthesize("النهارده الجو حلو اوي",
               ref_audio="my_voice.wav", ref_text="ده الصوت بتاعي",
               output="clone.wav")
# 3) Generation parameters
tts.synthesize("عندي meeting الساعة 3:30",
               speaker="Asmaa", num_step=48, guidance_scale=2.5, speed=1.1,
               output="cs.wav")
True streaming (long text)
import sounddevice as sd
for sr, chunk in tts.stream(long_paragraph, speaker="Sayed"):
    sd.play(chunk, sr)  # play each sentence as soon as it is ready
    sd.wait()           # block until this chunk has finished playing
# or write a single concatenated file
tts.synthesize_long(long_paragraph, "long.wav", speaker="Sayed")
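The streaming idea described in this README (split long text into sentences, then yield audio one sentence at a time) can be sketched independently of the model. This is a minimal illustration, not the package's implementation: split_sentences, stream_sentences, and the synthesize_one callback are hypothetical names, and the real splitter may follow different rules.

```python
import re
from typing import Callable, Iterator, List, Sequence, Tuple

# Split on whitespace that follows ., !, ? or the Arabic question mark (U+061F).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?\u061F])\s+")


def split_sentences(text: str) -> List[str]:
    """Return non-empty sentences, keeping their closing punctuation."""
    return [part.strip() for part in _SENTENCE_BOUNDARY.split(text) if part.strip()]


def stream_sentences(
    text: str,
    synthesize_one: Callable[[str], Sequence[float]],
    sample_rate: int = 24_000,
) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (sample_rate, samples) per sentence so playback can start early."""
    for sentence in split_sentences(text):
        # Each sentence is synthesized only when the consumer asks for it.
        yield sample_rate, synthesize_one(sentence)
```

Because the generator is lazy, time-to-first-audio depends on the first sentence only, not on the length of the whole paragraph. A dot inside a token such as a.b@gmail.com is not followed by whitespace, so it does not trigger a split.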
Text normalization & custom lexicon
from voicetut_tts import ArabicNormalizer
norm = ArabicNormalizer()
norm("عندي 250 جنيه والميعاد 3:30 يوم 14/3/2024")
# -> "عندي ميتين وخمسين جنيه والميعاد تلاتة و نص يوم أربعتاشر مارس ألفين وأربعة وعشرين"
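The colloquial clock reading in the example (3:30 spoken as "three and a half") follows a pattern that is easy to sketch without any Arabic number words. The function below is an assumption-laden illustration, not ArabicNormalizer's code: colloquial_clock is a hypothetical name, and the cut-over at 30 minutes is a guess, so the package may treat times such as :35 differently.

```python
from typing import Tuple


def colloquial_clock(hour: int, minute: int) -> Tuple[int, str, int]:
    """Return (spoken_hour, relation, minutes) on a 12-hour clock.

    relation is "past" (hour and N minutes) or "to" (next hour minus N minutes).
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    hour12 = hour % 12 or 12           # 0 and 12 are both spoken as twelve
    if minute <= 30:                    # assumed threshold, see lead-in
        return hour12, "past", minute
    next_hour = hour12 % 12 + 1         # twelve rolls over to one
    return next_hour, "to", 60 - minute
```

With this rule 3:30 maps to (3, "past", 30), while 7:45 maps to (8, "to", 15), i.e. "eight minus a quarter". A real normalizer would then render the tuple with Egyptian number words and fraction names.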
What gets normalized
| Input | Output (spoken) |
|---|---|
| 3:30 / 7:45 / 9:50 | تلاتة و نص / تمانية الا ربع / عشرة الا عشرة (colloquial Egyptian clock) |
| 01147450629 | زيرو حداشر سبعة وأربعين خمسة وأربعين ستة تسعة وعشرين (Egyptian prefix + 2-digit groups) |
| Ahmed / Mohamed / Mona | أحمد / محمد / منى (English→Arabic name map) |
| 250 جنيه / 75$ / 25% | ميتين وخمسين جنيه / خمسة وسبعين دولار / خمسة وعشرين في المية |
| 14/3/2024 | أربعتاشر مارس ألفين وأربعة وعشرين |
| a.b@gmail.com | a dot b at gmail dot com |
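The phone-number row suggests a grouping rule: the leading zero is read alone, the next two digits form the operator prefix, and the remaining eight digits are read in pairs. The sketch below reproduces only that grouping step, inferred from the single example in the table; group_egyptian_mobile is a hypothetical helper, and the package's actual rules (including how it verbalizes each group) are not shown in this README.

```python
from typing import List


def group_egyptian_mobile(number: str) -> List[str]:
    """Split an 11-digit Egyptian mobile number into spoken groups."""
    digits = "".join(ch for ch in number if ch.isdigit())  # drop spaces, dashes
    if len(digits) != 11 or not digits.startswith("01"):
        raise ValueError("expected an 11-digit number starting with 01")
    groups = [digits[0], digits[1:3]]                       # "0", then e.g. "11"
    groups += [digits[i:i + 2] for i in range(3, 11, 2)]    # four 2-digit groups
    return groups
```

For 01147450629 this yields 0, 11, 47, 45, 06, 29, which lines up with the six spoken groups in the table.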
Override tables & runtime dictionaries
# diacritized-form overrides (win over the CSV table)
tts.add_lexicon({"تيوت": "تُوت", "نايل": "نَايِل"})
# English-name -> Arabic for correct pronunciation
tts.add_names({"Ziad": "زياد", "Kareem": "كريم"})
Editable data tables ship with the package:
- data/diacritics.csv: columns word,diacritized (Arabic word → diacritized form)
- data/names_en_ar.csv: columns english,arabic (name transliteration)
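The precedence described here (runtime entries win over the shipped CSV table) amounts to a dictionary merge. The following is a minimal sketch under that assumption; load_table and apply_lexicon are hypothetical helpers, and whitespace tokenization is a simplification that the real pipeline may not share.

```python
import csv
import io
from typing import Dict


def load_table(csv_text: str, key_col: str, value_col: str) -> Dict[str, str]:
    """Parse a two-column CSV with a header row into a lookup dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_col]: row[value_col] for row in reader}


def apply_lexicon(text: str, csv_table: Dict[str, str],
                  runtime_overrides: Dict[str, str]) -> str:
    """Replace whole tokens; runtime overrides take priority over the CSV."""
    merged = {**csv_table, **runtime_overrides}  # later dict wins on conflicts
    # Splitting on whitespace also collapses repeated spaces in the output.
    return " ".join(merged.get(token, token) for token in text.split())
```

A table loaded with load_table(text, "word", "diacritized") would play the role of data/diacritics.csv, and the dictionary passed to tts.add_lexicon would play the role of runtime_overrides.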
CLI
voicetut --list-speakers
voicetut --text "ازيك عامل ايه؟" --speaker Mohamed --output out.wav
voicetut --text "نص طويل..." --speaker Sayed --stream --output long.wav
🌐 Serving
A custom-styled Gradio web UI (black/white + blue theme, speaker dropdown with gender + style tags, reference preview, voice cloning with mic recording, AR/EN language switch, generation params, examples):
pip install "voicetut-tts[web]"
python app.py # default HF checkpoint, port 7860
OMNICLEO_CKPT=exp/omnivoice_egy/checkpoint-8000 python app.py # local checkpoint
OMNICLEO_SHARE=1 python app.py # public share link
📓 Examples (Colab)
| Notebook | Description |
|---|---|
| 01_quickstart.ipynb | Install, load, synthesize with a built-in voice |
| 02_voice_cloning.ipynb | Zero-shot cloning + normalization & lexicon |
| 03_web_ui.ipynb | Launch the Gradio web UI from Colab (public link) |
| 04_evaluation.ipynb | Measure RTF, TTFA, VRAM, WER, similarity, UTMOS |
🔗 Links
- 🤗 Model: https://huggingface.co/mohammedaly22/VoiceTut-TTS
- 📦 PyPI: https://pypi.org/project/voicetut-tts/
- 🧠 Base model: https://github.com/k2-fsa/OmniVoice
- 🎧 Companion ASR: https://github.com/MohammedAly22/qwencleo-asr
📜 License & Citation
Released under the Apache-2.0 license.
@software{voicetut_tts_2026,
  author = {Mohammed Aly},
  title = {VoiceTut-TTS: Egyptian Arabic \& Code-Switching Text-to-Speech},
  year = {2026},
  url = {https://github.com/MohammedAly22/VoiceTuT-TTS},
  note = {Fine-tuned from OmniVoice}
}