VoiceTut-TTS — Egyptian Arabic & code-switching text-to-speech (fine-tuned OmniVoice).

𓋹 VoiceTut-TTS

The best open-source text-to-speech model for Egyptian Arabic & code-switching

🎧 Listen: hear VoiceTut-TTS vs. the base OmniVoice on Egyptian Arabic & code-switching → Audio demos · 🚀 Try it live: HuggingFace Space

VoiceTut-TTS is an Egyptian-Arabic text-to-speech system fine-tuned from OmniVoice on ~380 hours of Egyptian podcast speech. It produces natural Egyptian speech with seamless Arabic ↔ English code-switching, ships 15 built-in studio voices, supports zero-shot voice cloning, and includes a robust Egyptian-Arabic text normalization pipeline plus true streaming for long text.

Why "VoiceTut"? Tut — after the boy-king Tutankhamun (توت عنخ آمون) — anchors the model in Egyptian identity, just as our companion ASR model QwenCleo-ASR is named after Cleopatra. Together they form an Egyptian speech stack: Cleo listens, Tut speaks. 🎙️🗣️

✨ Features

  • 🎯 Egyptian-first — fine-tuned specifically on Egyptian Arabic, not generic MSA.
  • 🔀 Code-switching — handles real Arabic + English mixed speech (عندي meeting بكرة).
  • 🗣️ 15 built-in voices — male & female studio speakers, each with style tags.
  • 🧬 Zero-shot cloning — clone any voice from a few seconds of reference audio.
  • 🔢 Robust normalization — numbers, dates, times, currencies, phones, emails, URLs, abbreviations + a diacritics override table and a custom lexicon.
  • True streaming — long text is split into sentences and yielded as audio chunks for low time-to-first-audio.
  • 📦 pip-installable: pip install voicetut-tts (plus OmniVoice from GitHub), or clone and run locally.

📊 Performance

Measured on a single NVIDIA T4 (Colab), float16, num_step=32. Reproduce with examples/04_evaluation.ipynb.

| Metric | Value |
| --- | --- |
| Real-time factor (RTF, mean) | 1.13× |
| RTF (best) | 0.49× |
| Time-to-first-audio (streaming) | 1.68 s |
| Peak VRAM (fp16) | 2.93 GB |
| WER, Egyptian Arabic | 0.40 |
| WER, English | 0.07 |
| Speaker similarity (cloning, cosine) | 0.83 |
| Naturalness (UTMOS, 1–5) | 3.47 |
| Sampling rate | 24 kHz |

These figures come from a T4; faster GPUs such as an A100 or H100 should give lower RTF and time-to-first-audio (TTFA). Both also scale with num_step: drop to num_step=16 for faster, slightly lower-quality output.
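For readers unfamiliar with the metrics, here is a minimal sketch of how RTF and WER are conventionally computed. It assumes the usual definitions (RTF as synthesis time divided by audio duration, WER as word-level edit distance divided by reference length); the evaluation notebook remains the reference for how the numbers above were actually produced, and the function names here are illustrative.

```python
def real_time_factor(synth_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synth_seconds / audio_seconds


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # 11.3 s of compute for 10 s of audio is slower than real time.
    print(round(real_time_factor(11.3, 10.0), 2))
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition an RTF above 1 means synthesis takes longer than the audio it produces, so the mean of 1.13× on a T4 is slightly slower than real time, while the best case of 0.49× is roughly twice as fast as playback.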

🗣️ Sample Outputs

🎧 Listen to all built-in voices and the VoiceTut vs. base OmniVoice comparisons: pure Egyptian, code-switching, English, normalization, per-speaker, customer-service, and zero-shot cloning samples. Or try them live in the HuggingFace Space.

📦 Installation

Option A — PyPI

# 1. PyTorch matching your CUDA (see https://pytorch.org)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# 2. The OmniVoice backbone (not on PyPI, so install it from GitHub)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# 3. VoiceTut-TTS
pip install voicetut-tts
# optional: web UI deps
pip install "voicetut-tts[web]"

Option B — from source

git clone https://github.com/MohammedAly22/VoiceTuT-TTS.git
cd VoiceTuT-TTS
conda create -n voicetut python=3.10 -y && conda activate voicetut
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/k2-fsa/OmniVoice.git    # backbone
pip install -e ".[web,dev]"
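After either option, a quick import check can confirm that the three pieces are visible to the interpreter. This is a small sketch using only the standard library; `voicetut_tts` is the import name used in the usage examples, while `omnivoice` is an assumed module name for the backbone and may differ.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "omnivoice" is an assumption about the backbone's import name.
    required = ["torch", "omnivoice", "voicetut_tts"]
    missing = missing_modules(required)
    print("all set" if not missing else f"missing: {', '.join(missing)}")
```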

🚀 Usage

Python API

from voicetut_tts import VoiceTutTTS

tts = VoiceTutTTS.from_pretrained("mohammedaly22/VoiceTut-TTS")

# 1) Built-in speaker
tts.synthesize("ازيك عامل ايه النهاردة؟", speaker="Mohamed", output="out.wav")

# 2) Zero-shot voice cloning
tts.synthesize("النهارده الجو حلو اوي",
               ref_audio="my_voice.wav", ref_text="ده الصوت بتاعي",
               output="clone.wav")

# 3) Generation parameters
tts.synthesize("عندي meeting الساعة 3:30",
               speaker="Asmaa", num_step=48, guidance_scale=2.5, speed=1.1,
               output="cs.wav")

True streaming (long text)

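import sounddevice as sd

# `tts` is the VoiceTutTTS instance loaded above; `long_paragraph` is your input text.
for sr, chunk in tts.stream(long_paragraph, speaker="Sayed"):
    sd.play(chunk, sr)   # play each sentence as soon as it is ready
    sd.wait()            # block until playback finishes before the next chunk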

# or write a single concatenated file
tts.synthesize_long(long_paragraph, "long.wav", speaker="Sayed")
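To illustrate why sentence-level streaming lowers time-to-first-audio, here is a minimal sketch of the general idea: split the text at sentence boundaries and yield each synthesized chunk lazily. It is not the package's actual splitter; `split_sentences`, `stream_sentences`, and the `synth_one` callable are hypothetical names, and the punctuation set is an assumption.

```python
import re

# Sentence enders: Latin . ! ? plus the Arabic question mark and the ellipsis.
# Splitting only where whitespace follows keeps "3:30" and "3.5" intact.
_SENTENCE_END = re.compile(r"(?<=[.!?؟…])\s+")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def stream_sentences(text, synth_one):
    """Yield one (sample_rate, audio) pair per sentence.

    `synth_one` is any callable mapping a sentence to (sample_rate, audio).
    Because this is a generator, the first chunk is available before the
    remaining sentences have been synthesized.
    """
    for sentence in split_sentences(text):
        yield synth_one(sentence)


if __name__ == "__main__":
    # Stand-in synthesizer: returns silence proportional to the text length.
    fake_synth = lambda s: (24000, [0.0] * len(s))
    for sr, chunk in stream_sentences("ازيك؟ عندي meeting بكرة. See you!", fake_synth):
        print(sr, len(chunk))
```

The caller starts playback after the first sentence is ready rather than waiting for the whole paragraph, which is what the streaming TTFA figure measures.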

Text normalization & custom lexicon

from voicetut_tts import ArabicNormalizer

norm = ArabicNormalizer()
norm("عندي 250 جنيه والميعاد 3:30 يوم 14/3/2024")
# -> "عندي ميتين وخمسين جنيه والميعاد تلاتة و نص يوم أربعتاشر مارس ألفين وأربعة وعشرين"

What gets normalized

| Input | Output (spoken) |
| --- | --- |
| 3:30 / 7:45 / 9:50 | تلاتة و نص / تمانية الا ربع / عشرة الا عشرة (colloquial Egyptian clock) |
| 01147450629 | زيرو حداشر سبعة وأربعين خمسة وأربعين ستة تسعة وعشرين (Egyptian prefix + 2-digit groups) |
| Ahmed / Mohamed / Mona | أحمد / محمد / منى (English→Arabic name map) |
| 250 جنيه / 75$ / 25% | ميتين وخمسين جنيه / خمسة وسبعين دولار / خمسة وعشرين في المية |
| 14/3/2024 | أربعتاشر مارس ألفين وأربعة وعشرين |
| a.b@gmail.com | a dot b at gmail dot com |
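The rules behind a few of these rows can be sketched in a handful of lines. The following is an illustrative reconstruction from the examples in the table, not the package's ArabicNormalizer; the function names are hypothetical, the clock helper covers only the three minute values shown, and hour spellings not present in the table are assumptions.

```python
# Colloquial hour names on a 12-hour clock. Spellings that do not appear
# in the table above are assumptions.
HOURS = {
    1: "واحدة", 2: "اتنين", 3: "تلاتة", 4: "أربعة", 5: "خمسة", 6: "ستة",
    7: "سبعة", 8: "تمانية", 9: "تسعة", 10: "عشرة", 11: "حداشر", 12: "اتناشر",
}


def speak_clock(hhmm):
    """Spoken form for H:30, H:45 and H:50; None for other minute values."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    this_hour = HOURS[(hour - 1) % 12 + 1]
    next_hour = HOURS[hour % 12 + 1]
    if minute == 30:
        return f"{this_hour} و نص"        # "H and a half"
    if minute == 45:
        return f"{next_hour} الا ربع"     # "next hour minus a quarter"
    if minute == 50:
        return f"{next_hour} الا عشرة"    # "next hour minus ten"
    return None


def group_phone_digits(number):
    """Split a phone number into the leading digit plus 2-digit groups."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return [digits[:1]] + [digits[i:i + 2] for i in range(1, len(digits), 2)]


def speak_email(address):
    """Spell out the separators of an email address."""
    return address.replace("@", " at ").replace(".", " dot ")


if __name__ == "__main__":
    print(speak_clock("7:45"))
    print(group_phone_digits("01147450629"))
    print(speak_email("a.b@gmail.com"))
```

Each digit group from `group_phone_digits` would then be read as a number, which matches the grouping visible in the phone-number row (0, 11, 47, 45, 06, 29).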

Override tables & runtime dictionaries

# diacritized-form overrides (win over the CSV table)
tts.add_lexicon({"تيوت": "تُوت", "نايل": "نَايِل"})

# English-name -> Arabic for correct pronunciation
tts.add_names({"Ziad": "زياد", "Kareem": "كريم"})
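The precedence described in the comment above (runtime entries win over the shipped CSV table) amounts to a dictionary merge in which later entries overwrite earlier ones. A minimal sketch, assuming both sources are plain word-to-replacement maps; `build_lexicon` and `apply_lexicon` are hypothetical helpers, not part of the package API, and the CSV entry below is a placeholder.

```python
def build_lexicon(csv_table, runtime_overrides):
    """Merge two word -> replacement maps; runtime entries take precedence."""
    merged = dict(csv_table)
    merged.update(runtime_overrides)
    return merged


def apply_lexicon(text, lexicon):
    """Replace whitespace-separated tokens that have a lexicon entry."""
    return " ".join(lexicon.get(token, token) for token in text.split())


if __name__ == "__main__":
    csv_table = {"نايل": "نايل"}       # placeholder entry standing in for the CSV
    runtime = {"نايل": "نَايِل"}        # override added at runtime
    lexicon = build_lexicon(csv_table, runtime)
    print(apply_lexicon("قناة نايل", lexicon))
```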

Editable data tables ship with the package:

CLI

voicetut --list-speakers
voicetut --text "ازيك عامل ايه؟" --speaker Mohamed --output out.wav
voicetut --text "نص طويل..." --speaker Sayed --stream --output long.wav
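For batch jobs, the CLI can be driven from a script. The sketch below builds the argument list using only the flags shown above and calls the command once per line of text; `voicetut_command` and `synthesize_batch` are hypothetical helper names, and the output file naming is an arbitrary choice.

```python
import subprocess
from pathlib import Path


def voicetut_command(text, speaker, output, stream=False):
    """Build the argv list for one `voicetut` invocation."""
    cmd = ["voicetut", "--text", text, "--speaker", speaker, "--output", str(output)]
    if stream:
        cmd.append("--stream")
    return cmd


def synthesize_batch(lines, speaker, out_dir):
    """Run the CLI once per text line, writing line_000.wav, line_001.wav, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, text in enumerate(lines):
        target = out_dir / f"line_{index:03d}.wav"
        # check=True raises CalledProcessError if the CLI exits with an error.
        subprocess.run(voicetut_command(text, speaker, target), check=True)
```

Passing the arguments as a list avoids shell quoting problems with Arabic text and spaces.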

🌐 Serving

A custom-styled Gradio web UI (black/white + blue theme, speaker dropdown with gender + style tags, reference preview, voice cloning with mic recording, AR/EN language switch, generation params, examples):

pip install "voicetut-tts[web]"
python app.py                                   # default HF checkpoint, port 7860
OMNICLEO_CKPT=exp/omnivoice_egy/checkpoint-8000 python app.py   # local checkpoint
OMNICLEO_SHARE=1 python app.py                  # public share link
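The two environment variables follow a common pattern: read a value if it is set, otherwise fall back to a default. As a rough sketch of that pattern (not the actual contents of app.py), assuming the default checkpoint is the model id used in the Python API example and that sharing is enabled only when the variable equals "1":

```python
import os


def read_app_settings(env=None):
    """Resolve UI settings from environment variables with defaults."""
    env = os.environ if env is None else env
    return {
        # Local checkpoint path if given, else the Hugging Face model id.
        "checkpoint": env.get("OMNICLEO_CKPT", "mohammedaly22/VoiceTut-TTS"),
        # Assumption: only the literal value "1" turns on the public link.
        "share": env.get("OMNICLEO_SHARE", "0") == "1",
        "port": 7860,
    }


if __name__ == "__main__":
    print(read_app_settings())
```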

📓 Examples (Colab)

| Notebook | Description |
| --- | --- |
| 01_quickstart.ipynb | Install, load, synthesize with a built-in voice |
| 02_voice_cloning.ipynb | Zero-shot cloning + normalization & lexicon |
| 03_web_ui.ipynb | Launch the Gradio web UI from Colab (public link) |
| 04_evaluation.ipynb | Measure RTF, TTFA, VRAM, WER, similarity, UTMOS |

🔗 Links

📜 License & Citation

Released under the Apache-2.0 license.

@software{voicetut_tts_2026,
  author  = {Mohammed Aly},
  title   = {VoiceTut-TTS: Egyptian Arabic \& Code-Switching Text-to-Speech},
  year    = {2026},
  url     = {https://github.com/MohammedAly22/VoiceTuT-TTS},
  note    = {Fine-tuned from OmniVoice}
}
