VoiceTut-TTS — Egyptian Arabic & code-switching text-to-speech (fine-tuned OmniVoice).

𓋹 VoiceTut-TTS

The best open-source text-to-speech model for Egyptian Arabic & code-switching

🎧 Listen: hear VoiceTut-TTS vs. the base OmniVoice on Egyptian Arabic & code-switching → Audio demos · 🚀 Try it live: HuggingFace Space

VoiceTut-TTS is an Egyptian-Arabic text-to-speech system fine-tuned from OmniVoice on ~380 hours of Egyptian podcast speech. It produces natural Egyptian speech with seamless Arabic ↔ English code-switching, ships 15 built-in studio voices, supports zero-shot voice cloning, and includes a robust Egyptian-Arabic text normalization pipeline plus true streaming for long text.

Why "VoiceTut"? Tut — after the boy-king Tutankhamun (توت عنخ آمون) — anchors the model in Egyptian identity, just as our companion ASR model QwenCleo-ASR is named after Cleopatra. Together they form an Egyptian speech stack: Cleo listens, Tut speaks. 🎙️🗣️

✨ Features

  • 🎯 Egyptian-first — fine-tuned specifically on Egyptian Arabic, not generic MSA.
  • 🔀 Code-switching — handles real Arabic + English mixed speech (عندي meeting بكرة).
  • 🗣️ 15 built-in voices — male & female studio speakers, each with style tags.
  • 🧬 Zero-shot cloning — clone any voice from a few seconds of reference audio.
  • 🔢 Robust normalization — numbers, dates, times, currencies, phones, emails, URLs, abbreviations + a diacritics override table and a custom lexicon.
  • True streaming — long text is split into sentences and yielded as audio chunks for low time-to-first-audio.
  • 📦 pip-installable: pip install voicetut-tts (plus OmniVoice from GitHub), or clone and run locally.

📊 Performance

Measured on a single NVIDIA H100 80GB, float16, num_step=32. Numbers are indicative; see examples/ to reproduce.

| Metric | Value |
| --- | --- |
| Real-time factor (RTF) | ~0.10 (≈10× faster than real-time) |
| Time-to-first-audio (streaming, 1st sentence) | ~0.4–0.7 s |
| Peak VRAM (inference, fp16) | ~6.5 GB |
| Sampling rate | 24 kHz |
| Speaker similarity (cloning, cosine) | 0.78 |
| Naturalness (internal MOS, 1–5) | 4.1 |

Both time-to-first-audio (TTFA) and RTF scale with num_step; drop to num_step=16 for faster output at slightly lower quality.
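As a worked illustration of the real-time factor reported above: RTF is the wall-clock synthesis time divided by the duration of the generated audio, so an RTF of about 0.10 means roughly 1 s of compute per 10 s of speech. The helper names below are illustrative only and are not part of the voicetut-tts package.

```python
def audio_seconds(num_samples: int, sample_rate: int = 24_000) -> float:
    """Duration of a mono waveform; 24 kHz is the output rate listed above."""
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds: float, audio_duration: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_duration <= 0:
        raise ValueError("audio_duration must be positive")
    return synthesis_seconds / audio_duration


# 240,000 samples at 24 kHz is 10 s of audio; generating it in 1 s gives RTF 0.1.
duration = audio_seconds(240_000)
rtf = real_time_factor(1.0, duration)
speedup = 1.0 / rtf  # how many times faster than real time
```

To measure your own RTF, time a synthesis call with time.perf_counter() and divide by the length of the resulting audio.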

🗣️ Sample Outputs

| Type | Text | Speaker |
| --- | --- | --- |
| Pure Egyptian | ازيك عامل ايه النهاردة؟ يا رب تكون كويس | Mohamed |
| Code-switching | عندي meeting الساعة 3:30 ومعايا ال presentation | Asmaa |
| Long / streaming | (multi-sentence paragraph, streamed) | Sayed |

🎧 Listen to all built-in voices in the web demo or the Colab notebook.

📦 Installation

Option A — PyPI

# 1. PyTorch matching your CUDA (see https://pytorch.org)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# 2. The OmniVoice backbone (not on PyPI, so install it from GitHub)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# 3. VoiceTut-TTS
pip install voicetut-tts
# optional: web UI deps
pip install "voicetut-tts[web]"

Option B — from source

git clone https://github.com/MohammedAly22/VoiceTuT-TTS.git
cd VoiceTuT-TTS
conda create -n voicetut python=3.10 -y && conda activate voicetut
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/k2-fsa/OmniVoice.git    # backbone
pip install -e ".[web,dev]"

🚀 Usage

Python API

from voicetut_tts import VoiceTutTTS

tts = VoiceTutTTS.from_pretrained("mohammedaly22/VoiceTut-TTS")

# 1) Built-in speaker
tts.synthesize("ازيك عامل ايه النهاردة؟", speaker="Mohamed", output="out.wav")

# 2) Zero-shot voice cloning
tts.synthesize("النهارده الجو حلو اوي",
               ref_audio="my_voice.wav", ref_text="ده الصوت بتاعي",
               output="clone.wav")

# 3) Generation parameters
tts.synthesize("عندي meeting الساعة 3:30",
               speaker="Asmaa", num_step=48, guidance_scale=2.5, speed=1.1,
               output="cs.wav")

True streaming (long text)

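import sounddevice as sd

# `tts` is the VoiceTutTTS instance loaded above; `long_paragraph` is any multi-sentence string
for sr, chunk in tts.stream(long_paragraph, speaker="Sayed"):
    sd.play(chunk, sr)   # play each sentence as soon as it's ready
    sd.wait()            # block until playback finishes before the next chunk

# or write a single concatenated file
tts.synthesize_long(long_paragraph, "long.wav", speaker="Sayed")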
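The sentence-by-sentence idea behind streaming can be sketched without the model. This is a minimal illustration, not the package's actual splitter: the regex, the function names, and the stand-in `fake_synth` are all assumptions made for the example.

```python
import re
from typing import Callable, Iterator, List, Sequence, Tuple

# Split on whitespace that follows sentence-ending punctuation:
# Latin . ! ? plus the Arabic question mark.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?؟])\s+")


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; it does not special-case abbreviations."""
    return [s.strip() for s in _SENTENCE_BREAK.split(text) if s.strip()]


def stream_sentences(
    text: str,
    synth: Callable[[str], Sequence[float]],
    sample_rate: int = 24_000,
) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (sample_rate, audio) per sentence so playback can start early."""
    for sentence in split_sentences(text):
        yield sample_rate, synth(sentence)


def fake_synth(sentence: str) -> List[float]:
    """Hypothetical stand-in for a real TTS call: silence, one sample per character."""
    return [0.0] * len(sentence)


chunks = list(stream_sentences("ازيك؟ عامل ايه. تمام!", fake_synth))
```

Because the generator yields after the first sentence, time-to-first-audio depends on the length of that sentence rather than on the whole paragraph.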

Text normalization & custom lexicon

from voicetut_tts import ArabicNormalizer

norm = ArabicNormalizer()
norm("عندي 250 جنيه والميعاد 3:30 يوم 14/3/2024")
# -> "عندي ميتين وخمسين جنيه والميعاد تلاتة و نص يوم أربعتاشر مارس ألفين وأربعة وعشرين"

What gets normalized

| Input | Output (spoken) |
| --- | --- |
| 3:30 / 7:45 / 9:50 | تلاتة و نص / تمانية الا ربع / عشرة الا عشرة (colloquial Egyptian clock) |
| 01147450629 | زيرو حداشر سبعة وأربعين خمسة وأربعين ستة تسعة وعشرين (Egyptian prefix + 2-digit groups) |
| Ahmed / Mohamed / Mona | أحمد / محمد / منى (English→Arabic name map) |
| 250 جنيه / 75$ / 25% | ميتين وخمسين جنيه / خمسة وسبعين دولار / خمسة وعشرين في المية |
| 14/3/2024 | أربعتاشر مارس ألفين وأربعة وعشرين |
| a.b@gmail.com | a dot b at gmail dot com |
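Two of the rules in the table above can be sketched in a few lines: the colloquial clock (7:45 is spoken as "eight minus a quarter") and the phone grouping (leading digit, then 2-digit groups). This is an illustrative approximation, not the package's normalizer; the word tables, the limited set of supported minutes, and the function names are assumptions.

```python
from typing import List

# Egyptian colloquial hour words (1-12). Illustrative table, not the package's data.
HOURS = {
    1: "واحدة", 2: "اتنين", 3: "تلاتة", 4: "أربعة", 5: "خمسة", 6: "ستة",
    7: "سبعة", 8: "تمانية", 9: "تسعة", 10: "عشرة", 11: "حداشر", 12: "اتناشر",
}

# minute -> (hours to add, spoken suffix). Only a few common minute values are covered.
MINUTES = {
    0: (0, ""),
    15: (0, " و ربع"),     # "and a quarter"
    30: (0, " و نص"),      # "and a half"
    45: (1, " الا ربع"),   # next hour "minus a quarter"
    50: (1, " الا عشرة"),  # next hour "minus ten"
}


def speak_clock(hhmm: str) -> str:
    """Render 'H:MM' on a 12-hour colloquial clock."""
    hour_text, minute_text = hhmm.split(":")
    hour, minute = int(hour_text), int(minute_text)
    if minute not in MINUTES:
        raise ValueError(f"minute {minute} is not covered by this sketch")
    offset, suffix = MINUTES[minute]
    spoken_hour = (hour + offset - 1) % 12 + 1  # wrap 12 -> 1, map 0 and 24 -> 12
    return HOURS[spoken_hour] + suffix


def group_phone_digits(number: str) -> List[str]:
    """Split a phone number into its leading digit followed by 2-digit groups."""
    if not number.isdigit():
        raise ValueError("expected digits only")
    return [number[0]] + [number[i:i + 2] for i in range(1, len(number), 2)]
```

With these tables, speak_clock("3:30"), speak_clock("7:45") and speak_clock("9:50") reproduce the three clock outputs in the table, and group_phone_digits("01147450629") returns the groups 0, 11, 47, 45, 06, 29, which are then read as numbers.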

Override tables & runtime dictionaries

# diacritized-form overrides (win over the CSV table)
tts.add_lexicon({"تيوت": "تُوت", "نايل": "نَايِل"})

# English-name -> Arabic for correct pronunciation
tts.add_names({"Ziad": "زياد", "Kareem": "كريم"})
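The precedence rule mentioned above (runtime overrides win over the shipped CSV table) can be pictured as a layered lookup. The sketch below is an assumption about how such a lookup could work, not the package's implementation; `apply_overrides` and the sample table contents are hypothetical.

```python
from collections import ChainMap
from typing import Mapping


def apply_overrides(
    text: str,
    runtime: Mapping[str, str],
    table: Mapping[str, str],
) -> str:
    """Replace whole whitespace-separated tokens; runtime entries shadow table entries."""
    lookup = ChainMap(dict(runtime), dict(table))  # first mapping wins on key clashes
    return " ".join(lookup.get(token, token) for token in text.split())


# Hypothetical data: the shipped table has its own entry for "نايل",
# but the runtime override takes priority.
csv_table = {"نايل": "نايل (csv)", "تيوت": "تيوت (csv)"}
runtime_overrides = {"نايل": "نَايِل"}

result = apply_overrides("نهر نايل تيوت", runtime_overrides, csv_table)
```

Tokens found in neither mapping pass through unchanged, so an empty override dictionary leaves the table's behaviour untouched.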

Editable data tables ship with the package.

CLI

voicetut --list-speakers
voicetut --text "ازيك عامل ايه؟" --speaker Mohamed --output out.wav
voicetut --text "نص طويل..." --speaker Sayed --stream --output long.wav
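For batch jobs, the CLI can be driven from Python. The sketch below uses only the flags shown above (--text, --speaker, --output, --stream); the helper names are hypothetical, and it assumes the `voicetut` executable is on PATH.

```python
import subprocess
from typing import Iterable, List, Tuple


def build_voicetut_command(
    text: str, speaker: str, output: str, stream: bool = False
) -> List[str]:
    """Build an argv list; passing a list avoids shell-quoting issues with Arabic text."""
    cmd = ["voicetut", "--text", text, "--speaker", speaker, "--output", output]
    if stream:
        cmd.append("--stream")
    return cmd


def run_batch(jobs: Iterable[Tuple[str, str, str]]) -> None:
    """Run one CLI call per (text, speaker, output) job; raises if any call fails."""
    for text, speaker, output in jobs:
        subprocess.run(build_voicetut_command(text, speaker, output), check=True)
```

Calling run_batch([("ازيك عامل ايه؟", "Mohamed", "out.wav")]) would issue the same command as the second CLI line above.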

🌐 Serving

A custom-styled Gradio web UI (black/white + blue theme, speaker dropdown with gender + style tags, reference preview, voice cloning with mic recording, AR/EN language switch, generation params, examples):

pip install "voicetut-tts[web]"
python app.py                                   # default HF checkpoint, port 7860
OMNICLEO_CKPT=exp/omnivoice_egy/checkpoint-8000 python app.py   # local checkpoint
OMNICLEO_SHARE=1 python app.py                  # public share link
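The two environment variables above suggest a simple configuration pattern. The sketch below shows one way such variables could be resolved; it is an assumption about the behaviour, not a copy of app.py, and `resolve_app_config` is a hypothetical name. The default checkpoint ID is the one used in the Python API example.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_CKPT = "mohammedaly22/VoiceTut-TTS"  # HF model ID from the Python API example


def resolve_app_config(env: Optional[Mapping[str, str]] = None) -> Tuple[str, bool]:
    """Return (checkpoint, share) from environment variables, with defaults."""
    env = os.environ if env is None else env
    checkpoint = env.get("OMNICLEO_CKPT") or DEFAULT_CKPT  # empty string falls back too
    share = env.get("OMNICLEO_SHARE", "0") == "1"          # assumed: only "1" enables sharing
    return checkpoint, share
```

Passing a plain dict instead of os.environ keeps the function easy to test, for example resolve_app_config({"OMNICLEO_SHARE": "1"}).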

📓 Examples (Colab)

| Notebook | Description |
| --- | --- |
| 01_quickstart.ipynb | Install, load, synthesize with a built-in voice |
| 02_voice_cloning.ipynb | Zero-shot cloning + normalization & lexicon |
| 03_web_ui.ipynb | Launch the Gradio web UI from Colab (public link) |

📜 License & Citation

Released under the Apache-2.0 license.

@software{voicetut_tts_2026,
  author  = {Mohammed Aly},
  title   = {VoiceTut-TTS: Egyptian Arabic & Code-Switching Text-to-Speech},
  year    = {2026},
  url     = {https://github.com/MohammedAly22/VoiceTuT-TTS},
  note    = {Fine-tuned from OmniVoice}
}
