VoiceTut-TTS — Egyptian Arabic & code-switching text-to-speech (fine-tuned OmniVoice).

𓋹 VoiceTut-TTS

The best open-source text-to-speech model for Egyptian Arabic & code-switching

🎧 Listen: hear VoiceTut-TTS vs. the base OmniVoice on Egyptian Arabic & code-switching → Audio demos · 🚀 Try it live: HuggingFace Space

VoiceTut-TTS is an Egyptian-Arabic text-to-speech system fine-tuned from OmniVoice on ~380 hours of Egyptian podcast speech. It produces natural Egyptian speech with seamless Arabic ↔ English code-switching, ships 15 built-in studio voices, supports zero-shot voice cloning, and includes a robust Egyptian-Arabic text normalization pipeline plus true streaming for long text.

Why "VoiceTut"? Tut — after the boy-king Tutankhamun (توت عنخ آمون) — anchors the model in Egyptian identity, just as our companion ASR model QwenCleo-ASR is named after Cleopatra. Together they form an Egyptian speech stack: Cleo listens, Tut speaks. 🎙️🗣️

✨ Features

🎯 Egyptian-first — fine-tuned specifically on Egyptian Arabic, not generic MSA.
🔀 Code-switching — handles real Arabic + English mixed speech (عندي meeting بكرة).
🗣️ 15 built-in voices — male & female studio speakers, each with style tags.
🧬 Zero-shot cloning — clone any voice from a few seconds of reference audio.
🔢 Robust normalization — numbers, dates, times, currencies, phones, emails, URLs, abbreviations + a diacritics override table and a custom lexicon.
⚡ True streaming — long text is split into sentences and yielded as audio chunks for low time-to-first-audio.
📦 pip-installable — pip install voicetut-tts (+ OmniVoice from GitHub), or clone and run locally.

📊 Performance

Measured on a single NVIDIA T4 (Colab), float16, num_step=32. Reproduce with examples/04_evaluation.ipynb.

Metric	Value
Real-time factor (RTF, mean)	1.13×
RTF (best)	0.49×
Time-to-first-audio (streaming)	1.68 s
Peak VRAM (fp16)	2.93 GB
WER — Egyptian Arabic	0.40
WER — English	0.07
Speaker similarity (cloning, cosine)	0.83
Naturalness (UTMOS, 1–5)	3.47
Sampling rate	24 kHz

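These numbers are from a T4; expect markedly lower RTF and TTFA on an A100 or H100. Lower RTF is faster, and values below 1× mean audio is generated faster than real time. RTF and TTFA grow with num_step, so drop to num_step=16 for faster, slightly lower-quality output.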
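The README does not spell out how RTF is computed. The sketch below assumes the common definition (synthesis wall-clock time divided by the duration of the generated audio), which matches the statement that faster GPUs give a lower RTF. The function name `real_time_factor` is hypothetical, and the evaluation notebook may measure it differently.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = 24_000) -> float:
    """Return synthesis time divided by the duration of the generated audio.

    Assumed definition: values below 1.0 mean faster than real time.
    The 24 kHz default matches the sampling rate listed in the table above.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Example: 2.0 s of compute for 48,000 samples (2.0 s of 24 kHz audio).
rtf = real_time_factor(2.0, 48_000)
print(f"RTF = {rtf:.2f}x")  # prints "RTF = 1.00x"
```

In practice you would time a `tts.synthesize(...)` call with `time.perf_counter()` and read the sample count from the written WAV file.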

🗣️ Sample Outputs

🎧 Listen to all built-in voices and the VoiceTut vs. base OmniVoice comparisons: pure Egyptian, code-switching, English, normalization, per-speaker, customer-service, and zero-shot cloning samples. Or try them live in the HuggingFace Space.

📦 Installation

Option A — PyPI

# 1. PyTorch matching your CUDA (see https://pytorch.org)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# 2. The OmniVoice backbone (not on PyPI, so install it from GitHub)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# 3. VoiceTut-TTS
pip install voicetut-tts
# optional: web UI deps
pip install "voicetut-tts[web]"

Option B — from source

git clone https://github.com/MohammedAly22/VoiceTuT-TTS.git
cd VoiceTuT-TTS
conda create -n voicetut python=3.10 -y && conda activate voicetut
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/k2-fsa/OmniVoice.git    # backbone
pip install -e ".[web,dev]"

🚀 Usage

Python API

from voicetut_tts import VoiceTutTTS

tts = VoiceTutTTS.from_pretrained("mohammedaly22/VoiceTut-TTS")

# 1) Built-in speaker
tts.synthesize("ازيك عامل ايه النهاردة؟", speaker="Mohamed", output="out.wav")

# 2) Zero-shot voice cloning
tts.synthesize("النهارده الجو حلو اوي",
               ref_audio="my_voice.wav", ref_text="ده الصوت بتاعي",
               output="clone.wav")

# 3) Generation parameters
tts.synthesize("عندي meeting الساعة 3:30",
               speaker="Asmaa", num_step=48, guidance_scale=2.5, speed=1.1,
               output="cs.wav")

True streaming (long text)

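import sounddevice as sd

# `tts` is the VoiceTutTTS instance loaded above; `long_paragraph` is your input text
for sr, chunk in tts.stream(long_paragraph, speaker="Sayed"):
    sd.play(chunk, sr)   # play each sentence as soon as it is ready
    sd.wait()            # block until playback finishes before the next chunk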

# or write a single concatenated file
tts.synthesize_long(long_paragraph, "long.wav", speaker="Sayed")
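The features list describes streaming as splitting long text into sentences and yielding one audio chunk per sentence. The following is a minimal sketch of that idea, not the package's actual implementation: `split_sentences`, `stream_sentences`, and the punctuation set are all assumptions made for illustration, and `synthesize_fn` stands in for any per-sentence synthesis call.

```python
import re
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")

# Split on whitespace that follows sentence-final punctuation.
# The set covers Latin marks plus the Arabic question mark (assumed set).
_SENTENCE_BREAK = re.compile(r"(?<=[.!?؟…])\s+")


def split_sentences(text: str) -> list[str]:
    """Return non-empty sentences, keeping their closing punctuation."""
    return [part.strip() for part in _SENTENCE_BREAK.split(text) if part.strip()]


def stream_sentences(text: str, synthesize_fn: Callable[[str], T]) -> Iterator[T]:
    """Lazily synthesize one sentence at a time.

    Because this is a generator, the first chunk is available after a
    single sentence has been processed, which keeps time-to-first-audio low.
    """
    for sentence in split_sentences(text):
        yield synthesize_fn(sentence)


# Stand-in "synthesizer" that just reports sentence length.
for chunk in stream_sentences("ازيك؟ عامل ايه. تمام!", len):
    print(chunk)
```

A splitter this simple also breaks after abbreviations such as "e.g.", so a real pipeline would need extra rules.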

Text normalization & custom lexicon

from voicetut_tts import ArabicNormalizer

norm = ArabicNormalizer()
norm("عندي 250 جنيه والميعاد 3:30 يوم 14/3/2024")
# -> "عندي ميتين وخمسين جنيه والميعاد تلاتة و نص يوم أربعتاشر مارس ألفين وأربعة وعشرين"

What gets normalized

Input	Output (spoken)
`3:30` / `7:45` / `9:50`	تلاتة و نص / تمانية الا ربع / عشرة الا عشرة (colloquial Egyptian clock)
`01147450629`	زيرو حداشر سبعة وأربعين خمسة وأربعين ستة تسعة وعشرين (Egyptian prefix + 2-digit groups)
`Ahmed` / `Mohamed` / `Mona`	أحمد / محمد / منى (English→Arabic name map)
`250 جنيه` / `75$` / `25%`	ميتين وخمسين جنيه / خمسة وسبعين دولار / خمسة وعشرين في المية
`14/3/2024`	أربعتاشر مارس ألفين وأربعة وعشرين
`a.b@gmail.com`	a dot b at gmail dot com
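Two rows of the table can be illustrated with plain string handling. This is a sketch under stated assumptions, not the code inside `ArabicNormalizer`: the helper names `group_phone_digits` and `spell_email` are hypothetical, and the grouping rule (leading zero, then two-digit groups) is inferred from the single phone example above. Converting each group to Egyptian Arabic words is left out.

```python
def group_phone_digits(number: str) -> list[str]:
    """Split an 11-digit Egyptian mobile number into reading groups.

    Assumed rule, inferred from the table: the leading zero is read on its
    own, and the remaining ten digits are read in pairs.
    """
    digits = "".join(ch for ch in number if ch.isascii() and ch.isdigit())
    if len(digits) != 11 or not digits.startswith("01"):
        raise ValueError("expected an 11-digit number starting with 01")
    groups = [digits[0]]
    groups += [digits[i:i + 2] for i in range(1, len(digits), 2)]
    return groups


def spell_email(address: str) -> str:
    """Replace '@' and '.' with spoken words, as in the table's email row."""
    spoken = address.replace("@", " at ").replace(".", " dot ")
    return " ".join(spoken.split())  # collapse repeated spaces


print(group_phone_digits("01147450629"))  # ['0', '11', '47', '45', '06', '29']
print(spell_email("a.b@gmail.com"))       # a dot b at gmail dot com
```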

Override tables & runtime dictionaries

# diacritized-form overrides (win over the CSV table)
tts.add_lexicon({"تيوت": "تُوت", "نايل": "نَايِل"})

# English-name -> Arabic for correct pronunciation
tts.add_names({"Ziad": "زياد", "Kareem": "كريم"})

Editable data tables ship with the package:

data/diacritics.csv — word,diacritized (Arabic word → diacritized form)
data/names_en_ar.csv — english,arabic (name transliteration)
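If you maintain your own copy of these tables, one way to feed them to the runtime dictionaries is to read the CSV into a dict first. The sketch below assumes each file starts with a header row matching the column names listed above; `load_table` is a hypothetical helper, not part of the package, and the sample rows reuse the entries from the `add_lexicon` example.

```python
import csv
import io
from typing import TextIO


def load_table(handle: TextIO, key_column: str, value_column: str) -> dict[str, str]:
    """Read a two-column CSV (with a header row) into a dict, skipping blank keys."""
    reader = csv.DictReader(handle)
    return {
        row[key_column].strip(): row[value_column].strip()
        for row in reader
        if row.get(key_column, "").strip()
    }


# In-memory sample standing in for a diacritics CSV file.
sample = io.StringIO("word,diacritized\nتيوت,تُوت\nنايل,نَايِل\n")
overrides = load_table(sample, "word", "diacritized")
print(len(overrides))  # 2

# With a loaded model, the result could then be passed on:
# tts.add_lexicon(overrides)
```

For a file on disk, open it with `open(path, encoding="utf-8", newline="")` so the Arabic text and CSV quoting are read correctly.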

CLI

voicetut --list-speakers
voicetut --text "ازيك عامل ايه؟" --speaker Mohamed --output out.wav
voicetut --text "نص طويل..." --speaker Sayed --stream --output long.wav

🌐 Serving

A custom-styled Gradio web UI (black/white + blue theme, speaker dropdown with gender + style tags, reference preview, voice cloning with mic recording, AR/EN language switch, generation params, examples):

pip install "voicetut-tts[web]"
python app.py                                   # default HF checkpoint, port 7860
OMNICLEO_CKPT=exp/omnivoice_egy/checkpoint-8000 python app.py   # local checkpoint
OMNICLEO_SHARE=1 python app.py                  # public share link
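How `app.py` reads these two environment variables is not shown here. A plausible sketch, assuming the default checkpoint is the HuggingFace model ID from the Links section and that sharing is enabled only by the exact value `1`, could look like this; `resolve_launch_config` is a hypothetical name.

```python
import os
from typing import Mapping, Optional

# Assumption: the "default HF checkpoint" is the model ID listed under Links.
DEFAULT_CKPT = "mohammedaly22/VoiceTut-TTS"


def resolve_launch_config(env: Optional[Mapping[str, str]] = None) -> tuple[str, bool]:
    """Return (checkpoint, share) from OMNICLEO_CKPT and OMNICLEO_SHARE."""
    env = os.environ if env is None else env
    checkpoint = env.get("OMNICLEO_CKPT") or DEFAULT_CKPT
    share = env.get("OMNICLEO_SHARE", "0") == "1"
    return checkpoint, share


print(resolve_launch_config({"OMNICLEO_SHARE": "1"}))
# ('mohammedaly22/VoiceTut-TTS', True)
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.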

📓 Examples (Colab)

Notebook	Description
01_quickstart.ipynb	Install, load, synthesize with a built-in voice
02_voice_cloning.ipynb	Zero-shot cloning + normalization & lexicon
03_web_ui.ipynb	Launch the Gradio web UI from Colab (public link)
04_evaluation.ipynb	Measure RTF, TTFA, VRAM, WER, similarity, UTMOS

🔗 Links

🤗 Model: https://huggingface.co/mohammedaly22/VoiceTut-TTS
📦 PyPI: https://pypi.org/project/voicetut-tts/
🧠 Base model: https://github.com/k2-fsa/OmniVoice
🎧 Companion ASR: https://github.com/MohammedAly22/qwencleo-asr

📜 License & Citation

Released under the Apache-2.0 license.

@software{voicetut_tts_2026,
  author  = {Mohammed Aly},
  title   = {VoiceTut-TTS: Egyptian Arabic & Code-Switching Text-to-Speech},
  year    = {2026},
  url     = {https://github.com/MohammedAly22/VoiceTuT-TTS},
  note    = {Fine-tuned from OmniVoice}
}

Release history

Version	Released
0.1.1 (this version)	Jun 19, 2026
0.1.0	Jun 18, 2026

Download files

Source Distribution

voicetut_tts-0.1.1.tar.gz (30.0 kB)

Uploaded Jun 19, 2026 Source

Built Distribution

voicetut_tts-0.1.1-py3-none-any.whl (26.9 kB)

Uploaded Jun 19, 2026 Python 3

File details

Details for the file voicetut_tts-0.1.1.tar.gz.

File metadata

Download URL: voicetut_tts-0.1.1.tar.gz
Upload date: Jun 19, 2026
Size: 30.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for voicetut_tts-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`a555981ef302e62954c0667209cbffebee5ba684a97b0f51e9f43b74f31225c4`
MD5	`516b2485b5a09589781fe097befb6483`
BLAKE2b-256	`1cc3256ff6e7282022bf6a2f505d6ea1ce8346d13053d365351761b8655b5e79`

File details

Details for the file voicetut_tts-0.1.1-py3-none-any.whl.

File metadata

Download URL: voicetut_tts-0.1.1-py3-none-any.whl
Upload date: Jun 19, 2026
Size: 26.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for voicetut_tts-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ad600428fde721d5f991d6dad348e55a44eb0b08d2bfb499389d553a9e0febee`
MD5	`7781879249db0cc6904dddfe4a6fdc3b`
BLAKE2b-256	`8186aabd64792f174eab82bdb6ff672424f80a86f451194e5f45c3efcf58b001`
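To check a downloaded file against the SHA256 digests listed above, Python's standard `hashlib` is enough. This is a generic sketch; the helper names `sha256_of` and `matches_published_hash` are illustrative and not part of the package.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, with the wheel's SHA256 copied from the table above:
# matches_published_hash(
#     "voicetut_tts-0.1.1-py3-none-any.whl",
#     "ad600428fde721d5f991d6dad348e55a44eb0b08d2bfb499389d553a9e0febee",
# )
```

pip can enforce the same check at install time through `--require-hashes` with a pinned requirements file.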
