Low-latency Levantine Arabic / English code-switching TTS (fine-tuned XTTS-v2)
Leva-TTS
Low-Latency Code-Switching TTS: Levantine Arabic ↔ English
A production-oriented Levantine Text-to-Speech pipeline built on a fine-tuned XTTS-v2 optimized for real-time conversational agents.
| KPI | Target | Measured | Status |
|---|---|---|---|
| Peak VRAM (inference) | ≤ 3 GB | 2.13 GB | ✅ |
| Time-to-First-Audio (p50) | < 300 ms | 565 ms | ⚠️ |
| Real-Time Factor (RTF) | < 0.3 | 0.21 | ✅ |
| Streaming output | required | chunked PCM + WS | ✅ |
Overview
Leva-TTS is a production-oriented streaming TTS system that handles natural code-switching between Levantine Arabic and English, the way real speakers actually talk.
It fine-tunes XTTS-v2 (Coqui) on 50,000 high-quality synthetic Levantine Arabic and code-switching utterances generated by Lahgtna-OmniVoice v2, a zero-shot TTS model already fine-tuned for the Levantine Arabic dialect (ISO 639-3: apc).
Key Features
| Feature | Details |
|---|---|
| Natural code-switching | Intra-sentence Arabic ↔ English |
| Streaming output | First audio chunk in ~565 ms measured (target < 300 ms) |
| Low VRAM | ≤ 3 GB at inference (2.13 GB measured) |
| Levantine dialect | ق→/ʔ/ glottal, ج→/ʒ/, il- article, b- prefix |
| Smart text front-end | Partial diacritics on homographs + Levantine lexicon CSV |
| 10 speakers | 5 male + 5 female, diverse Levantine accents |
| WebSocket streaming | FastAPI server with real-time chunked PCM |
| Pipecat ready | Drop-in TTSService for voice agents |
Performance
Measured on a single NVIDIA H100 (fp16) over a 15-sentence held-out set (6 pure Levantine · 3 pure English · 6 code-switched), speaker Mohamed:
| Metric | Target | Achieved |
|---|---|---|
| Peak VRAM (inference only) | ≤ 3 GB | 2.13 GB ✅ |
| TTFA, streaming (first chunk) | < 300 ms | ~565 ms ⚠️ |
| TTFA, batch p50 | - | 707 ms |
| RTF p50 / p95 | < 0.3 | 0.21 / 0.59 ✅ (p50) |
| Streaming | Required | ✅ |
Notes: RTF p50 is well under target; longer sentences push p95 (0.59) above it. Streaming TTFA (~565 ms) is the time to the first playable audio chunk: XTTS-v2's autoregressive GPT misses the 300 ms streaming target on the first chunk, but audio plays continuously thereafter. VRAM excludes the Whisper model used only during evaluation.
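As a rough illustration of how the figures above relate, the sketch below assumes the usual definition of RTF (wall-clock synthesis time divided by the duration of the audio produced) and computes p50/p95 over per-sentence values. The helper names `rtf` and `percentile` and the timings are made up for illustration; this is not the project's evaluation script.

```python
SAMPLE_RATE = 24_000  # output sample rate stated in this README


def rtf(synthesis_seconds: float, num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Real-time factor: time spent synthesizing / duration of audio produced."""
    return synthesis_seconds / (num_samples / sample_rate)


def percentile(values, q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Illustrative (made-up) measurements: (synthesis seconds, samples generated)
runs = [(0.50, 48_000), (0.63, 72_000), (1.20, 96_000), (0.90, 36_000)]
rtfs = [rtf(seconds, samples) for seconds, samples in runs]
print(f"RTF p50={percentile(rtfs, 50):.2f}  p95={percentile(rtfs, 95):.2f}")
```

An RTF below 1.0 means audio is produced faster than it plays back; a single slow, short utterance is enough to push p95 well above p50.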
Audio Samples
▶ Open the interactive demo page →
Embedded audio players don't run inside GitHub Markdown, so the comparisons live on a GitHub Pages demo with real, playable <audio> players. The tables below link to each clip (click ▶ to play). Progression: Base XTTS-v2 → Lahgtna v2 → Leva-TTS.
Model Comparison
| Text | Speaker | Base XTTS-v2 | Lahgtna v2 | Leva-TTS |
|---|---|---|---|---|
| كيفك اليوم؟ إنت شو عم تعمل؟ | Mohamed (M) | ▶ | ▶ | ▶ |
| شو رأيك نعمل brainstorming session قبل الـ meeting؟ | Mohamed (M) | ▶ | ▶ | ▶ |
Code-Switching (Levantine + English)
| Text | Speaker | Base XTTS-v2 | Lahgtna v2 | Leva-TTS |
|---|---|---|---|---|
| هَلَّق أنا عم أشتغل على the new project. | Badr (M) | ▶ | ▶ | ▶ |
| والله the weather today كتير حلو. | Fatma (F) | ▶ | ▶ | ▶ |
| بِدِّي أحكيلك عن the meeting المهم اليوم. | Mona (F) | ▶ | ▶ | ▶ |
Pure Levantine Arabic
| Text | Speaker | Base XTTS-v2 | Lahgtna v2 | Leva-TTS |
|---|---|---|---|---|
| كيفك اليوم؟ إنت شو عم تعمل هَلَّق؟ | Badr (M) | ▶ | ▶ | ▶ |
| هَلَّق رح أروح على البيت وبكرا برجع. | Amina (F) | ▶ | ▶ | ▶ |
| شو رأيك نطلع نتمشى شوي بعد الشغل؟ | Rami (M) | ▶ | ▶ | ▶ |
Pure English
| Text | Speaker | Base XTTS-v2 | Lahgtna v2 | Leva-TTS |
|---|---|---|---|---|
| Hello, how are you doing today? | Lamyaa (F) | ▶ | ▶ | ▶ |
| The project deadline is next Friday. | Mohamed (M) | ▶ | ▶ | ▶ |
Generate your own:
python scripts/inference.py --text "your text"
Try it on Colab (zero setup)
Run everything on a free Colab T4 GPU with no local install. See examples/ for details.
Getting Started
Leva-TTS supports two usage paths:
| Path | For whom | What you get |
|---|---|---|
| A: pip install | You only want to synthesize speech | The LevaTTS Python class with synthesize, zero_shot_synthesize, stream, and zero_shot_stream. The fine-tuned checkpoint + 10 reference speakers download automatically on first use. |
| B: Clone the repo | You want full control: streaming server, Pipecat, Gradio app, fine-tuning | Everything in A plus the FastAPI server, the Pipecat plugin, the Gradio demo, the evaluation suite, and the training pipeline. |
Path A: pip install (inference only)
1. Create the environment
conda create -n leva-tts python=3.10 -y
conda activate leva-tts
# System audio libraries (Ubuntu/Debian)
sudo apt-get install -y espeak-ng ffmpeg libsndfile1
pip install leva-tts
The first synthesis call auto-downloads the fine-tuned checkpoint and the 10 reference speakers from HuggingFace (mohammedaly22/leva-tts), falling back to the GitHub release. To pre-download:
python -c "import leva_tts; leva_tts.download_model()"
2. Initialize
from leva_tts import LevaTTS, SPEAKERS
tts = LevaTTS(
device="cuda", # "cuda" | "cpu" (auto-detected if omitted)
preprocess_text=True, # Levantine text front-end (numbers, dates, diacritics, lexicon)
verbose=False, # print the text-processing stages
)
print(SPEAKERS)
# ['Badr', 'Mohamed', 'Saad', 'Rami', 'Fadi',
# 'Amina', 'Fatma', 'Lamyaa', 'Mona', 'Haneen']
3. Synthesize with a built-in speaker
synthesize(text, speaker, language="ar", **gen_params) returns (wav, sr), where wav is
a float32 NumPy array and sr is the sample rate (24 kHz). speaker must be one of the 10 names above;
otherwise a ValueError is raised.
import soundfile as sf
wav, sr = tts.synthesize(
    "هَلَّق أنا عم أشتغل على the project",
    speaker="Badr",
    temperature=0.65,  # generation params are optional per-call
    repetition_penalty=5.0,
    top_p=0.85,
    top_k=50,
    speed=1.0,
)
sf.write("output.wav", wav, sr)  # sr == 24000
4. Zero-shot voice cloning
zero_shot_synthesize(text, reference_audio, language="ar", **gen_params) works the
same as synthesize, but you pass a path to your own 3-10 s reference clip
instead of a built-in speaker name.
wav, sr = tts.zero_shot_synthesize(
    "والله the meeting today كانت important كتير",
    "my_voice.wav",
    language="ar",
)
sf.write("cloned.wav", wav, sr)
5. Streaming (generators)
stream(...) and zero_shot_stream(...) mirror the two methods above but
yield audio chunks as they are generated, which suits low-latency playback or
sending over a socket.
import numpy as np
import soundfile as sf
# Built-in speaker
chunks = []
for chunk in tts.stream("بِدِّي أحكيلك عن the new feature هَلَّق", speaker="Amina"):
    chunks.append(chunk)  # play / forward each chunk in real time
sf.write("streamed.wav", np.concatenate(chunks), 24000)
# Zero-shot streaming
for chunk in tts.zero_shot_stream("كيف عم نشتغل على الموضوع", "my_voice.wav"):
    ...
Generation parameters (all optional, valid on every method): temperature,
length_penalty, repetition_penalty, top_k, top_p, speed.
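When forwarding streamed chunks over a socket or to an audio device, the float audio usually has to be converted to 16-bit PCM first. The sketch below assumes each chunk is a sequence of floats in [-1, 1] (the README describes float32 audio at 24 kHz); `float_chunk_to_pcm16` and `write_pcm_wav` are hypothetical helpers, not part of leva_tts, and the fake stream stands in for `tts.stream(...)`.

```python
import struct
import wave

SAMPLE_RATE = 24_000  # output sample rate stated in this README


def float_chunk_to_pcm16(chunk) -> bytes:
    """Clip floats to [-1.0, 1.0] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, float(x))) * 32767) for x in chunk]
    return struct.pack(f"<{len(ints)}h", *ints)


def write_pcm_wav(path: str, pcm_chunks, sample_rate: int = SAMPLE_RATE) -> None:
    """Write an iterable of PCM16 byte chunks to a mono WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wf.setframerate(sample_rate)
        for pcm in pcm_chunks:
            wf.writeframes(pcm)


# Stand-in for a streaming generator: two short chunks of made-up samples.
fake_stream = [[0.0, 0.5, -0.5], [1.0, -1.0, 2.0]]
pcm_chunks = [float_chunk_to_pcm16(c) for c in fake_stream]
```

The per-sample Python loop keeps the example dependency-free; for real chunks a vectorized NumPy conversion would be much faster.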
Path B: Clone the repo (advanced)
For the streaming server, Pipecat integration, the Gradio app, evaluation, or fine-tuning, clone the repo and create the full conda environment.
1. Clone & create the environment
git clone https://github.com/MohammedAly22/Leva-TTS.git
cd Leva-TTS
# System dependencies
sudo apt-get install -y espeak-ng ffmpeg libsndfile1
# Full conda environment (XTTS, training, server, pipecat, gradio)
conda env create -f environment.yml
conda activate leva-tts
pip install -e .
# Optional: GPU training acceleration
bash scripts/install_deepspeed.sh
Download the checkpoint + reference speakers:
python -c "import leva_tts; leva_tts.download_model('./checkpoints')"
2. Inference (CLI)
# Built-in speaker
python scripts/inference.py --text "كيفك اليوم؟" --speaker Amina --out output.wav
# Streaming mode
python scripts/inference.py --text "..." --speaker Badr --stream
# Zero-shot with your own reference audio
python scripts/inference.py --text "..." --ref-audio your_speaker.wav --out clone.wav
3. FastAPI streaming server
# Start the server
LEVA_CHECKPOINT=./checkpoints LEVA_SPEAKER_WAV=./reference_audios/Badr.wav python -m leva_tts.server.app
# Health check
curl http://localhost:8000/health
# Batch synthesize
curl -X POST http://localhost:8000/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text":"كيفك اليوم؟","language":"ar","format":"wav"}' \
  --output output.wav
Endpoints: POST /synthesize (WAV/PCM), WS /stream (real-time chunks),
GET /health, GET /metrics.
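The batch endpoint can also be called from Python. The sketch below builds the same request as the curl example using only the standard library; the JSON fields (text, language, format) are taken from that example, while `build_synthesize_request` is a hypothetical helper and the default base URL assumes the server is running locally on port 8000.

```python
import json
import urllib.request


def build_synthesize_request(
    text: str,
    base_url: str = "http://localhost:8000",
    language: str = "ar",
    fmt: str = "wav",
) -> urllib.request.Request:
    """Build (but do not send) a POST /synthesize request with a JSON body."""
    payload = {"text": text, "language": language, "format": fmt}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_synthesize_request("The project deadline is next Friday.")

# To actually send it (requires the server to be running):
# with urllib.request.urlopen(req, timeout=60) as resp, open("output.wav", "wb") as f:
#     f.write(resp.read())
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.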
4. Pipecat integration
from leva_tts.pipecat_plugin import LevaTTSService
from pipecat.pipeline.pipeline import Pipeline
# Local GPU mode
tts = LevaTTSService(
mode="local",
checkpoint="./checkpoints",
speaker_wav="./reference_audios/Badr.wav",
language="ar",
)
# Remote WebSocket mode (points at the streaming server above)
tts_remote = LevaTTSService(
mode="remote",
server_url="ws://localhost:8000/stream",
language="ar",
)
pipeline = Pipeline([..., tts, ...])
The service emits TTSStartedFrame → TTSAudioRawFrame(s) → TTSStoppedFrame,
streaming audio chunk-by-chunk for conversational latency.
5. Gradio demo
python app.py
# Open http://localhost:7860
Features: 10-speaker dropdown with reference playback · processed-text preview (see exactly what XTTS-v2 receives) · batch synthesis with TTFA / RTF / VRAM metrics · zero-shot upload (any 3-10 s clip) · pre-loaded code-switching examples.
6. Fine-tuning
The full data pipeline (50K synthetic utterances via Lahgtna-OmniVoice v2) and the XTTS-v2 fine-tuning steps are documented in the Data Pipeline section below.
python scripts/prepare_data.py --metadata data/metadata.csv --out data/
python scripts/train.py --config configs/train_config.json
Evaluation
python scripts/evaluate.py --checkpoint checkpoints/
# Skip ASR (faster)
python scripts/evaluate.py --checkpoint checkpoints/ --no-asr
Reports:
- TTFA p50/p95, RTF p50/p95, Peak VRAM
- CER/WER via Whisper large-v3 ASR round-trip
- UTMOS (reference-free neural MOS)
- Per-type breakdown: pure_levantine / pure_english / code_switching
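For readers unfamiliar with the intelligibility metrics, the sketch below shows one common way to compute WER and CER from a reference text and an ASR transcript using Levenshtein distance. It assumes whitespace tokenization for WER and spaces removed for CER; how scripts/evaluate.py normalizes text before scoring is not stated here, so treat the helpers as illustrative.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / max(len(ref_chars), 1)


score = wer("the project deadline is next friday",
            "the project deadline was next friday")
```

Because a single differing letter makes a whole word count as wrong, WER is typically much higher than CER on the same transcript.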
Results (speaker Mohamed, NVIDIA H100, Whisper large-v3 round-trip)
Overall
| Metric | Value |
|---|---|
| Peak VRAM (inference) | 2.13 GB |
| RTF p50 / p95 | 0.36 / 0.53 |
| TTFA p50 / p95 (batch) | 1194 / 1743 ms |
| TTFA streaming (first chunk) | ~565 ms |
| CER (mean) | 0.255 |
| WER (mean) | 0.496 |
| UTMOS (reference-free MOS) | 3.13 / 5.0 |
Per-category (intelligibility via ASR round-trip)
| Category | n | CER ↓ | WER ↓ | RTF ↓ | UTMOS ↑ |
|---|---|---|---|---|---|
| Pure English | 3 | 0.144 | 0.190 | 0.365 | 3.35 |
| Pure Levantine Arabic | 6 | 0.236 | 0.544 | 0.412 | 2.97 |
| Code-Switching | 6 | 0.330 | 0.602 | 0.358 | 3.19 |
Pure English achieves the lowest CER/WER, which suggests English quality is well retained (n = 3, so treat this as indicative). Arabic CER/WER are higher partly because Whisper large-v3 transcribes MSA-normalized Arabic while the references keep Levantine spelling and partial diacritics, so a fraction of the "errors" are orthographic differences, not pronunciation errors. Code-switching is the hardest case (language boundaries), as expected.
v2: Inference Optimization (TF32 + torch.compile)
We provide an optimized inference path that enables TF32 matmul (Hopper/Ampere)
and torch.compile (reduce-overhead) on the autoregressive GPT, the main
latency bottleneck. Run it with:
# Baseline
python scripts/evaluate.py --checkpoint checkpoints --tag default
# Optimized (fp16 GPT path + TF32 + compiled kernels)
python scripts/evaluate.py --checkpoint checkpoints --tag optimized --optimize
Default vs Optimized (same 15-sentence set, speaker Mohamed, H100):
| Metric | Default | Optimized | Δ |
|---|---|---|---|
| RTF p50 | 0.362 | 0.355 | −1.9% |
| RTF p95 | 0.528 | 0.494 | −6.4% |
| TTFA p50 (ms) | 1194 | 1150 | −44 ms |
| UTMOS ↑ | 3.13 | 3.24 | +3.5% |
| CER | 0.255 | 0.173 | (within sampling variance) |
The optimization lowers RTF (p95 −6.4%) and TTFA while slightly improving UTMOS, so quality is preserved. The CER/WER spread between runs is dominated by the sampling temperature (0.65), not the optimization.
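The Δ column can be reproduced from the Default and Optimized columns. A small sketch, assuming the percentages are relative changes with respect to the default run; `relative_change` is a hypothetical helper name.

```python
def relative_change(default: float, optimized: float) -> float:
    """Signed change in percent, relative to the default run."""
    return (optimized - default) / default * 100


rows = {
    "RTF p50": (0.362, 0.355),
    "RTF p95": (0.528, 0.494),
    "UTMOS": (3.13, 3.24),
}
for name, (default, optimized) in rows.items():
    print(f"{name}: {relative_change(default, optimized):+.1f}%")
# RTF p50: -1.9%
# RTF p95: -6.4%
# UTMOS: +3.5%
```

The TTFA row is reported as an absolute difference instead (1150 − 1194 = −44 ms).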
Tried & rejected: Full fp16 on the HiFi-GAN decoder broke the fp32 conv filters in the speaker encoder (dtype mismatch). ONNX export of the autoregressive GPT is non-trivial (KV-cache + dynamic loop) and gave no reliable speedup over torch.compile for streaming, so TF32 + compile is the recommended path.
Data Pipeline
Step 1: Text collection (50K sentences)
python scripts/gather_levantine_text.py
# → data/levantine_50k.txt
Sources:
- GU-CLASP Shami Corpus: 60K real Levantine sentences (Syrian, Lebanese, Palestinian, Jordanian)
- Synthetic code-switching templates (35K+ unique combinations)
Text processing pipeline:
Raw text
→ Unicode NFC + tatweel removal
→ Number verbalization (Levantine: مية not مئة, تلاتة, etc.)
→ ه → ة correction (nouns/adjectives, names; preserves والله, pronoun suffixes)
→ Partial diacritics on homographs (ضَلّ, هَلَّق, مِشْ, بِدِّي, etc.)
→ Levantine lexicon CSV overrides (148 entries)
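The first stage of the pipeline above can be sketched with the standard library alone. This is a minimal illustration of Unicode NFC normalization followed by tatweel (U+0640) removal; `basic_normalize` is a hypothetical name, and the later stages (number verbalization, diacritics, lexicon overrides) depend on project data and are not reproduced here.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic kashida / elongation character


def basic_normalize(text: str) -> str:
    """Apply Unicode NFC composition, then drop tatweel characters."""
    return unicodedata.normalize("NFC", text).replace(TATWEEL, "")


stretched = "ك\u0640\u0640تير"  # "كتير" written with two elongation marks
cleaned = basic_normalize(stretched)
```

Tatweel only stretches a word visually, so removing it keeps differently typed forms of the same word from being treated as distinct tokens.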
Step 2: Audio synthesis with Lahgtna-OmniVoice v2
python scripts/generate_lahgetna_data.py
# → data/synthetic_data/wavs/<spk_id>/*.wav + metadata.csv
| Property | Value |
|---|---|
| Model | oddadmix/lahgtna-omnivoice-v2 |
| Language | apc: North Levantine Arabic (ISO 639-3) |
| Speakers | 10 (5M + 5F), 5,000 utterances each |
| Generation params | temperature=0.7, top_p=0.7, repetition_penalty=1.2 |
Step 3: Data preparation
python scripts/prepare_dataset.py --skip_download
Final training data:
| Source | Language | Utterances | Est. Hours |
|---|---|---|---|
| Lahgtna synthetic (primary) | Levantine AR + EN CS | 50,000 | ~70 h |
| LibriSpeech clean-100 | English | 5,888 | ~20 h |
| Total | | 55,888 | ~90 h |
Step 4: Fine-tuning XTTS-v2
CUDA_VISIBLE_DEVICES=0 python scripts/train.py
# Monitor:
tensorboard --logdir checkpoints/tensorboard --port 6006
Training config (configs/finetune_xtts.yaml):
| Parameter | Value |
|---|---|
| Model | XTTS-v2 GPT backbone |
| Optimizer | AdamW, lr=5e-6 |
| Batch size | 4 (grad_accum=8 → effective 32) |
| Epochs | 30 |
| Checkpoint | Every 2,000 steps |
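For a sense of scale, the figures in this table can be combined into a rough step count. The sketch assumes that all 55,888 utterances from the training-data table are used for training (no held-out split) and that "steps" means optimizer steps after gradient accumulation; neither is stated in the README, so the result is only an estimate.

```python
import math

utterances = 55_888            # total from the training-data table
micro_batch, grad_accum = 4, 8
epochs, ckpt_every = 30, 2_000

effective_batch = micro_batch * grad_accum                 # samples per optimizer step
steps_per_epoch = math.ceil(utterances / effective_batch)  # last partial batch counts
total_steps = steps_per_epoch * epochs
checkpoints = total_steps // ckpt_every

print(effective_batch, steps_per_epoch, total_steps, checkpoints)
# 32 1747 52410 26
```

If the trainer counts micro-batches rather than optimizer steps, the step totals would be 8 times larger.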
Speakers
| # | ID | Name | Gender |
|---|---|---|---|
| 1 | spk_01_male | Badr | Male |
| 2 | spk_02_male | Mohamed | Male |
| 3 | spk_03_male | Saad | Male |
| 4 | spk_04_male | Rami | Male |
| 5 | spk_05_male | Fadi | Male |
| 6 | spk_06_female | Amina | Female |
| 7 | spk_07_female | Fatma | Female |
| 8 | spk_08_female | Lamyaa | Female |
| 9 | spk_09_female | Mona | Female |
| 10 | spk_10_female | Haneen | Female |
Architecture
Why XTTS-v2?
| Requirement | XTTS-v2 | F5-TTS | Kokoro |
|---|---|---|---|
| Native Arabic | โ | โ | โ |
| Code-switching | โ | โ | โ |
| Native streaming | โ | โ | partial |
| RTF < 0.3 | โ | โ (real ~3.0) | โ |
| VRAM โค 3 GB | โ | โ | โ |
Project Structure
leva-tts/
├── leva_tts/
│   ├── text/
│   │   ├── processor.py        ← TextProcessor (normalization + lexicon)
│   │   └── lexicon.py          ← CSV loader
│   ├── inference/
│   │   └── engine.py           ← LevaTTSEngine (streaming, DeepSpeed)
│   ├── server/
│   │   └── app.py              ← FastAPI (POST /synthesize, WS /stream)
│   ├── pipecat_plugin/
│   │   └── leva_tts_service.py ← Pipecat TTSService
│   └── training/
│       └── finetune.py         ← XTTS-v2 GPT fine-tune
├── scripts/
│   ├── train.py                ← Fine-tuning
│   ├── inference.py            ← CLI synthesis (rich UI)
│   └── evaluate.py             ← Full evaluation suite
├── configs/
│   └── finetune_xtts.yaml
├── data/
│   └── levantine_lexicon.csv   ← 148 Levantine dialect overrides
├── reference_audios/
│   ├── references.json         ← 10 speaker reference configs
│   └── *.wav / *.mp3           ← Reference recordings
└── app.py                      ← Gradio demo
License
- Code (this repository and the leva-tts package): Apache-2.0, see LICENSE.
- Model weights (mohammedaly22/leva-tts on HuggingFace): the fine-tuned XTTS-v2 weights inherit Coqui's non-commercial license (CPML) from the base model and are for research / non-commercial use.
Citation
@misc{leva-tts-2026,
title = {Leva-TTS: Levantine Arabic / English Code-Switching TTS},
author = {Mohammed Aly},
year = {2026},
url = {https://huggingface.co/mohammedaly22},
}
Built with ❤️ for natural Levantine Arabic speech synthesis