Low-latency Levantine Arabic / English code-switching TTS (fine-tuned XTTS-v2)

🌿 Leva-TTS

Low-Latency Code-Switching TTS: Levantine Arabic ⇄ English

A production-oriented Levantine Text-to-Speech pipeline built on a fine-tuned XTTS-v2 optimized for real-time conversational agents.

| 🎯 KPI | Target | Measured | Status |
| --- | --- | --- | --- |
| Peak VRAM (inference) | ≤ 3 GB | 2.13 GB | ✅ |
| Time-to-First-Audio (p50) | < 300 ms | 565 ms | ⚠️ |
| Real-Time Factor (RTF) | < 0.3 | 0.21 | ✅ |
| Streaming output | required | chunked PCM + WS | ✅ |

🌟 Overview

Leva-TTS is a production-ready streaming TTS system that handles natural code-switching between the Levantine Arabic dialect and English, the way real speakers actually talk.

It fine-tunes XTTS-v2 (Coqui) on 50,000 high-quality synthetic Levantine Arabic + code-switching utterances generated by Lahgtna-OmniVoice v2, a zero-shot TTS model already fine-tuned for the Levantine Arabic dialect (ISO 639-3: apc).

✨ Key Features

| Feature | Details |
| --- | --- |
| 🗣️ Natural code-switching | Intra-sentence Arabic ↔ English |
| ⚡ Streaming output | First audio chunk target < 300 ms (measured ~565 ms on an H100) |
| 💾 Low VRAM | ≤ 3 GB at inference |
| 🌿 Levantine dialect | ق→/ʔ/ glottal, ج→/ʒ/, il- article, b- prefix |
| 🔤 Smart text front-end | Partial diacritics on homographs + Levantine lexicon CSV |
| 👥 10 speakers | 5 male + 5 female, diverse Levantine accents |
| 📡 WebSocket streaming | FastAPI server with real-time chunked PCM |
| 🔌 Pipecat ready | Drop-in TTSService for voice agents |

📊 Performance

Measured on a single NVIDIA H100 (fp16) over a 15-sentence held-out set (6 pure Levantine · 3 pure English · 6 code-switched), speaker Mohamed:

| Metric | Target | Achieved |
| --- | --- | --- |
| 💾 Peak VRAM (inference only) | ≤ 3 GB | 2.13 GB ✅ |
| ⚡ TTFA, streaming (first chunk) | < 300 ms | ~565 ms ⚠️ |
| ⏱️ TTFA, batch p50 | n/a | 707 ms |
| 🎚️ RTF p50 / p95 | < 0.3 | 0.21 / 0.59 ✅ (p50) |
| 📡 Streaming | Required | ✅ |

Notes: RTF p50 is well under target; longer sentences raise p95. Streaming TTFA (~565 ms) is the time to the first playable audio chunk. The autoregressive GPT in XTTS-v2 does not reach the 300 ms streaming target for that first chunk, but audio plays continuously thereafter. VRAM excludes the Whisper model used only during evaluation.
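
For readers unfamiliar with the metrics, here is a minimal sketch of how RTF and its p50 / p95 could be computed from per-utterance timings. The timings are made-up illustration values, the helper names are hypothetical, and the percentile method (inclusive quantiles) is an assumption; the evaluation script may compute them differently.

```python
import statistics

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds

def p50_p95(values):
    """Return (median, 95th percentile) using inclusive quantiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return statistics.median(values), cuts[94]

# Hypothetical (synthesis time, audio duration) pairs in seconds.
runs = [(0.5, 2.5), (0.6, 3.0), (1.2, 4.0), (0.9, 1.5)]
rtfs = [real_time_factor(s, a) for s, a in runs]
p50, p95 = p50_p95(rtfs)
print(f"RTF p50={p50:.3f}  p95={p95:.3f}")
```

An RTF below 1.0 means audio is produced faster than it plays back; a single slow utterance moves p95 far more than p50, which is why the two are reported separately.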


🎵 Audio Samples

🔊 ▶ Open the interactive demo page →


⚡ Try it on Colab (zero setup)

Run everything on a free Colab T4 GPU, with no local install:

| Notebook | Description | Launch |
| --- | --- | --- |
| Quick Start | Synthesize, zero-shot clone, stream | Open In Colab |
| Inference Server | FastAPI streaming server + requests | Open In Colab |
| Evaluation | RTF / TTFA / CER / WER / UTMOS on T4 | Open In Colab |
| Gradio App | Full web demo with a public link | Open In Colab |

See examples/ for details.


🚀 Getting Started

Leva-TTS supports two usage paths:

| Path | For whom | What you get |
| --- | --- | --- |
| A: pip install | You only want to synthesize speech | The LevaTTS Python class (synthesize, zero_shot_synthesize, stream, zero_shot_stream). The fine-tuned checkpoint + 10 reference speakers download automatically on first use. |
| B: Clone the repo | You want full control: streaming server, Pipecat, Gradio app, fine-tuning | Everything in A plus the FastAPI server, the Pipecat plugin, the Gradio demo, the evaluation suite, and the training pipeline. |

📦 Path A: pip install (inference only)

1. Create the environment

conda create -n leva-tts python=3.10 -y
conda activate leva-tts

# System audio libraries (Ubuntu/Debian)
sudo apt-get install -y espeak-ng ffmpeg libsndfile1

2. Install PyTorch first

Install PyTorch before leva-tts so pip locks the right CUDA build for your machine. Pick the command that matches your hardware from https://pytorch.org/get-started/locally/

# CUDA 12.1 (most H100 / A100 / RTX setups)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

# CPU only
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

3. Install leva-tts

pip install leva-tts

Engine note: Leva-TTS depends on coqui-tts, the maintained Coqui fork that exposes the same TTS/XTTS modules. The original TTS package is unmaintained and pins numpy==1.22.0, which cannot resolve against modern librosa/numba on Python 3.10+ (the classic ResolutionImpossible error). coqui-tts ships a coherent, numpy-2-compatible dependency set, so a plain pip install leva-tts resolves cleanly.

First synthesis call auto-downloads the fine-tuned checkpoint and the 10 reference speakers from HuggingFace (mohammedaly22/leva-tts), falling back to the GitHub release. To pre-download:

python -c "import leva_tts; leva_tts.download_model()"

4. Initialize

from leva_tts import LevaTTS, SPEAKERS

tts = LevaTTS(
    device="cuda",          # "cuda" | "cpu" (auto-detected if omitted)
    preprocess_text=True,   # Levantine text front-end (numbers, dates, diacritics, lexicon)
    verbose=False,          # print the text-processing stages
)

print(SPEAKERS)
# ['Badr', 'Mohamed', 'Saad', 'Rami', 'Fadi',
#  'Amina', 'Fatma', 'Lamyaa', 'Mona', 'Haneen']

5. Synthesize with a built-in speaker

synthesize(text, speaker, language="ar", **gen_params) returns (wav, sr): wav is a float32 NumPy array and sr is its sample rate (24 kHz). speaker must be one of the 10 names above; otherwise a ValueError is raised.

import soundfile as sf

wav, sr = tts.synthesize(
    "هَلَّق أنا عم أشتغل على the project",
    speaker="Badr",
    temperature=0.65,          # generation params are optional per-call
    repetition_penalty=5.0,
    top_p=0.85,
    top_k=50,
    speed=1.0,
)
sf.write("output.wav", wav, sr)   # sr == 24000
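
If soundfile is not available, the returned samples can also be written with the standard library alone. The sketch below is an illustration under stated assumptions, not part of leva-tts: the helpers float_to_pcm16 and write_wav are hypothetical, samples are assumed to lie in [-1.0, 1.0], and mono 16-bit output is a choice made here.

```python
import io
import struct
import wave

def float_to_pcm16(samples) -> bytes:
    """Clip floats to [-1.0, 1.0] and pack them as little-endian 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

def write_wav(target, samples, sample_rate: int = 24000) -> None:
    """Write mono 16-bit WAV data to a path or a binary file-like object."""
    with wave.open(target, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(float_to_pcm16(samples))

# Demo on four dummy samples; with real output: write_wav("output.wav", wav, sr)
buf = io.BytesIO()
write_wav(buf, [0.0, 0.5, -1.0, 2.0])
```

The per-sample Python loop is slow for long clips; with NumPy installed, scaling and casting the array in one step is the usual faster route.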

6. Zero-shot voice cloning

zero_shot_synthesize(text, reference_audio, language="ar", **gen_params) works like synthesize, but you pass a path to your own 3–10 s reference clip instead of a built-in speaker name.

wav, sr = tts.zero_shot_synthesize(
    "والله the meeting today كانت important كتير",
    "my_voice.wav",
    language="ar",
)
sf.write("cloned.wav", wav, sr)

7. Streaming (generators)

stream(...) and zero_shot_stream(...) mirror the two methods above but yield audio chunks as they are generated, which suits low-latency playback or sending audio over a socket.

import numpy as np
import soundfile as sf

# Built-in speaker
chunks = []
for chunk in tts.stream("بِدِّي أحكيلك عن the new feature هَلَّق", speaker="Amina"):
    chunks.append(chunk)        # play / forward each chunk in real time
sf.write("streamed.wav", np.concatenate(chunks), 24000)

# Zero-shot streaming
for chunk in tts.zero_shot_stream("هلق عم نشتغل على الموضوع", "my_voice.wav"):
    ...

Generation parameters (all optional, valid on every method): temperature, length_penalty, repetition_penalty, top_k, top_p, speed.
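
To check time-to-first-audio on your own hardware, any chunk generator can be wrapped with a timer. This is a minimal sketch: timed_chunks is a hypothetical helper, and fake_stream stands in for tts.stream(...) so the snippet runs without a GPU or the model.

```python
import time
from typing import Iterable, Iterator, Tuple

def timed_chunks(chunks: Iterable) -> Iterator[Tuple[float, object]]:
    """Yield (seconds_since_start, chunk); the clock starts at the first pull."""
    start = time.perf_counter()
    for chunk in chunks:
        yield time.perf_counter() - start, chunk

def fake_stream(n_chunks: int = 3, delay: float = 0.01):
    """Stand-in for tts.stream(...): waits, then yields a dummy audio chunk."""
    for _ in range(n_chunks):
        time.sleep(delay)
        yield [0.0] * 240

arrivals = [elapsed for elapsed, _ in timed_chunks(fake_stream())]
ttfa = arrivals[0]   # time until the first chunk was available
print(f"TTFA: {ttfa * 1000:.1f} ms over {len(arrivals)} chunks")
```

Because generators run lazily, the first measured interval includes the work done to produce the first chunk. Any setup that happens before the generator is created is not counted.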


🛠️ Path B: Clone the repo (advanced)

For the streaming server, Pipecat integration, the Gradio app, evaluation, or fine-tuning, clone the repo and create the full conda environment.

1. Clone & create the environment

git clone https://github.com/MohammedAly22/Leva-TTS.git
cd Leva-TTS

# System dependencies
sudo apt-get install -y espeak-ng ffmpeg libsndfile1

# Full conda environment (XTTS, training, server, pipecat, gradio)
conda env create -f environment.yml
conda activate leva-tts
pip install -e .

# Optional: GPU training acceleration
bash scripts/install_deepspeed.sh

Download the checkpoint + reference speakers:

python -c "import leva_tts; leva_tts.download_model('./checkpoints')"

2. Inference (CLI)

# Built-in speaker
python scripts/inference.py --text "كيفك اليوم؟" --speaker Amina --out output.wav

# Streaming mode
python scripts/inference.py --text "..." --speaker Badr --stream

# Zero-shot with your own reference audio
python scripts/inference.py --text "..." --ref-audio your_speaker.wav --out clone.wav

3. FastAPI streaming server

# Start the server
LEVA_CHECKPOINT=./checkpoints LEVA_SPEAKER_WAV=./reference_audios/Badr.wav python -m leva_tts.server.app

# Health check
curl http://localhost:8000/health

# Batch synthesize
curl -X POST http://localhost:8000/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text":"كيفك اليوم؟","language":"ar","format":"wav"}' \
  --output output.wav

Endpoints: POST /synthesize (WAV/PCM), WS /stream (real-time chunks), GET /health, GET /metrics.
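
The same batch request can be issued from Python with the standard library. This is a sketch, not an official client: the JSON fields (text, language, format) are taken from the curl example above, the function names are hypothetical, and the response body is assumed to be the raw audio bytes that curl saves with --output.

```python
import json
import urllib.request

def build_synthesize_request(text, language="ar", fmt="wav",
                             base_url="http://localhost:8000"):
    """Build the POST /synthesize request without sending it."""
    payload = json.dumps(
        {"text": text, "language": language, "format": fmt},
        ensure_ascii=False,
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/synthesize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and save the response body (assumed to be audio bytes)."""
    req = build_synthesize_request(text, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

req = build_synthesize_request("hello")
print(req.get_method(), req.full_url)
```

With the server running, synthesize_to_file("hello", "output.wav") performs the actual call. The WS /stream endpoint is left out because its message format is not described here.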

4. Pipecat integration

from leva_tts.pipecat_plugin import LevaTTSService
from pipecat.pipeline.pipeline import Pipeline

# Local GPU mode
tts = LevaTTSService(
    mode="local",
    checkpoint="./checkpoints",
    speaker_wav="./reference_audios/Badr.wav",
    language="ar",
)

# Remote WebSocket mode (points at the streaming server above)
tts_remote = LevaTTSService(
    mode="remote",
    server_url="ws://localhost:8000/stream",
    language="ar",
)

pipeline = Pipeline([..., tts, ...])

The service emits TTSStartedFrame → TTSAudioRawFrame(s) → TTSStoppedFrame, streaming audio chunk-by-chunk for conversational latency.

5. Gradio demo

python app.py
# Open http://localhost:7860

Features: 🎤 10-speaker dropdown with reference playback · 📝 processed-text preview (see exactly what XTTS-v2 receives) · 🎵 batch synthesis with TTFA / RTF / VRAM metrics · 🎙️ zero-shot upload (any 3–10 s clip) · 💡 pre-loaded code-switching examples.

6. Fine-tuning

The full data pipeline (50K synthetic utterances via Lahgtna-OmniVoice v2) and the XTTS-v2 fine-tuning steps are documented in the Data Pipeline section below.

python scripts/prepare_data.py --metadata data/metadata.csv --out data/
python scripts/train.py --config configs/<YOUR_TRAINING_CONFIG>.json

📊 Evaluation

python scripts/evaluate.py --checkpoint checkpoints/

# Skip ASR (faster)
python scripts/evaluate.py --checkpoint checkpoints/ --no-asr

Reports:

  • TTFA p50/p95, RTF p50/p95, Peak VRAM
  • CER/WER via Whisper large-v3 ASR round-trip
  • UTMOS (reference-free neural MOS)
  • Per-type breakdown: pure_levantine / pure_english / code_switching
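
For reference, CER and WER are both edit-distance ratios, computed over characters and over words respectively. The sketch below shows the standard definition in plain Python. It is not the evaluation script's code, and whether that script normalizes text before scoring is not stated here.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

print(wer("the new feature works", "the new future works"))  # one of four words differs
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, so they are error ratios rather than percentages of correct output.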

Results (speaker Mohamed, NVIDIA H100, Whisper large-v3 round-trip)

Overall

| Metric | Value |
| --- | --- |
| Peak VRAM (inference) | 2.13 GB |
| RTF p50 / p95 | 0.36 / 0.53 |
| TTFA p50 / p95 (batch) | 1194 / 1743 ms |
| TTFA streaming (first chunk) | ~565 ms |
| CER (mean) | 0.255 |
| WER (mean) | 0.496 |
| UTMOS (reference-free MOS) | 3.13 / 5.0 |

Per-category (intelligibility via ASR round-trip)

| Category | n | CER ↓ | WER ↓ | RTF ↓ | UTMOS ↑ |
| --- | --- | --- | --- | --- | --- |
| Pure English | 3 | 0.144 | 0.190 | 0.365 | 3.35 |
| Pure Levantine Arabic | 6 | 0.236 | 0.544 | 0.412 | 2.97 |
| Code-Switching | 6 | 0.330 | 0.602 | 0.358 | 3.19 |

Pure English achieves the lowest CER/WER, confirming English quality is well retained. Arabic CER/WER are higher partly because Whisper large-v3 transcribes MSA-normalized Arabic while the references keep Levantine spelling and partial diacritics, so a fraction of the "errors" are orthographic differences rather than pronunciation errors. Code-switching is the hardest case (language boundaries), as expected.
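
One way to separate orthographic mismatches from real errors is to strip diacritics from both reference and hypothesis before scoring. The sketch below illustrates the idea only; it is an assumption, not something the evaluation suite is stated to do, and it does not address MSA versus Levantine spelling differences.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Drop nonspacing marks (category Mn): Arabic harakat, shadda, sukun."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

reference = "\u0647\u064e\u0644\u064e\u0651\u0642"   # هَلَّق, partially diacritized reference
hypothesis = "\u0647\u0644\u0642"                     # هلق, bare ASR-style output

same_before = reference == hypothesis                             # False
same_after = strip_marks(reference) == strip_marks(hypothesis)    # True
print(same_before, same_after)
```

Normalizing to NFC first keeps precomposed letters such as أ intact, so only free-standing marks are removed.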

⚡ v2: Inference Optimization (TF32 + torch.compile)

We provide an optimized inference path that enables TF32 matmul (Hopper/Ampere) and torch.compile (reduce-overhead) on the autoregressive GPT, the main latency bottleneck. Run it with:

# Baseline
python scripts/evaluate.py --checkpoint checkpoints --tag default

# Optimized (fp16 GPT path + TF32 + compiled kernels)
python scripts/evaluate.py --checkpoint checkpoints --tag optimized --optimize

Default vs Optimized (same 15-sentence set, speaker Mohamed, H100):

| Metric | Default | Optimized | Δ |
| --- | --- | --- | --- |
| RTF p50 | 0.362 | 0.355 | −1.9% |
| RTF p95 | 0.528 | 0.494 | −6.4% |
| TTFA p50 (ms) | 1194 | 1150 | −44 ms |
| UTMOS ↑ | 3.13 | 3.24 | +3.5% |
| CER | 0.255 | 0.173 | (within sampling variance) |

The optimization lowers RTF (p95 −6.4%) and TTFA while slightly improving UTMOS, so quality is preserved. The CER/WER spread between runs is dominated by the sampling temperature (0.65), not the optimization.
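
The percentage deltas in the comparison are relative changes against the default run. A short check using the table's own values (the helper name is hypothetical):

```python
def relative_change_pct(baseline: float, optimized: float) -> float:
    """Relative change of `optimized` versus `baseline`, in percent."""
    return (optimized - baseline) / baseline * 100.0

rows = {
    "RTF p50": (0.362, 0.355),
    "RTF p95": (0.528, 0.494),
    "UTMOS":   (3.13, 3.24),
}
deltas = {name: relative_change_pct(base, opt) for name, (base, opt) in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
# prints: RTF p50: -1.9%, RTF p95: -6.4%, UTMOS: +3.5% (one per line)
```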

Tried & rejected: Full fp16 on the HiFi-GAN decoder broke the fp32 conv filters in the speaker encoder (dtype mismatch). ONNX export of the autoregressive GPT is non-trivial (KV-cache + dynamic loop) and gave no reliable speedup over torch.compile for streaming, so TF32 + compile is the recommended path.
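
As a rough illustration of the recommended path, the two switches look like this in PyTorch 2.x. This is a sketch, not the code behind --optimize: enable_tf32_and_compile is a hypothetical helper, and which submodule the project actually compiles is an assumption.

```python
def enable_tf32_and_compile(module):
    """Turn on TF32 matmuls/convolutions and wrap `module` with torch.compile."""
    import torch  # imported lazily so the helper can be defined without PyTorch

    # TF32 trades a little matmul precision for speed on Ampere/Hopper GPUs.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    # "reduce-overhead" aims to cut per-call launch overhead, which matters
    # for autoregressive decoding with many small steps.
    return torch.compile(module, mode="reduce-overhead")

# Hypothetical usage on the autoregressive part of a loaded model:
# model.gpt = enable_tf32_and_compile(model.gpt)
```

Compilation happens on the first calls, so a warm-up synthesis before measuring latency gives more representative numbers.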


🏗️ Data Pipeline

Step 1: Text collection (50K sentences)

python scripts/gather_levantine_text.py
# → data/levantine_50k.txt

Sources:

  • GU-CLASP Shami Corpus: 60K real Levantine sentences (Syrian, Lebanese, Palestinian, Jordanian)
  • Synthetic code-switching templates (35K+ unique combinations)

Text processing pipeline:

Raw text
  → Unicode NFC + tatweel removal
  → Number verbalization (Levantine: مية not مئة, تلاتة, etc.)
  → ه → ة correction (nouns/adjectives, names; preserves والله, pronoun suffixes)
  → Partial diacritics on homographs (ضَلّ, هَلَّق, مِشْ, بِدِّي, etc.)
  → Levantine lexicon CSV overrides (148 entries)
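
Two of the stages above (NFC + tatweel removal, and lexicon overrides) can be sketched with the standard library. This is an illustration under assumptions: the function names and the CSV column names source / replacement are hypothetical, and the real TextProcessor also handles numbers, ه → ة correction and homograph diacritics, which are omitted here.

```python
import csv
import io
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character

def normalize(text: str) -> str:
    """Unicode NFC normalization followed by tatweel removal."""
    return unicodedata.normalize("NFC", text).replace(TATWEEL, "")

def load_lexicon(csv_file) -> dict:
    """Read rows into {source: replacement}; the column names are assumed."""
    return {row["source"]: row["replacement"] for row in csv.DictReader(csv_file)}

def apply_lexicon(text: str, lexicon: dict) -> str:
    """Swap whole whitespace-delimited tokens that have a lexicon entry."""
    return " ".join(lexicon.get(token, token) for token in text.split())

BARE = "\u0647\u0644\u0642"                       # هلق
MARKED = "\u0647\u064e\u0644\u064e\u0651\u0642"   # هَلَّق
lexicon = load_lexicon(io.StringIO(f"source,replacement\n{BARE},{MARKED}\n"))

raw = "\u0647" + TATWEEL + "\u0644\u0642 the project"   # same word with an elongation mark
processed = apply_lexicon(normalize(raw), lexicon)
```

Running normalization before the lexicon lookup matters: the elongated spelling would otherwise miss its lexicon entry.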

Step 2: Audio synthesis with Lahgtna-OmniVoice v2

python scripts/generate_lahgetna_data.py
# → data/synthetic_data/wavs/<spk_id>/*.wav  +  metadata.csv

| Property | Value |
| --- | --- |
| Model | oddadmix/lahgtna-omnivoice-v2 |
| Language | apc (North Levantine Arabic, ISO 639-3) |
| Speakers | 10 (5M + 5F), 5,000 utterances each |
| Generation params | temperature=0.7, top_p=0.7, repetition_penalty=1.2 |

Step 3: Data preparation

python scripts/prepare_dataset.py --skip_download

Final training data:

| Source | Language | Utterances | Est. Hours |
| --- | --- | --- | --- |
| Lahgtna synthetic (primary) | Levantine AR + EN CS | 50,000 | ~70 h |
| LibriSpeech clean-100 | English | 5,888 | ~20 h |
| Total | | 55,888 | ~90 h |

Step 4: Fine-tuning XTTS-v2

CUDA_VISIBLE_DEVICES=0 python scripts/train.py
# Monitor:
tensorboard --logdir checkpoints/tensorboard --port 6006

Training config (configs/finetune_xtts.yaml):

| Parameter | Value |
| --- | --- |
| Model | XTTS-v2 GPT backbone |
| Optimizer | AdamW, lr=5e-6 |
| Batch size | 4 (grad_accum=8 → effective 32) |
| Epochs | 30 |
| Checkpoint | Every 2,000 steps |
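
The configuration implies a rough training-length budget. The arithmetic below is a back-of-the-envelope sketch that assumes all 55,888 utterances are used for training (no held-out split) and that a "step" means one optimizer update; if the trainer counts micro-batches instead, the step counts are 8× larger.

```python
import math

utterances = 55_888          # total from the training-data table
batch_size, grad_accum = 4, 8
epochs = 30
checkpoint_every = 2_000     # steps

effective_batch = batch_size * grad_accum                   # 32
steps_per_epoch = math.ceil(utterances / effective_batch)   # 1747
total_steps = steps_per_epoch * epochs                      # 52410
checkpoints = total_steps // checkpoint_every               # 26

print(effective_batch, steps_per_epoch, total_steps, checkpoints)
```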

👥 Speakers

| # | ID | Name | Gender |
| --- | --- | --- | --- |
| 1 | spk_01_male | Badr | Male |
| 2 | spk_02_male | Mohamed | Male |
| 3 | spk_03_male | Saad | Male |
| 4 | spk_04_male | Rami | Male |
| 5 | spk_05_male | Fadi | Male |
| 6 | spk_06_female | Amina | Female |
| 7 | spk_07_female | Fatma | Female |
| 8 | spk_08_female | Lamyaa | Female |
| 9 | spk_09_female | Mona | Female |
| 10 | spk_10_female | Haneen | Female |

🏗️ Architecture

Why XTTS-v2?

| Requirement | XTTS-v2 | F5-TTS | Kokoro |
| --- | --- | --- | --- |
| Native Arabic | ✅ | ❌ | ❌ |
| Code-switching | ✅ | ✅ | ❌ |
| Native streaming | ✅ | ❌ | partial |
| RTF < 0.3 | ✅ | ❌ (real ~3.0) | ✅ |
| VRAM ≤ 3 GB | ✅ | ❌ | ✅ |

📁 Project Structure

leva-tts/
├── leva_tts/
│   ├── text/
│   │   ├── processor.py       ← TextProcessor (normalization + lexicon)
│   │   └── lexicon.py         ← CSV loader
│   ├── inference/
│   │   └── engine.py          ← LevaTTSEngine (streaming, DeepSpeed)
│   ├── server/
│   │   └── app.py             ← FastAPI (POST /synthesize, WS /stream)
│   ├── pipecat_plugin/
│   │   └── leva_tts_service.py ← Pipecat TTSService
│   └── training/
│       └── finetune.py        ← XTTS-v2 GPT fine-tune
├── scripts/
│   ├── train.py               ← Fine-tuning
│   ├── inference.py           ← CLI synthesis (rich UI)
│   └── evaluate.py            ← Full evaluation suite
├── configs/
│   └── finetune_xtts.yaml
├── data/
│   └── levantine_lexicon.csv  ← 148 Levantine dialect overrides
├── reference_audios/
│   ├── references.json        ← 10 speaker reference configs
│   └── *.wav / *.mp3          ← Reference recordings
└── app.py                     ← Gradio demo

📄 License

  • Code (this repository and the leva-tts package): Apache-2.0; see LICENSE.
  • Model weights (mohammedaly22/leva-tts on HuggingFace): the fine-tuned XTTS-v2 weights inherit Coqui's non-commercial license (CPML) from the base model and are for research / non-commercial use.

📜 Citation

@misc{leva-tts-2026,
  title   = {Leva-TTS: Levantine Arabic / English Code-Switching TTS},
  author  = {Mohammed Aly},
  year    = {2026},
  url     = {https://huggingface.co/mohammedaly22},
}
