Low-latency Levantine Arabic / English code-switching TTS (fine-tuned XTTS-v2)

🌿 Leva-TTS

Low-Latency Code-Switching TTS: Levantine Arabic ⇄ English

A production-oriented Levantine Text-to-Speech pipeline built on a fine-tuned XTTS-v2 optimized for real-time conversational agents.

| 🎯 KPI | Target | Measured | Status |
|---|---|---|---|
| Peak VRAM (inference) | ≤ 3 GB | 2.13 GB | ✅ |
| Time-to-First-Audio (p50) | < 300 ms | 565 ms | ⚠️ |
| Real-Time Factor (RTF) | < 0.3 | 0.21 | ✅ |
| Streaming output | required | chunked PCM + WS | ✅ |

🌟 Overview

Leva-TTS is a production-ready streaming TTS system that handles natural code-switching between the Levantine Arabic dialect and English, the way real speakers actually talk.

It fine-tunes XTTS-v2 (Coqui) on 50,000 high-quality synthetic Levantine Arabic + code-switching utterances generated by Lahgtna-OmniVoice v2, a zero-shot TTS model already fine-tuned for the Levantine Arabic dialect (ISO 639-3: apc).

✨ Key Features

| Feature | Details |
|---|---|
| 🗣️ Natural code-switching | Intra-sentence Arabic ↔ English |
| ⚡ Streaming output | Target: first audio chunk < 300 ms (measured: ~565 ms) |
| 💾 Low VRAM | ≤ 3 GB at inference |
| 🌿 Levantine dialect | ق→/ʔ/ glottal, ج→/ʒ/, il- article, b- prefix |
| 🔤 Smart text front-end | Partial diacritics on homographs + Levantine lexicon CSV |
| 👥 10 speakers | 5 male + 5 female, diverse Levantine accents |
| 📡 WebSocket streaming | FastAPI server with real-time chunked PCM |
| 🔌 Pipecat ready | Drop-in TTSService for voice agents |

📊 Performance

Measured on a single NVIDIA H100 (fp16) over a 15-sentence held-out set (6 pure Levantine · 3 pure English · 6 code-switched), speaker Mohamed:

| Metric | Target | Achieved |
|---|---|---|
| 💾 Peak VRAM (inference only) | ≤ 3 GB | 2.13 GB ✅ |
| ⚡ TTFA, streaming (first chunk) | < 300 ms | ~565 ms ⚠️ |
| ⏱️ TTFA, batch p50 | n/a | 707 ms |
| 🎚️ RTF p50 / p95 | < 0.3 | 0.21 / 0.59 ✅ (p50) |
| 📡 Streaming | Required | ✅ |

Notes: RTF p50 is well under the 0.3 target, while p95 (0.59) exceeds it, driven by longer sentences. Streaming TTFA (~565 ms) is the time to the first playable audio chunk: XTTS-v2's autoregressive GPT misses the 300 ms streaming target on the first chunk, but audio plays continuously thereafter. VRAM excludes the Whisper model, which is used only during evaluation.
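
For readers who want to reproduce figures of this kind, the sketch below shows one conventional way to compute RTF and its percentiles. It assumes RTF means synthesis wall-clock time divided by the duration of the generated audio, which this README does not spell out. The helper `rtf` and the sample timings are illustrative and are not project data.

```python
import statistics

SAMPLE_RATE = 24_000  # output rate stated in this README (24 kHz)


def rtf(synthesis_seconds: float, num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Real-time factor: time spent synthesizing / duration of audio produced."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Hypothetical (synthesis time in s, samples produced) pairs, not measured data.
runs = [(0.5, 48_000), (0.6, 72_000), (1.2, 48_000), (0.4, 48_000)]
values = [rtf(seconds, samples) for seconds, samples in runs]

# 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
cuts = statistics.quantiles(values, n=100, method="inclusive")
p50, p95 = cuts[49], cuts[94]
print(f"RTF p50={p50:.3f} p95={p95:.3f}")
```

A single slow utterance is enough to pull p95 far above p50 on a small test set, which is consistent with the gap between the two RTF figures reported above.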


🎵 Audio Samples

🔊 ▶ Open the interactive demo page →

Embedded audio players don't run inside GitHub Markdown, so the comparisons live on a GitHub Pages demo with real, playable <audio> players. The tables below link to each clip (click ▶ to play). Progression: Base XTTS-v2 → Lahgtna v2 → Leva-TTS.

Model Comparison

| Text | Speaker | Base XTTS-v2 | Lahgtna v2 | 🟢 Leva-TTS |
|---|---|---|---|---|
| كيفك اليوم؟ إنت شو عم تعمل؟ | Mohamed (M) | ▶ | ▶ | ▶ |
| شو رأيك نعمل brainstorming session قبل الـ meeting؟ | Mohamed (M) | ▶ | ▶ | ▶ |

🔀 Code-Switching (Levantine + English)

| Text | Speaker | Base XTTS-v2 | Lahgtna v2 | 🟢 Leva-TTS |
|---|---|---|---|---|
| هَلَّق أنا عم أشتغل على the new project. | Badr (M) | ▶ | ▶ | ▶ |
| والله the weather today كتير حلو. | Fatma (F) | ▶ | ▶ | ▶ |
| بِدِّي أحكيلك عن the meeting المهم اليوم. | Mona (F) | ▶ | ▶ | ▶ |

🗣️ Pure Levantine Arabic

| Text | Speaker | Base XTTS-v2 | Lahgtna v2 | 🟢 Leva-TTS |
|---|---|---|---|---|
| كيفك اليوم؟ إنت شو عم تعمل هَلَّق؟ | Badr (M) | ▶ | ▶ | ▶ |
| هَلَّق رح أروح على البيت وبكرا برجع. | Amina (F) | ▶ | ▶ | ▶ |
| شو رأيك نطلع نتمشى شوي بعد الشغل؟ | Rami (M) | ▶ | ▶ | ▶ |

🇬🇧 Pure English

| Text | Speaker | Base XTTS-v2 | Lahgtna v2 | 🟢 Leva-TTS |
|---|---|---|---|---|
| Hello, how are you doing today? | Lamyaa (F) | ▶ | ▶ | ▶ |
| The project deadline is next Friday. | Mohamed (M) | ▶ | ▶ | ▶ |

📝 Generate your own: python scripts/inference.py --text "your text"


🚀 Getting Started

Leva-TTS supports two usage paths:

| Path | For whom | What you get |
|---|---|---|
| A: pip install | You only want to synthesize speech | The LevaTTS Python class: synthesize, zero_shot_synthesize, stream, zero_shot_stream. The fine-tuned checkpoint + 10 reference speakers download automatically on first use. |
| B: Clone the repo | You want full control: streaming server, Pipecat, Gradio app, fine-tuning | Everything in A plus the FastAPI server, the Pipecat plugin, the Gradio demo, the evaluation suite, and the training pipeline. |

📦 Path A: pip install (inference only)

1. Create the environment

conda create -n leva-tts python=3.10 -y
conda activate leva-tts

# System audio libraries (Ubuntu/Debian)
sudo apt-get install -y espeak-ng ffmpeg libsndfile1

pip install leva-tts

First synthesis call auto-downloads the fine-tuned checkpoint and the 10 reference speakers from HuggingFace (mohammedaly22/leva-tts), falling back to the GitHub release. To pre-download:

python -c "import leva_tts; leva_tts.download_model()"

2. Initialize

from leva_tts import LevaTTS, SPEAKERS

tts = LevaTTS(
    device="cuda",          # "cuda" | "cpu" (auto-detected if omitted)
    preprocess_text=True,   # Levantine text front-end (numbers, dates, diacritics, lexicon)
    verbose=False,          # print the text-processing stages
)

print(SPEAKERS)
# ['Badr', 'Mohamed', 'Saad', 'Rami', 'Fadi',
#  'Amina', 'Fatma', 'Lamyaa', 'Mona', 'Haneen']

3. Synthesize with a built-in speaker

synthesize(text, speaker, language="ar", **gen_params) returns (wav, sr), where wav is a float32 NumPy array and sr is the sample rate (24 kHz). speaker must be one of the 10 names above; otherwise a ValueError is raised.

import soundfile as sf

wav, sr = tts.synthesize(
    "هَلَّق أنا عم أشتغل على the project",
    speaker="Badr",
    temperature=0.65,          # generation params are optional per-call
    repetition_penalty=5.0,
    top_p=0.85,
    top_k=50,
    speed=1.0,
)
sf.write("output.wav", wav, sr)   # sr == 24000

4. Zero-shot voice cloning

zero_shot_synthesize(text, reference_audio, language="ar", **gen_params) works like synthesize, but you pass a path to your own 3–10 s reference clip instead of a built-in speaker name.

wav, sr = tts.zero_shot_synthesize(
    "والله the meeting today كانت important كتير",
    "my_voice.wav",
    language="ar",
)
sf.write("cloned.wav", wav, sr)

5. Streaming (generators)

stream(...) and zero_shot_stream(...) mirror the two methods above but yield audio chunks as they are generated, which suits low-latency playback or sending over a socket.

import numpy as np
import soundfile as sf

# Built-in speaker
chunks = []
for chunk in tts.stream("بِدِّي أحكيلك عن the new feature هَلَّق", speaker="Amina"):
    chunks.append(chunk)        # play / forward each chunk in real time
sf.write("streamed.wav", np.concatenate(chunks), 24000)

# Zero-shot streaming
for chunk in tts.zero_shot_stream("هلق عم نشتغل على الموضوع", "my_voice.wav"):
    ...
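
Chunks forwarded over a socket usually need to be serialized first. The sketch below converts one chunk to 16-bit little-endian PCM bytes. It assumes each streamed chunk is a 1-D float32 NumPy array in [-1, 1], matching what synthesize returns. `float_to_pcm16` is a hypothetical helper, not part of the leva_tts API, and the server's actual wire format may differ.

```python
import numpy as np


def float_to_pcm16(chunk: np.ndarray) -> bytes:
    """Convert float audio in [-1, 1] to 16-bit little-endian PCM bytes."""
    clipped = np.clip(chunk, -1.0, 1.0)  # guard against samples that overshoot
    return (clipped * 32767.0).astype("<i2").tobytes()


# Stand-in for one streamed chunk (assumption: 1-D float32 samples).
demo_chunk = np.array([0.0, 0.5, -1.0, 1.2], dtype=np.float32)
pcm = float_to_pcm16(demo_chunk)
print(len(pcm))  # two bytes per sample
```

Clipping before the cast matters because out-of-range floats would otherwise wrap around when converted to int16 and produce audible clicks.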

Generation parameters (all optional, valid on every method): temperature, length_penalty, repetition_penalty, top_k, top_p, speed.


🛠️ Path B: Clone the repo (advanced)

For the streaming server, Pipecat integration, the Gradio app, evaluation, or fine-tuning, clone the repo and create the full conda environment.

1. Clone & create the environment

git clone https://github.com/MohammedAly22/Leva-TTS.git
cd Leva-TTS

# System dependencies
sudo apt-get install -y espeak-ng ffmpeg libsndfile1

# Full conda environment (XTTS, training, server, pipecat, gradio)
conda env create -f environment.yml
conda activate leva-tts
pip install -e .

# Optional: GPU training acceleration
bash scripts/install_deepspeed.sh

Download the checkpoint + reference speakers:

python -c "import leva_tts; leva_tts.download_model('./checkpoints')"

2. Inference (CLI)

# Built-in speaker
python scripts/inference.py --text "كيفك اليوم؟" --speaker Amina --out output.wav

# Streaming mode
python scripts/inference.py --text "..." --speaker Badr --stream

# Zero-shot with your own reference audio
python scripts/inference.py --text "..." --ref-audio your_speaker.wav --out clone.wav

3. FastAPI streaming server

# Start the server
LEVA_CHECKPOINT=./checkpoints LEVA_SPEAKER_WAV=./reference_audios/Badr.wav python -m leva_tts.server.app

# Health check
curl http://localhost:8000/health

# Batch synthesize
curl -X POST http://localhost:8000/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text":"كيفك اليوم؟","language":"ar","format":"wav"}' \
  --output output.wav

Endpoints: POST /synthesize (WAV/PCM), WS /stream (real-time chunks), GET /health, GET /metrics.
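
The same batch request can be issued from Python with only the standard library. This is a minimal sketch that mirrors the JSON fields of the curl example (text, language, format). The function names are hypothetical, and the response is assumed to be the raw audio body, as the curl `--output` usage suggests.

```python
import json
import urllib.request


def build_synthesize_request(text: str, base_url: str = "http://localhost:8000",
                             language: str = "ar", fmt: str = "wav") -> urllib.request.Request:
    """Build a POST /synthesize request with the same JSON body as the curl example."""
    payload = json.dumps({"text": text, "language": language, "format": fmt})
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str) -> None:
    """Send the request and save the response body (needs a running server)."""
    with urllib.request.urlopen(build_synthesize_request(text), timeout=60) as resp:
        with open(out_path, "wb") as fh:
            fh.write(resp.read())


req = build_synthesize_request("كيفك اليوم؟")
print(req.get_method(), req.full_url)
```

Calling `synthesize_to_file("...", "output.wav")` performs the actual network call; building the request separately keeps the payload easy to inspect and test without a server.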

4. Pipecat integration

from leva_tts.pipecat_plugin import LevaTTSService
from pipecat.pipeline.pipeline import Pipeline

# Local GPU mode
tts = LevaTTSService(
    mode="local",
    checkpoint="./checkpoints",
    speaker_wav="./reference_audios/Badr.wav",
    language="ar",
)

# Remote WebSocket mode (points at the streaming server above)
tts_remote = LevaTTSService(
    mode="remote",
    server_url="ws://localhost:8000/stream",
    language="ar",
)

pipeline = Pipeline([..., tts, ...])

The service emits TTSStartedFrame → TTSAudioRawFrame(s) → TTSStoppedFrame, streaming audio chunk-by-chunk for conversational latency.

5. Gradio demo

python app.py
# Open http://localhost:7860

Features: 🎤 10-speaker dropdown with reference playback · 📝 processed-text preview (see exactly what XTTS-v2 receives) · 🎵 batch synthesis with TTFA / RTF / VRAM metrics · 🎙️ zero-shot upload (any 3–10 s clip) · 💡 pre-loaded code-switching examples.

6. Fine-tuning

The full data pipeline (50K synthetic utterances via Lahgtna-OmniVoice v2) and the XTTS-v2 fine-tuning steps are documented in the Data Pipeline section below.

python scripts/prepare_data.py --metadata data/metadata.csv --out data/
python scripts/train.py --config configs/train_config.json

📊 Evaluation

python scripts/evaluate.py --checkpoint checkpoints/

# Skip ASR (faster)
python scripts/evaluate.py --checkpoint checkpoints/ --no-asr

Reports:

  • TTFA p50/p95, RTF p50/p95, Peak VRAM
  • CER/WER via Whisper large-v3 ASR round-trip
  • UTMOS (reference-free neural MOS)
  • Per-type breakdown: pure_levantine / pure_english / code_switching
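
To make the CER/WER figures concrete, here is a minimal sketch of the standard edit-distance definitions. It is an assumption that scripts/evaluate.py uses these exact formulas; it may apply extra text normalization or a dedicated library before scoring.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


print(wer("the project deadline", "the project deadlines"))
```

Because both rates divide by the reference length, a single wrong word in a three-word sentence already costs a WER of one third, which is worth keeping in mind when reading scores on short utterances.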

Results (speaker Mohamed, NVIDIA H100, Whisper large-v3 round-trip)

Overall

| Metric | Value |
|---|---|
| Peak VRAM (inference) | 2.13 GB |
| RTF p50 / p95 | 0.36 / 0.53 |
| TTFA p50 / p95 (batch) | 1194 / 1743 ms |
| TTFA streaming (first chunk) | ~565 ms |
| CER (mean) | 0.255 |
| WER (mean) | 0.496 |
| UTMOS (reference-free MOS) | 3.13 / 5.0 |

Per-category (intelligibility via ASR round-trip)

| Category | n | CER ↓ | WER ↓ | RTF ↓ | UTMOS ↑ |
|---|---|---|---|---|---|
| Pure English | 3 | 0.144 | 0.190 | 0.365 | 3.35 |
| Pure Levantine Arabic | 6 | 0.236 | 0.544 | 0.412 | 2.97 |
| Code-Switching | 6 | 0.330 | 0.602 | 0.358 | 3.19 |

Pure English achieves the lowest CER/WER, which suggests English quality is largely retained (note the small sample, n = 3). Arabic CER/WER are higher partly because Whisper large-v3 transcribes MSA-normalized Arabic while the references keep Levantine spelling and partial diacritics, so a fraction of the "errors" are orthographic differences rather than pronunciation errors. Code-switching is the hardest case (language boundaries), as expected.
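
One way to reduce the purely orthographic part of that gap is to strip diacritics and tatweel from both reference and hypothesis before scoring. The sketch below is an illustration under that assumption. `strip_for_scoring` is a hypothetical helper, the evaluation script is not known to do this, and it does not address dialect-versus-MSA spelling differences.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character


def strip_for_scoring(text: str) -> str:
    """Drop combining marks (Arabic diacritics) and tatweel before comparison."""
    # NFC keeps precomposed letters such as alef-with-hamza intact, so only
    # standalone combining marks (Unicode category Mn) are removed.
    composed = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in composed
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


print(strip_for_scoring("هَلَّق") == "هلق")
```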

⚡ v2: Inference Optimization (TF32 + torch.compile)

We provide an optimized inference path that enables TF32 matmul (Hopper/Ampere) and torch.compile (reduce-overhead) on the autoregressive GPT, the main latency bottleneck. Run it with:

# Baseline
python scripts/evaluate.py --checkpoint checkpoints --tag default

# Optimized (fp16 GPT path + TF32 + compiled kernels)
python scripts/evaluate.py --checkpoint checkpoints --tag optimized --optimize
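
The exact wiring behind --optimize is not shown in this README. As a rough sketch of what enabling TF32 and torch.compile typically looks like in PyTorch 2.x, consider the helper below. `optimize_module` is a hypothetical name, and which submodule to pass (presumably the XTTS-v2 GPT) is an assumption.

```python
def optimize_module(module):
    """Enable TF32 matmuls and wrap a torch.nn.Module with torch.compile."""
    import torch  # imported lazily so this sketch loads without PyTorch installed

    # TF32 only takes effect on Ampere-or-newer NVIDIA GPUs.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    # "reduce-overhead" trades some memory for lower per-call launch overhead,
    # which helps small autoregressive decoding steps.
    return torch.compile(module, mode="reduce-overhead")


# Usage (requires PyTorch 2.x and a CUDA GPU):
#   fast_gpt = optimize_module(gpt_module)  # gpt_module: the model's GPT submodule
```

Compilation happens on the first forward pass, so the first request after startup is slower; warm the model up before measuring TTFA.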

Default vs Optimized (same 15-sentence set, speaker Mohamed, H100):

| Metric | Default | Optimized | Δ |
|---|---|---|---|
| RTF p50 | 0.362 | 0.355 | −1.9% |
| RTF p95 | 0.528 | 0.494 | −6.4% |
| TTFA p50 (ms) | 1194 | 1150 | −44 ms |
| UTMOS ↑ | 3.13 | 3.24 | +3.5% |
| CER | 0.255 | 0.173 | (within sampling variance) |

The optimization lowers RTF (p95 −6.4%) and TTFA while slightly improving UTMOS, so quality is preserved. The CER/WER spread between runs is dominated by the sampling temperature (0.65), not the optimization.

Tried & rejected: Full fp16 on the HiFi-GAN decoder broke the fp32 conv filters in the speaker encoder (dtype mismatch). ONNX export of the autoregressive GPT is non-trivial (KV-cache + dynamic loop) and gave no reliable speedup over torch.compile for streaming, so TF32 + compile is the recommended path.


🏗️ Data Pipeline

Step 1: Text collection (50K sentences)

python scripts/gather_levantine_text.py
# โ†’ data/levantine_50k.txt

Sources:

  • GU-CLASP Shami Corpus: 60K real Levantine sentences (Syrian, Lebanese, Palestinian, Jordanian)
  • Synthetic code-switching templates (35K+ unique combinations)

Text processing pipeline:

Raw text
  → Unicode NFC + tatweel removal
  → Number verbalization (Levantine: مية not مئة, تلاتة, etc.)
  → ه → ة correction (nouns/adjectives, names; preserves والله, pronoun suffixes)
  → Partial diacritics on homographs (ضَلّ, هَلَّق, مِشْ, بِدِّي, etc.)
  → Levantine lexicon CSV overrides (148 entries)
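
The first and last stages of this pipeline can be sketched in a few lines. This is an illustration only: the two-column CSV layout (surface, override), the sample entry, and all function names are assumptions, and the real TextProcessor and levantine_lexicon.csv may differ. The intermediate stages (number verbalization, ه → ة correction, homograph diacritics) are omitted.

```python
import csv
import io
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character

# Hypothetical two-column layout; the real levantine_lexicon.csv may differ.
LEXICON_CSV = "surface,override\nهلق,هَلَّق\n"


def load_lexicon(csv_text: str) -> dict:
    """Read surface -> override pairs from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["surface"]: row["override"] for row in reader}


def normalize(text: str) -> str:
    """Unicode NFC followed by tatweel removal (the first pipeline stage)."""
    return unicodedata.normalize("NFC", text).replace(TATWEEL, "")


def apply_overrides(text: str, lexicon: dict) -> str:
    """Replace whole whitespace-separated tokens that appear in the lexicon."""
    return " ".join(lexicon.get(token, token) for token in text.split())


lexicon = load_lexicon(LEXICON_CSV)
print(apply_overrides(normalize("هلق عم نشتغــل"), lexicon))
```

Running the overrides last means lexicon entries always win over the earlier rule-based stages, which is the usual reason for keeping a hand-curated override list at the end of a text front-end.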

Step 2: Audio synthesis with Lahgtna-OmniVoice v2

python scripts/generate_lahgetna_data.py
# → data/synthetic_data/wavs/<spk_id>/*.wav  +  metadata.csv

| Property | Value |
|---|---|
| Model | oddadmix/lahgtna-omnivoice-v2 |
| Language | apc: North Levantine Arabic (ISO 639-3) |
| Speakers | 10 (5M + 5F), 5,000 utterances each |
| Generation params | temperature=0.7, top_p=0.7, repetition_penalty=1.2 |

Step 3: Data preparation

python scripts/prepare_dataset.py --skip_download

Final training data:

| Source | Language | Utterances | Est. Hours |
|---|---|---|---|
| Lahgtna synthetic (primary) | Levantine AR + EN CS | 50,000 | ~70 h |
| LibriSpeech clean-100 | English | 5,888 | ~20 h |
| Total | | 55,888 | ~90 h |

Step 4: Fine-tuning XTTS-v2

CUDA_VISIBLE_DEVICES=0 python scripts/train.py
# Monitor:
tensorboard --logdir checkpoints/tensorboard --port 6006

Training config (configs/finetune_xtts.yaml):

| Parameter | Value |
|---|---|
| Model | XTTS-v2 GPT backbone |
| Optimizer | AdamW, lr=5e-6 |
| Batch size | 4 (grad_accum=8 → effective 32) |
| Epochs | 30 |
| Checkpoint | Every 2,000 steps |
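
These settings imply a rough training length, worked out below. The arithmetic assumes all 55,888 utterances are used for training with no held-out split, that the last partial batch is kept, and that "steps" means optimizer steps; none of these is confirmed by the config shown here.

```python
import math

utterances = 55_888       # total from the training-data table
batch_size = 4
grad_accum = 8
epochs = 30
checkpoint_every = 2_000  # steps between checkpoints

effective_batch = batch_size * grad_accum                   # samples per optimizer step
steps_per_epoch = math.ceil(utterances / effective_batch)   # optimizer steps per epoch
total_steps = steps_per_epoch * epochs
checkpoints = total_steps // checkpoint_every

print(effective_batch, steps_per_epoch, total_steps, checkpoints)
# → 32 1747 52410 26
```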

👥 Speakers

| # | ID | Name | Gender |
|---|---|---|---|
| 1 | spk_01_male | Badr | Male |
| 2 | spk_02_male | Mohamed | Male |
| 3 | spk_03_male | Saad | Male |
| 4 | spk_04_male | Rami | Male |
| 5 | spk_05_male | Fadi | Male |
| 6 | spk_06_female | Amina | Female |
| 7 | spk_07_female | Fatma | Female |
| 8 | spk_08_female | Lamyaa | Female |
| 9 | spk_09_female | Mona | Female |
| 10 | spk_10_female | Haneen | Female |

🏗️ Architecture

Why XTTS-v2?

| Requirement | XTTS-v2 | F5-TTS | Kokoro |
|---|---|---|---|
| Native Arabic | ✅ | ❌ | ❌ |
| Code-switching | ✅ | ✅ | ❌ |
| Native streaming | ✅ | ❌ | partial |
| RTF < 0.3 | ✅ | ❌ (real ~3.0) | ✅ |
| VRAM ≤ 3 GB | ✅ | ❌ | ✅ |

📁 Project Structure

leva-tts/
├── leva_tts/
│   ├── text/
│   │   ├── processor.py       ← TextProcessor (normalization + lexicon)
│   │   └── lexicon.py         ← CSV loader
│   ├── inference/
│   │   └── engine.py          ← LevaTTSEngine (streaming, DeepSpeed)
│   ├── server/
│   │   └── app.py             ← FastAPI (POST /synthesize, WS /stream)
│   ├── pipecat_plugin/
│   │   └── leva_tts_service.py ← Pipecat TTSService
│   └── training/
│       └── finetune.py        ← XTTS-v2 GPT fine-tune
├── scripts/
│   ├── train.py               ← Fine-tuning
│   ├── inference.py           ← CLI synthesis (rich UI)
│   └── evaluate.py            ← Full evaluation suite
├── configs/
│   └── finetune_xtts.yaml
├── data/
│   └── levantine_lexicon.csv  ← 148 Levantine dialect overrides
├── reference_audios/
│   ├── references.json        ← 10 speaker reference configs
│   └── *.wav / *.mp3          ← Reference recordings
└── app.py                     ← Gradio demo

📄 License

  • Code (this repository and the leva-tts package): Apache-2.0 (see LICENSE).
  • Model weights (mohammedaly22/leva-tts on HuggingFace): the fine-tuned XTTS-v2 weights inherit Coqui's non-commercial license (CPML) from the base model and are for research / non-commercial use.

📜 Citation

@misc{leva-tts-2026,
  title   = {Leva-TTS: Levantine Arabic / English Code-Switching TTS},
  author  = {Mohammed Aly},
  year    = {2026},
  url     = {https://huggingface.co/mohammedaly22},
}

Built with ❤️ for natural Levantine Arabic speech synthesis
