Low-latency Levantine Arabic / English code-switching TTS (fine-tuned XTTS-v2)

🌿 Leva-TTS

Low-Latency Code-Switching TTS — Levantine Arabic ⇄ English

A production-oriented Levantine Text-to-Speech pipeline built on a fine-tuned XTTS-v2 optimized for real-time conversational agents.

🎯 KPI	Target	Measured	Status
Peak VRAM (inference)	≤ 3 GB	2.13 GB	✅
Time-to-First-Audio (p50)	< 300 ms	565 ms	⚠️
Real-Time Factor (RTF)	< 0.3	0.21	✅
Streaming output	required	chunked PCM + WS	✅

🌟 Overview

Leva-TTS is a production-oriented streaming TTS system that handles natural code-switching between Levantine Arabic and English, the way real speakers actually talk.

It fine-tunes XTTS-v2 (Coqui) on 50,000 high-quality synthetic Levantine Arabic + code-switching utterances generated by Lahgtna-OmniVoice v2 — a zero-shot TTS model already fine-tuned for the Levantine Arabic dialect (ISO 639-3: apc).

✨ Key Features

Feature	Details
🗣️ Natural code-switching	Intra-sentence Arabic ↔ English
⚡ Streaming output	Target first audio chunk < 300 ms (measured ~565 ms)
💾 Low VRAM	≤ 3 GB at inference
🌿 Levantine dialect	ق→/ʔ/ glottal, ج→/ʒ/, il- article, b- prefix
🔤 Smart text front-end	Partial diacritics on homographs + Levantine lexicon CSV
👥 10 speakers	5 male + 5 female, diverse Levantine accents
📡 WebSocket streaming	FastAPI server with real-time chunked PCM
🔌 Pipecat ready	Drop-in `TTSService` for voice agents

📊 Performance

Measured on a single NVIDIA H100 (fp16) over a 15-sentence held-out set (6 pure Levantine · 3 pure English · 6 code-switched), speaker Mohamed:

Metric	Target	Achieved
💾 Peak VRAM (inference only)	≤ 3 GB	2.13 GB ✅
⚡ TTFA — streaming (first chunk)	< 300 ms	~565 ms ⚠️
⏱️ TTFA — batch p50	—	707 ms
🎚️ RTF p50 / p95	< 0.3	0.21 / 0.59 ✅ (p50)
📡 Streaming	Required	✅

Notes: RTF p50 is well under target, but longer sentences push p95 (0.59) above it. Streaming TTFA (~565 ms) is the time to the first playable audio chunk; XTTS-v2's autoregressive GPT does not reach the 300 ms target for that first chunk, but audio plays continuously thereafter. VRAM excludes the Whisper model, which is used only during evaluation.
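
As a rough illustration of how the latency figures above are defined, the sketch below computes the real-time factor (synthesis time divided by the duration of the audio produced) and its p50/p95 over a set of runs. The helper names and the sample timings are invented for this example; the project's own evaluation script may aggregate differently.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def percentile(values, q: float) -> float:
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty sequence."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical (synthesis_seconds, audio_seconds) pairs, not measured data.
runs = [(0.5, 2.5), (0.6, 2.0), (1.2, 3.0), (0.9, 4.5), (2.0, 4.0)]
rtfs = [real_time_factor(s, a) for s, a in runs]
print(f"RTF p50={percentile(rtfs, 50):.2f}  p95={percentile(rtfs, 95):.2f}")
```

An RTF below 1.0 means audio is produced faster than it plays back; a p95 well above the p50 indicates that a few long inputs dominate the tail.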

🎵 Audio Samples

🔊 ▶ Open the interactive demo page →

Embedded audio players don't run inside GitHub Markdown, so the comparisons live on a GitHub Pages demo with real, playable <audio> players. The tables below link to each clip (click ▶ to play). Progression: Base XTTS-v2 → Lahgtna v2 → Leva-TTS.

Model Comparison

Text	Speaker	Base XTTS-v2	Lahgtna v2	🟢 Leva-TTS
كيفك اليوم؟ إنت شو عم تعمل؟	Mohamed (M)	▶	▶	▶
شو رأيك نعمل brainstorming session قبل الـ meeting؟	Mohamed (M)	▶	▶	▶

🔀 Code-Switching (Levantine + English)

Text	Speaker	Base XTTS-v2	Lahgtna v2	🟢 Leva-TTS
هَلَّق أنا عم أشتغل على the new project.	Badr (M)	▶	▶	▶
والله the weather today كتير حلو.	Fatma (F)	▶	▶	▶
بِدِّي أحكيلك عن the meeting المهم اليوم.	Mona (F)	▶	▶	▶

🗣️ Pure Levantine Arabic

Text	Speaker	Base XTTS-v2	Lahgtna v2	🟢 Leva-TTS
كيفك اليوم؟ إنت شو عم تعمل هَلَّق؟	Badr (M)	▶	▶	▶
هَلَّق رح أروح على البيت وبكرا برجع.	Amina (F)	▶	▶	▶
شو رأيك نطلع نتمشى شوي بعد الشغل؟	Rami (M)	▶	▶	▶

🇬🇧 Pure English

Text	Speaker	Base XTTS-v2	Lahgtna v2	🟢 Leva-TTS
Hello, how are you doing today?	Lamyaa (F)	▶	▶	▶
The project deadline is next Friday.	Mohamed (M)	▶	▶	▶

📝 Generate your own: python scripts/inference.py --text "your text"

🚀 Getting Started

Leva-TTS supports two usage paths:

Path	For whom	What you get
A — `pip install`	You only want to synthesize speech	The `LevaTTS` Python class — `synthesize`, `zero_shot_synthesize`, `stream`, `zero_shot_stream`. The fine-tuned checkpoint + 10 reference speakers download automatically on first use.
B — Clone the repo	You want full control — streaming server, Pipecat, Gradio app, fine-tuning	Everything in A plus the FastAPI server, the Pipecat plugin, the Gradio demo, the evaluation suite, and the training pipeline.

📦 Path A — `pip install` (inference only)

1. Create the environment

conda create -n leva-tts python=3.10 -y
conda activate leva-tts

# System audio libraries (Ubuntu/Debian)
sudo apt-get install -y espeak-ng ffmpeg libsndfile1

pip install leva-tts

The first synthesis call auto-downloads the fine-tuned checkpoint and the 10 reference speakers from HuggingFace (mohammedaly22/leva-tts), falling back to the GitHub release. To pre-download:
python -c "import leva_tts; leva_tts.download_model()"

2. Initialize

from leva_tts import LevaTTS, SPEAKERS

tts = LevaTTS(
    device="cuda",          # "cuda" | "cpu" (auto-detected if omitted)
    preprocess_text=True,   # Levantine text front-end (numbers, dates, diacritics, lexicon)
    verbose=False,          # print the text-processing stages
)

print(SPEAKERS)
# ['Badr', 'Mohamed', 'Saad', 'Rami', 'Fadi',
#  'Amina', 'Fatma', 'Lamyaa', 'Mona', 'Haneen']

3. Synthesize with a built-in speaker

synthesize(text, speaker, language="ar", **gen_params) returns (wav, sr) — a float32 NumPy array at 24 kHz. speaker must be one of the 10 names above, otherwise a ValueError is raised.

import soundfile as sf

wav, sr = tts.synthesize(
    "هَلَّق أنا عم أشتغل على the project",
    speaker="Badr",
    temperature=0.65,          # generation params are optional per-call
    repetition_penalty=5.0,
    top_p=0.85,
    top_k=50,
    speed=1.0,
)
sf.write("output.wav", wav, sr)   # sr == 24000

4. Zero-shot voice cloning

zero_shot_synthesize(text, reference_audio, language="ar", **gen_params) — same as synthesize, but you pass a path to your own 3–10 s reference clip instead of a built-in speaker name.

wav, sr = tts.zero_shot_synthesize(
    "والله the meeting today كانت important كتير",
    "my_voice.wav",
    language="ar",
)
sf.write("cloned.wav", wav, sr)

5. Streaming (generators)

stream(...) and zero_shot_stream(...) mirror the two methods above but yield audio chunks as they are generated — ideal for low-latency playback or sending over a socket.

import numpy as np
import soundfile as sf

# Built-in speaker
chunks = []
for chunk in tts.stream("بِدِّي أحكيلك عن the new feature هَلَّق", speaker="Amina"):
    chunks.append(chunk)        # play / forward each chunk in real time
sf.write("streamed.wav", np.concatenate(chunks), 24000)

# Zero-shot streaming
for chunk in tts.zero_shot_stream("هلق عم نشتغل على الموضوع", "my_voice.wav"):
    ...
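
To relate a chunk generator such as `tts.stream(...)` to the time-to-first-audio metric, here is a minimal sketch that times the first yielded chunk of any iterable. `fake_stream` and `consume_with_ttfa` are names invented for this example; `fake_stream` only stands in for the real generator so the snippet runs without a GPU or the model.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def consume_with_ttfa(chunks: Iterable) -> Tuple[float, List]:
    """Drain an iterable of audio chunks; return (seconds to first chunk, all chunks)."""
    start = time.perf_counter()
    first_chunk_at = None
    collected = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        collected.append(chunk)
    if first_chunk_at is None:
        raise ValueError("stream produced no chunks")
    return first_chunk_at, collected


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[list]:
    """Hypothetical stand-in for tts.stream(...): yields short silent chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield [0.0] * 240  # 10 ms of silence at 24 kHz


ttfa_s, chunks = consume_with_ttfa(fake_stream())
print(f"TTFA: {ttfa_s * 1000:.1f} ms over {len(chunks)} chunks")
```

With the real model you would pass the generator returned by `tts.stream(...)` instead of `fake_stream()`, and play or forward each chunk inside the loop rather than only collecting it.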

Generation parameters (all optional, valid on every method): temperature, length_penalty, repetition_penalty, top_k, top_p, speed.
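
One possible way to keep these settings together is a small container that forwards only the values you set. This is a sketch: `GenParams` and `as_kwargs` are hypothetical names, not part of the `leva_tts` package, and the example values are taken from the `synthesize` call shown earlier.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenParams:
    """Hypothetical container for the six optional generation parameters."""
    temperature: Optional[float] = None
    length_penalty: Optional[float] = None
    repetition_penalty: Optional[float] = None
    top_k: Optional[int] = None
    top_p: Optional[float] = None
    speed: Optional[float] = None

    def as_kwargs(self) -> dict:
        """Return only the parameters that were explicitly set."""
        return {name: value for name, value in asdict(self).items() if value is not None}


conversational = GenParams(temperature=0.65, repetition_penalty=5.0, top_p=0.85, top_k=50)
kwargs = conversational.as_kwargs()
print(kwargs)

# Intended use (needs the model, so left as a comment):
# wav, sr = tts.synthesize("كيفك اليوم؟", speaker="Badr", **kwargs)
```

Leaving a field as `None` keeps it out of the call entirely, so whatever default the library applies is left untouched.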

🛠️ Path B — Clone the repo (advanced)

For the streaming server, Pipecat integration, the Gradio app, evaluation, or fine-tuning, clone the repo and create the full conda environment.

1. Clone & create the environment

git clone https://github.com/MohammedAly22/Leva-TTS.git
cd Leva-TTS

# System dependencies
sudo apt-get install -y espeak-ng ffmpeg libsndfile1

# Full conda environment (XTTS, training, server, pipecat, gradio)
conda env create -f environment.yml
conda activate leva-tts
pip install -e .

# Optional — GPU training acceleration
bash scripts/install_deepspeed.sh

Download the checkpoint + reference speakers:

python -c "import leva_tts; leva_tts.download_model('./checkpoints')"

2. Inference (CLI)

# Built-in speaker
python scripts/inference.py --text "كيفك اليوم؟" --speaker Amina --out output.wav

# Streaming mode
python scripts/inference.py --text "..." --speaker Badr --stream

# Zero-shot with your own reference audio
python scripts/inference.py --text "..." --ref-audio your_speaker.wav --out clone.wav

3. FastAPI streaming server

# Start the server
LEVA_CHECKPOINT=./checkpoints LEVA_SPEAKER_WAV=./reference_audios/Badr.wav python -m leva_tts.server.app

# Health check
curl http://localhost:8000/health

# Batch synthesize
curl -X POST http://localhost:8000/synthesize \
  -H "Content-Type: application/json" \
  -d '{"text":"كيفك اليوم؟","language":"ar","format":"wav"}' \
  --output output.wav

Endpoints: POST /synthesize (WAV/PCM), WS /stream (real-time chunks), GET /health, GET /metrics.
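
The batch endpoint can also be called from Python. The sketch below mirrors the `curl` example using only the standard library; the JSON fields (`text`, `language`, `format`) and the port come from that example, while the function names and the timeout are assumptions made for illustration.

```python
import json
import urllib.request


def build_synthesize_request(text, language="ar", fmt="wav",
                             base_url="http://localhost:8000"):
    """Build (but do not send) a POST /synthesize request with a JSON body."""
    payload = {"text": text, "language": language, "format": fmt}
    return urllib.request.Request(
        url=f"{base_url}/synthesize",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, timeout_s=60.0, **request_options):
    """Send the request and write the response body (audio bytes) to out_path."""
    request = build_synthesize_request(text, **request_options)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio_bytes)
    return len(audio_bytes)


request = build_synthesize_request("كيفك اليوم؟")
print(request.get_method(), request.full_url)

# With the server running:
# synthesize_to_file("كيفك اليوم؟", "output.wav")
```

Building the request separately from sending it makes the payload easy to inspect or log before any network traffic happens.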

4. Pipecat integration

from leva_tts.pipecat_plugin import LevaTTSService
from pipecat.pipeline.pipeline import Pipeline

# Local GPU mode
tts = LevaTTSService(
    mode="local",
    checkpoint="./checkpoints",
    speaker_wav="./reference_audios/Badr.wav",
    language="ar",
)

# Remote WebSocket mode (points at the streaming server above)
tts_remote = LevaTTSService(
    mode="remote",
    server_url="ws://localhost:8000/stream",
    language="ar",
)

pipeline = Pipeline([..., tts, ...])

The service emits TTSStartedFrame → TTSAudioRawFrame(s) → TTSStoppedFrame, streaming audio chunk-by-chunk for conversational latency.

5. Gradio demo

python app.py
# Open http://localhost:7860

Features: 🎤 10-speaker dropdown with reference playback · 📝 processed-text preview (see exactly what XTTS-v2 receives) · 🎵 batch synthesis with TTFA / RTF / VRAM metrics · 🎙️ zero-shot upload (any 3–10 s clip) · 💡 pre-loaded code-switching examples.

6. Fine-tuning

The full data pipeline (50K synthetic utterances via Lahgtna-OmniVoice v2) and the XTTS-v2 fine-tuning steps are documented in the Data Pipeline section below.

python scripts/prepare_data.py --metadata data/metadata.csv --out data/
python scripts/train.py --config configs/train_config.json

📊 Evaluation

python scripts/evaluate.py --checkpoint checkpoints/

# Skip ASR (faster)
python scripts/evaluate.py --checkpoint checkpoints/ --no-asr

Reports:

TTFA p50/p95, RTF p50/p95, Peak VRAM
CER/WER via Whisper large-v3 ASR round-trip
UTMOS (reference-free neural MOS)
Per-type breakdown: pure_levantine / pure_english / code_switching

Results (speaker Mohamed, NVIDIA H100, Whisper large-v3 round-trip)

Overall

Metric	Value
Peak VRAM (inference)	2.13 GB
RTF p50 / p95	0.36 / 0.53
TTFA p50 / p95 (batch)	1194 / 1743 ms
TTFA streaming (first chunk)	~565 ms
CER (mean)	0.255
WER (mean)	0.496
UTMOS (reference-free MOS)	3.13 / 5.0

Per-category (intelligibility via ASR round-trip)

Category	n	CER ↓	WER ↓	RTF ↓	UTMOS ↑
Pure English	3	0.144	0.190	0.365	3.35
Pure Levantine Arabic	6	0.236	0.544	0.412	2.97
Code-Switching	6	0.330	0.602	0.358	3.19
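
As a consistency check, the overall CER, WER, and UTMOS can be recovered from the per-category rows as sentence-weighted means. The sketch below assumes the overall figures are plain per-sentence averages, which the numbers support but the source does not state explicitly.

```python
# (category, n, CER, WER, UTMOS) copied from the per-category table above.
rows = [
    ("Pure English", 3, 0.144, 0.190, 3.35),
    ("Pure Levantine Arabic", 6, 0.236, 0.544, 2.97),
    ("Code-Switching", 6, 0.330, 0.602, 3.19),
]


def weighted_mean(table_rows, column: int) -> float:
    """Mean of one metric column, weighted by the sentence count in column 1."""
    total_sentences = sum(row[1] for row in table_rows)
    return sum(row[1] * row[column] for row in table_rows) / total_sentences


overall_cer = weighted_mean(rows, 2)
overall_wer = weighted_mean(rows, 3)
overall_utmos = weighted_mean(rows, 4)
print(f"CER={overall_cer:.3f} WER={overall_wer:.3f} UTMOS={overall_utmos:.2f}")
# CER=0.255 WER=0.496 UTMOS=3.13
```

These match the overall table, so the category breakdown and the headline quality numbers come from the same 15-sentence run.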

Pure English achieves the lowest CER/WER, which suggests English quality is well retained, although with only 3 sentences this is indicative rather than conclusive. Arabic CER/WER are higher partly because Whisper large-v3 transcribes MSA-normalized Arabic while the references keep Levantine spelling and partial diacritics, so a fraction of the "errors" are orthographic differences, not pronunciation errors. Code-switching is the hardest case (language boundaries), as expected.
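
To make the orthography point concrete, here is a minimal sketch of CER and WER based on Levenshtein edit distance, plus a diacritic-stripping helper. It is an illustration only: the evaluation script may normalize text differently, and counting spaces as characters in CER is an assumption of this sketch.

```python
import unicodedata


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1,     # deletion
                               current[j - 1] + 1,  # insertion
                               substitution))       # substitution or match
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), e.g. Arabic short vowels."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


reference = "هَلَّق"   # reference keeps partial diacritics
hypothesis = "هلق"     # an ASR transcript typically has none
print(cer(reference, hypothesis), cer(strip_diacritics(reference), hypothesis))
```

The same word scores a non-zero CER purely because of the diacritics, and zero once they are stripped from the reference.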

⚡ v2 — Inference Optimization (TF32 + torch.compile)

We provide an optimized inference path that enables TF32 matmul (Hopper/Ampere) and torch.compile (reduce-overhead) on the autoregressive GPT — the main latency bottleneck. Run it with:

# Baseline
python scripts/evaluate.py --checkpoint checkpoints --tag default

# Optimized (fp16 GPT path + TF32 + compiled kernels)
python scripts/evaluate.py --checkpoint checkpoints --tag optimized --optimize

Default vs Optimized (same 15-sentence set, speaker Mohamed, H100):

Metric	Default	Optimized	Δ
RTF p50	0.362	0.355	−1.9%
RTF p95	0.528	0.494	−6.4%
TTFA p50 (ms)	1194	1150	−44 ms
UTMOS ↑	3.13	3.24	+3.5%
CER	0.255	0.173	(within sampling variance)

The optimization lowers RTF (p95 −6.4%) and TTFA while slightly improving UTMOS — quality is preserved. The CER/WER spread between runs is dominated by the sampling temperature (0.65), not the optimization.
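
The relative changes in the Δ column follow from the Default and Optimized values; a short sketch of the arithmetic, with the numbers copied from the table above:

```python
def percent_change(baseline: float, optimized: float) -> float:
    """Relative change from baseline to optimized, in percent."""
    return (optimized - baseline) / baseline * 100.0


# metric -> (default, optimized), copied from the comparison table.
table = {
    "RTF p50": (0.362, 0.355),
    "RTF p95": (0.528, 0.494),
    "UTMOS": (3.13, 3.24),
}

changes = {name: percent_change(*pair) for name, pair in table.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
# RTF p50: -1.9%
# RTF p95: -6.4%
# UTMOS: +3.5%
```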

Tried & rejected: Casting the HiFi-GAN decoder to full fp16 caused a dtype mismatch with the fp32 conv filters in the speaker encoder. ONNX export of the autoregressive GPT is non-trivial (KV-cache + dynamic loop) and gave no reliable speedup over torch.compile for streaming, so TF32 + compile is the recommended path.

🏗️ Data Pipeline

Step 1 — Text collection (50K sentences)

python scripts/gather_levantine_text.py
# → data/levantine_50k.txt

Sources:

GU-CLASP Shami Corpus — 60K real Levantine sentences (Syrian, Lebanese, Palestinian, Jordanian)
Synthetic code-switching templates (35K+ unique combinations)

Text processing pipeline:

Raw text
  → Unicode NFC + tatweel removal
  → Number verbalization (Levantine: مية not مئة, تلاتة, etc.)
  → ه → ة correction (nouns/adjectives, names — preserves والله, pronoun suffixes)
  → Partial diacritics on homographs (ضَلّ, هَلَّق, مِشْ, بِدِّي, etc.)
  → Levantine lexicon CSV overrides (148 entries)
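
Two of the stages above (NFC plus tatweel removal, and lexicon overrides) can be sketched with the standard library alone. This is a simplified illustration, not the project's `TextProcessor`: the `word,replacement` CSV header, the whole-token matching, and the two sample entries are assumptions, and the number, ه → ة, and homograph stages are omitted.

```python
import csv
import io
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character (kashida)


def normalize(text: str) -> str:
    """Unicode NFC normalization followed by tatweel removal."""
    return unicodedata.normalize("NFC", text).replace(TATWEEL, "")


def load_lexicon(csv_text: str) -> dict:
    """Parse a two-column CSV; the `word,replacement` header is an assumption."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["word"]: row["replacement"] for row in reader}


def apply_lexicon(text: str, lexicon: dict) -> str:
    """Replace whole whitespace-separated tokens that have a lexicon override."""
    return " ".join(lexicon.get(token, token) for token in text.split())


# Hypothetical lexicon content: undiacritized form -> partially diacritized form.
LEXICON_CSV = "word,replacement\nهلق,هَلَّق\nبدي,بِدِّي\n"
lexicon = load_lexicon(LEXICON_CSV)

raw = "ه" + TATWEEL + "لق بدي روح"  # first word contains a tatweel
processed = apply_lexicon(normalize(raw), lexicon)
print(processed)
```

Normalizing before the lookup matters here: the elongated first word only matches its lexicon key once the tatweel has been removed.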

Step 2 — Audio synthesis with Lahgtna-OmniVoice v2

python scripts/generate_lahgetna_data.py
# → data/synthetic_data/wavs/<spk_id>/*.wav  +  metadata.csv

Property	Value
Model	`oddadmix/lahgtna-omnivoice-v2`
Language	`apc` — North Levantine Arabic (ISO 639-3)
Speakers	10 (5M + 5F), 5,000 utterances each
Generation params	temperature=0.7, top_p=0.7, repetition_penalty=1.2

Step 3 — Data preparation

python scripts/prepare_dataset.py --skip_download

Final training data:

Source	Language	Utterances	Est. Hours
Lahgtna synthetic (primary)	Levantine AR + EN CS	50,000	~70 h
LibriSpeech clean-100	English	5,888	~20 h
Total		55,888	~90 h

Step 4 — Fine-tuning XTTS-v2

CUDA_VISIBLE_DEVICES=0 python scripts/train.py
# Monitor:
tensorboard --logdir checkpoints/tensorboard --port 6006

Training config (configs/finetune_xtts.yaml):

Parameter	Value
Model	XTTS-v2 GPT backbone
Optimizer	AdamW, lr=5e-6
Batch size	4 (grad_accum=8 → effective 32)
Epochs	30
Checkpoint	Every 2,000 steps
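
For a sense of scale, the table values imply roughly the following step counts. This is a back-of-the-envelope sketch that assumes all 55,888 utterances are used for training (no held-out split) and that "steps" means optimizer steps; the actual trainer may count differently.

```python
import math

train_utterances = 55_888   # total from the training-data table
batch_size = 4
grad_accum = 8
epochs = 30
checkpoint_every = 2_000    # steps between checkpoints

effective_batch = batch_size * grad_accum
steps_per_epoch = math.ceil(train_utterances / effective_batch)
total_steps = steps_per_epoch * epochs
checkpoints_written = total_steps // checkpoint_every

print(effective_batch, steps_per_epoch, total_steps, checkpoints_written)
# 32 1747 52410 26
```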

👥 Speakers

#	ID	Name	Gender
1	spk_01_male	Badr	Male
2	spk_02_male	Mohamed	Male
3	spk_03_male	Saad	Male
4	spk_04_male	Rami	Male
5	spk_05_male	Fadi	Male
6	spk_06_female	Amina	Female
7	spk_07_female	Fatma	Female
8	spk_08_female	Lamyaa	Female
9	spk_09_female	Mona	Female
10	spk_10_female	Haneen	Female
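
The IDs in this table follow a regular pattern, which is handy when locating a speaker's folder under `data/synthetic_data/wavs/<spk_id>/`. A small sketch that derives the ID from the display name; `speaker_id` is a hypothetical helper, not a function exported by the package.

```python
# Speaker names in table order (same order as the SPEAKERS list shown earlier).
SPEAKERS = ["Badr", "Mohamed", "Saad", "Rami", "Fadi",
            "Amina", "Fatma", "Lamyaa", "Mona", "Haneen"]


def speaker_id(name: str) -> str:
    """Map a display name to its `spk_NN_gender` ID; unknown names raise ValueError."""
    index = SPEAKERS.index(name)
    gender = "male" if index < 5 else "female"
    return f"spk_{index + 1:02d}_{gender}"


print(speaker_id("Amina"))  # spk_06_female
```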

🏗️ Architecture

Why XTTS-v2?

Requirement	XTTS-v2	F5-TTS	Kokoro
Native Arabic	✅	❌	❌
Code-switching	✅	✅	❌
Native streaming	✅	❌	partial
RTF < 0.3	✅	❌ (real ~3.0)	✅
VRAM ≤ 3 GB	✅	❌	✅

📁 Project Structure

leva-tts/
├── leva_tts/
│   ├── text/
│   │   ├── processor.py       ← TextProcessor (normalization + lexicon)
│   │   └── lexicon.py         ← CSV loader
│   ├── inference/
│   │   └── engine.py          ← LevaTTSEngine (streaming, DeepSpeed)
│   ├── server/
│   │   └── app.py             ← FastAPI (POST /synthesize, WS /stream)
│   ├── pipecat_plugin/
│   │   └── leva_tts_service.py ← Pipecat TTSService
│   └── training/
│       └── finetune.py        ← XTTS-v2 GPT fine-tune
├── scripts/
│   ├── train.py               ← Fine-tuning
│   ├── inference.py           ← CLI synthesis (rich UI)
│   └── evaluate.py            ← Full evaluation suite
├── configs/
│   └── finetune_xtts.yaml
├── data/
│   └── levantine_lexicon.csv  ← 148 Levantine dialect overrides
├── reference_audios/
│   ├── references.json        ← 10 speaker reference configs
│   └── *.wav / *.mp3          ← Reference recordings
└── app.py                     ← Gradio demo

📄 License

Code (this repository and the leva-tts package): Apache-2.0 — see LICENSE.
Model weights (mohammedaly22/leva-tts on HuggingFace): the fine-tuned XTTS-v2 weights inherit Coqui's non-commercial license (CPML) from the base model and are for research / non-commercial use.

📜 Citation

@misc{leva-tts-2026,
  title   = {Leva-TTS: Levantine Arabic / English Code-Switching TTS},
  author  = {Mohammed Aly},
  year    = {2026},
  url     = {https://huggingface.co/mohammedaly22},
}

Built with ❤️ for natural Levantine Arabic speech synthesis

