
9jaLingo TTS-2: Text-to-Speech for Nigerian Languages - English (Nigerian Accent), Hausa, Igbo, Yoruba, Pidgin with Voice Cloning

Project description


9jaLingo TTS-2

Text-to-Speech for Nigerian Languages with Voice Cloning



9jaLingo TTS-2 is a neural text-to-speech engine built for Nigerian languages. It uses a causal language model with learnable RoPE and frame-level position encoding to generate natural-sounding speech, including zero-shot voice cloning via speaker embeddings.

Supported Languages

Language                       Tag
🇳🇬 Nigerian Accented English   en_NG
🇳🇬 Hausa                       ha
🇳🇬 Igbo                        ig
🇳🇬 Yoruba                      yo
🇳🇬 Pidgin                      pcm

Features

  • 5 Nigerian Languages - English (Nigerian Accent), Hausa, Igbo, Yoruba, and Pidgin
  • Voice Cloning - Clone any voice from a short reference audio sample
  • Speaker Embeddings - True voice control through learned speaker representations
  • Learnable RoPE Theta - Per-layer frequency scaling for better position encoding
  • Frame-Level Position Encoding - Precise temporal control with configurable audio frame positioning
  • Language Tag Support - Multi-language support through language identifiers
  • Extended Generation - Up to 40 seconds of continuous high-quality audio
  • Flexible Sampling - Temperature, top-p, and repetition penalty at generation time

Installation

pip install naijalingo-tts-2
pip install -U "transformers==4.56.0"

Quick Start

from naijalingo_tts_2 import NaijaLingoTTS

# Initialize model
model = NaijaLingoTTS('9jalingo/your-model-name')

# Generate speech
audio, text = model("Bawo ni, kilode?", language_tag="yo")

# Save to file
model.save_audio(audio, "output.wav")

Just a few lines for high-quality Nigerian-language TTS! 🎉

Voice Cloning

9jaLingo TTS-2 supports voice cloning: extract a speaker's voice from a short reference recording and generate new speech in that voice.

from naijalingo_tts_2 import NaijaLingoTTS, SpeakerEmbedder

# Initialize
model = NaijaLingoTTS('9jalingo/your-model-name')
embedder = SpeakerEmbedder()

# Extract speaker embedding from reference audio
speaker_embedding = embedder.embed_audio_file("reference_voice.wav")  # [1, 128]

# Generate speech with that voice
audio, text = model(
    "Na so the matter be, my broda.",
    language_tag="pcm",
    speaker_emb=speaker_embedding
)
model.save_audio(audio, "cloned_voice.wav")

How Speaker Embeddings Work

The speaker embedder uses a WavLM-based model trained to extract speaker characteristics:

  1. Input: Audio at any sample rate (3-30 seconds recommended, automatically resampled to 16kHz)
  2. Processing: MVN normalization → WavLM encoder → Stats pooling → Projection → L2 normalization
  3. Output: 128-dim L2-normalized speaker embedding ready for TTS

from naijalingo_tts_2 import SpeakerEmbedder
import torch

embedder = SpeakerEmbedder()

# From audio file
embedding = embedder.embed_audio_file("voice.wav")

# From numpy array
import numpy as np
audio_array = np.random.randn(16000 * 5)  # 5 seconds of audio at 16 kHz (random placeholder)
embedding = embedder.embed_audio(audio_array, sample_rate=16000)

# Save for later
torch.save(embedding, "my_voice.pt")

# Use saved embedding
audio, text = model("Hello!", speaker_emb="my_voice.pt")

Pro tip: Use 10-20 seconds of clean reference audio for best results. Audio at any sample rate is supported (automatic resampling).
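
The three processing steps above can be sketched numerically. This is an illustrative stand-in only, not the package's internals: the WavLM encoder is replaced by random frame features, and the projection matrix is random rather than learned. It shows the shapes flowing through MVN normalization, stats pooling, projection, and L2 normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Input: a stand-in waveform (5 s at 16 kHz), mean-variance normalized
audio = rng.standard_normal(16000 * 5)
audio = (audio - audio.mean()) / (audio.std() + 1e-8)

# 2. Stand-in for the WavLM encoder: waveform -> (T', D) frame features
features = rng.standard_normal((249, 768))

# Stats pooling: concatenate per-dimension mean and std over time
pooled = np.concatenate([features.mean(axis=0), features.std(axis=0)])  # (1536,)

# Projection down to 128 dims (random matrix standing in for the learned layer)
projection = rng.standard_normal((pooled.shape[0], 128))
embedding = pooled @ projection

# 3. L2 normalization: the final 128-dim embedding has unit norm
embedding = embedding / np.linalg.norm(embedding)
print(embedding.shape)  # (128,)
```

The unit norm is what makes embeddings directly comparable with a dot product.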

Language Tag Support

from naijalingo_tts_2 import NaijaLingoTTS

model = NaijaLingoTTS('9jalingo/your-multilingual-model')

# Check available tags
print(f"Status: {model.status}")
model.show_language_tags()

# Generate with specific language
audio, text = model("Sannu da zuwa!", language_tag="ha")       # Hausa
audio, text = model("Kedu ka imere?", language_tag="ig")       # Igbo
audio, text = model("Bawo ni o?", language_tag="yo")           # Yoruba
audio, text = model("How far, my guy?", language_tag="pcm")    # Pidgin
audio, text = model("Good morning everyone.", language_tag="en_NG")  # Nigerian English

Controlling Generation

model = NaijaLingoTTS(
    '9jalingo/your-model-name',
    max_new_tokens=3000,
    suppress_logs=True,
    show_info=True,
)

audio, text = model(
    "Your text here",
    temperature=0.7,           # Lower = more deterministic
    top_p=0.9,                 # Nucleus sampling threshold
    repetition_penalty=1.2,    # Penalize repetition
    speaker_emb=speaker_emb,   # Optional: voice cloning
    language_tag="yo"          # Optional: language tag
)

Model Info Banner

When initialized, the model displays helpful information:

╔════════════════════════════════════════════════════════════╗
║                                                            ║
║                    9 j a L i n g o  TTS-2                  ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

     🗣️  Nigerian Language Text-to-Speech Engine

────────────────────────────────────────────────────────────────
  Model: 9jalingo/your-model-name
  Device: GPU (CUDA)
  Mode: Available language tags (5 language tags)
  Tags: en_NG, ha, ig, yo, pcm

  Configuration:
    • Sample Rate: 22050 Hz
    • Max Tokens: 3000
    • Speaker Embedding Dim: 128
    • Learnable RoPE: Enabled
────────────────────────────────────────────────────────────────

  Supported: English (Nigerian) | Hausa | Igbo | Yoruba | Pidgin
  Voice Cloning: Enabled 🎙️

  Ready to generate speech! 🎵

Playing Audio in Jupyter

from naijalingo_tts_2 import NaijaLingoTTS
from IPython.display import Audio as aplay

model = NaijaLingoTTS('9jalingo/your-model-name')
audio, text = model("E kaabo!", language_tag="yo")

aplay(audio, rate=model.sample_rate)

API Reference

NaijaLingoTTS(model_name, **kwargs)

Main TTS interface.

Parameters:

  • model_name (str): HuggingFace model ID or local path
  • max_new_tokens (int): Max generation length (default: 3000)
  • device_map (str): Device mapping (default: "auto")
  • suppress_logs (bool): Suppress library logs (default: True)
  • show_info (bool): Display model info banner (default: True)

Methods:

  • model(text, language_tag=None, speaker_emb=None, temperature=1.0, top_p=0.95, repetition_penalty=1.1) → (audio, text)
  • model.generate(...) → Same as __call__
  • model.save_audio(audio, path) → Save audio to file
  • model.show_model_info() → Display model banner
  • model.show_language_tags() → Display available language tags
  • model.load_speaker_embedding(path) → Load speaker embedding from a .pt file

SpeakerEmbedder(model_name, device, max_duration_sec)

Extract speaker embeddings from audio.

Parameters:

  • model_name (str): HuggingFace model ID (default: "nineninesix/speaker-emb-tbr")
  • device (str): "cuda" or "cpu" (default: auto-detect)
  • max_duration_sec (float): Max audio length in seconds (default: 30.0)

Methods:

  • embedder.embed_audio(audio, sample_rate=16000) → [1, 128] tensor
  • embedder.embed_audio_file(path) → [1, 128] tensor
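
Because the returned embeddings are L2-normalized, the dot product of two embeddings is their cosine similarity, which is a handy check for whether two reference clips capture the same speaker. A sketch using numpy stand-ins (real embeddings come from embed_audio_file as [1, 128] tensors; convert with .numpy() if needed):

```python
import numpy as np

def speaker_similarity(emb_a, emb_b):
    # For L2-normalized vectors, dot product == cosine similarity in [-1, 1]
    return float(np.dot(emb_a.ravel(), emb_b.ravel()))

# Stand-in embeddings shaped like the embedder's [1, 128] output
rng = np.random.default_rng(0)
a = rng.standard_normal((1, 128)); a /= np.linalg.norm(a)
b = rng.standard_normal((1, 128)); b /= np.linalg.norm(b)

print(speaker_similarity(a, a))  # ≈ 1.0 for identical embeddings
print(speaker_similarity(a, b))  # lower for unrelated vectors
```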

Convenience Function

from naijalingo_tts_2 import compute_speaker_embedding

embedding = compute_speaker_embedding("speaker.wav")

Complete Example: Voice Cloning Pipeline

from naijalingo_tts_2 import NaijaLingoTTS, SpeakerEmbedder
import torch

# 1. Initialize
tts = NaijaLingoTTS('9jalingo/your-model-name')
embedder = SpeakerEmbedder()

# 2. Extract speaker embedding
speaker_emb = embedder.embed_audio_file("reference_speaker.wav")
torch.save(speaker_emb, "my_voice.pt")

# 3. Generate in multiple languages with cloned voice
languages = {
    "en_NG": "Good morning, how are you doing today?",
    "ha":    "Ina kwana, yaya dai?",
    "ig":    "แปคtแปฅtแปฅ แปma, kedu ka แป‹ mere?",
    "yo":    "E kaaro, bawo ni o se wa?",
    "pcm":   "Good morning o, how body?",
}

for lang, text in languages.items():
    audio, _ = tts(text, language_tag=lang, speaker_emb=speaker_emb)
    tts.save_audio(audio, f"output_{lang}.wav")
    print(f"โœ… Generated {lang}: {text}")

Voice Cloning Best Practices

Reference Audio Quality:

  • ✅ Clean recordings without background noise
  • ✅ Proper audio levels (not too quiet, not clipping)
  • ✅ 10-20 seconds of clear speech
  • ✅ Any sample rate (automatic resampling to 16 kHz)
  • ❌ Avoid noisy, compressed, or low-quality recordings

Better Speaker Representation:

# Average multiple samples for a more robust embedding
import torch

sample_files = ["voice_1.wav", "voice_2.wav", "voice_3.wav"]  # your reference clips
embeddings = [embedder.embed_audio_file(f) for f in sample_files]
averaged_embedding = torch.stack(embeddings).mean(dim=0)

Architecture

Two-Stage Pipeline:

  1. Text → Audio Tokens: A modified LFM2 causal LM generates discrete audio tokens
  2. Audio Tokens → Waveform: The NVIDIA NeMo NanoCodec decodes the tokens to 22 kHz audio

Key Innovations:

  • Learnable RoPE - Per-layer frequency scaling for better positional encoding
  • Frame-Level Positions - Audio tokens grouped in frames of 4 with shared positions
  • Speaker Embeddings - 128-dim continuous representations for zero-shot voice cloning
  • Language Tags - Accent and language control via prefix identifiers
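
The frame-level position scheme above can be illustrated in a few lines: each group of 4 audio tokens shares a single position index, so the position sequence advances once per frame rather than once per token. This is a sketch of the indexing idea only, not the package's implementation.

```python
# Sketch: audio tokens grouped in frames of 4 share one position index
FRAME_SIZE = 4

def frame_positions(num_audio_tokens, start=0):
    # Token i gets position start + i // FRAME_SIZE
    return [start + i // FRAME_SIZE for i in range(num_audio_tokens)]

print(frame_positions(10))  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```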

Requirements

  • Python 3.10+
  • CUDA-capable GPU (recommended)
  • PyTorch 2.0+
  • Transformers 4.56.0+
  • NeMo Toolkit

Performance

Generation time for ~10 s of audio:
  GPU (CUDA)   2-5 seconds
  CPU          20-60 seconds

Memory:
  VRAM         4-8 GB (bfloat16)

Responsible Use

Prohibited activities include:

  • Generating false or misleading information
  • Impersonating individuals without consent
  • Hate speech, harassment, or incitement of violence
  • Malicious activities such as spamming, phishing, or fraud

By using this package, you agree to comply with all applicable laws.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

@software{naijalingo_tts_2,
  author = {9jaLingo},
  title = {9jaLingo TTS-2: Text-to-Speech for Nigerian Languages},
  year = {2026},
  publisher = {PyPI},
  howpublished = {\url{https://pypi.org/project/naijalingo-tts-2/}},
  note = {Supports English (Nigerian), Hausa, Igbo, Yoruba, and Pidgin with voice cloning}
}

Made with โค๏ธ by 9jaLingo for Nigeria ๐Ÿ‡ณ๐Ÿ‡ฌ
