Multilingual translator for Kabardian and Caucasian languages with speech synthesis
Kabardian Translator
Voice-Enabled Multilingual Translator for Caucasian Languages
Educational tool for learning Kabardian and Caucasian languages with AI-powered translation and speech synthesis
What's New in v1.0.3
Major Performance Improvements
v1.0.3 brings significant efficiency gains while maintaining practical translation quality:
| Aspect | Before (v1.0) | After (v1.0.3) | Improvement |
|---|---|---|---|
| Disk Space | ~15GB | ~3GB | 5x smaller |
| RAM Usage | 16GB required | 4GB minimum | 4x more efficient |
| Model Size | 1.2B parameters | 418M + 80M×2 | 3x lighter |
| RU↔KBD Quality | Baseline | Improved | Better specialized models |
Key Innovations
1. Specialized Lightweight Models for Kabardian
We trained two dedicated MarianMT models specifically for Russian↔Kabardian translation, one per direction:
- Model: Fine-tuned from Helsinki-NLP OPUS-MT (English-Russian and Russian-Ukrainian base models)
- Size: ~80M parameters each (~300MB per model)
- Training data: 220K parallel sentence pairs (used in both directions) from adiga-ai/circassian-parallel-corpus
- Performance: Outperforms 1.2B M2M100 on Kabardian despite being 15x smaller
Why they perform better:
- Focused on single language pair (not spread across 100+ languages)
- 1200M parameters spread over 100 languages is roughly 12M per language, versus 80M dedicated to a single pair
- Specialized training on Kabardian linguistic patterns
Benchmark Results (500 examples):
| Direction | Model | BLEU | chrF | TER | Size |
|---|---|---|---|---|---|
| RU→KBD | Opus-MT (kubataba) | 8.48 | 32.7 | 86.09 | 300MB |
| RU→KBD | M2M100 1.2B (anzorq) | 6.09 | 33.89 | 84.35 | 2.4GB |
| KBD→RU | Opus-MT (kubataba) | 12.75 | 32.48 | 81.35 | 300MB |
| KBD→RU | M2M100 1.2B (anzorq) | 7.44 | 28.15 | 89.98 | 2.4GB |
Winner: Specialized models deliver a 39% relative BLEU improvement for RU→KBD (8.48 vs 6.09) at 1/8th the size. For KBD→RU they lead on all three metrics; for RU→KBD the M2M100 baseline stays slightly ahead on chrF and TER.
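The relative figures can be recomputed from the table. The short sketch below only restates the BLEU and size columns above; the dictionary layout and the function name `relative_gain` are illustrative and not part of the package.

```python
# BLEU and size figures copied from the benchmark table above.
BENCHMARK = {
    "RU->KBD": {"opus_mt_bleu": 8.48, "m2m100_bleu": 6.09},
    "KBD->RU": {"opus_mt_bleu": 12.75, "m2m100_bleu": 7.44},
}
OPUS_MT_SIZE_MB = 300
M2M100_SIZE_MB = 2400


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


for direction, scores in BENCHMARK.items():
    gain = relative_gain(scores["opus_mt_bleu"], scores["m2m100_bleu"])
    print(f"{direction}: {gain:+.1f}% BLEU")

# Integer ratio of the two on-disk sizes.
print(f"Size ratio: 1/{M2M100_SIZE_MB // OPUS_MT_SIZE_MB}")
```

Relative BLEU gain is only one view of the table; chrF and TER should be read alongside it.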
Note on BLEU scores: The relatively low BLEU scores stem from tokenization limitations inherited from the Russian-English base model:
- Kabardian digraphs and trigraphs (such as кхъ and лӏ) get split into separate tokens
- Complex morphological chains are fragmented
- Rare morphemes and negation markers (-къым) aren't properly identified
- N-gram matches are artificially reduced even when a translation is semantically correct
Despite the low BLEU, translations are semantically accurate and usable for practical purposes.
The main barrier to further improvement is the lack of a specialized tokenizer for Kabardian that properly handles its polysynthetic morphology and rich consonant system. We detail this challenge in our article: Tokenization as the Key to Language Models for Low-Resource Languages (in Russian).
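To make the splitting problem concrete, here is a minimal sketch of multigraph-aware segmentation using greedy longest match. It is not the project's tokenizer (none exists yet, per the note above); the `MULTIGRAPHS` inventory is a small illustrative subset and `segment` is a hypothetical helper.

```python
PALOCHKA = "\u04cf"  # Cyrillic small letter palochka

# Illustrative subset only; a real inventory would be much larger.
MULTIGRAPHS = ("кхъ", "л" + PALOCHKA, "п" + PALOCHKA, "т" + PALOCHKA, "ц" + PALOCHKA)


def segment(word: str, units: tuple[str, ...] = MULTIGRAPHS) -> list[str]:
    """Split `word` into graphemes, preferring the longest known multigraph."""
    ordered = sorted(units, key=len, reverse=True)
    pieces = []
    i = 0
    while i < len(word):
        for unit in ordered:
            if word.startswith(unit, i):
                pieces.append(unit)
                i += len(unit)
                break
        else:
            # No multigraph starts here: emit a single character.
            pieces.append(word[i])
            i += 1
    return pieces


word = "л" + PALOCHKA + "ы"
print(list(word))     # naive character split: three pieces
print(segment(word))  # multigraph-aware split: two pieces
```

A subword tokenizer trained on another language has no such inventory, so the digraph ends up as two unrelated tokens, which is what lowers n-gram overlap.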
2. Optimized Multilingual Model
Replaced heavy M2M100 1.2B with M2M100 418M:
- Size: 1.6GB (down from 4.7GB)
- Languages: Still supports 100+ languages
- Precision: Float32 for stability
- Performance: Comparable quality for most pairs
M2M100 418M Performance (research benchmarks):
- Low-resource pairs: BLEU 8.9-10.1
- Mid-resource pairs: BLEU 21.4-23.4
- High-resource pairs: BLEU 35.0-39.8
3. Dramatically Reduced Requirements
| Computer Type | RAM | Supported Features | Performance |
|---|---|---|---|
| Old laptop | 4GB | Kabardian ↔ Russian | Fast |
| Standard PC | 8GB | All 14 languages | Optimal |
| Apple Silicon | 16GB | MPS acceleration + all languages | Maximum |
4. Enhanced Transliterator
- Updated core transliteration engine to v1.0.3
- More accurate Georgian/Armenian alphabet conversion
- Improved phonetic representation for non-Cyrillic scripts
- Better handling of diacritics and special characters
Core Features
- Smart Translation: 14 languages with specialized Kabardian models
- Voice Synthesis: Text-to-speech with automatic transliteration
- Phonetic Support: Georgian/Armenian alphabets → readable Cyrillic
- Efficiency Optimized: Runs on any computer (4GB+ RAM)
- Modern UI: Dark/light themes, keyboard shortcuts
System Architecture
Translation Pipeline
Direct Translation (Russian ↔ Kabardian):
Input Text → Specialized Opus-MT Model → Output Text
- Uses fine-tuned 80M parameter models
- Best quality for RU↔KBD pairs
- ~200-600ms latency
Cascade Translation (Any Language → Kabardian):
Source Language → M2M100 418M → Russian → Opus-MT → Kabardian
- Two-step process through Russian as pivot
- Supports 100+ languages
- ~400-900ms latency
Multilingual Translation (Non-Kabardian pairs):
Source Language → M2M100 418M → Target Language
- Direct translation between supported languages
- Quality varies by language pair resource availability
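The three paths above amount to a small routing decision. The sketch below shows one way it could be expressed. It is not the code in translation_service.py: `plan_route` and the model labels are hypothetical, and the Kabardian-as-source cascade is an assumed mirror of the documented Any Language → Kabardian path.

```python
KBD = "kbd_Cyrl"
RUS = "rus_Cyrl"


def plan_route(src: str, tgt: str) -> list[tuple[str, str, str]]:
    """Return the (model, source, target) steps for one translation request."""
    if src == tgt:
        return []
    if {src, tgt} == {KBD, RUS}:
        # Direct: one specialized Opus-MT model per direction.
        return [("opus-mt", src, tgt)]
    if tgt == KBD:
        # Cascade into Kabardian with Russian as the pivot.
        return [("m2m100-418m", src, RUS), ("opus-mt", RUS, KBD)]
    if src == KBD:
        # Assumed mirror of the cascade when Kabardian is the source.
        return [("opus-mt", KBD, RUS), ("m2m100-418m", RUS, tgt)]
    # Neither side is Kabardian: M2M100 translates the pair directly.
    return [("m2m100-418m", src, tgt)]


for pair in [(RUS, KBD), ("kat_Geor", KBD), ("deu_Latn", "spa_Latn")]:
    print(pair, "->", plan_route(*pair))
```

Returning a list of steps keeps the caller simple: it runs each step in order and feeds the output of one into the next.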
Voice Synthesis Pipeline
For Cyrillic Languages (Russian, Ukrainian, Belarusian, Kabardian, Kazakh):
Text → Silero TTS → Audio (48kHz WAV)
- Direct synthesis, no preprocessing needed
- High quality (92-98% accuracy)
For Non-Cyrillic Languages (Georgian, Armenian, Turkish, Azerbaijani):
Input Text → Transliterator → Cyrillic Text → Silero TTS → Audio
Transliteration Process:
- Script Detection: Identifies source alphabet (Georgian, Armenian, Latin).
- Phonetic Mapping: Converts characters to closest Cyrillic phonemes.
- Context Rules: Handles digraphs, word boundaries, special cases.
- Target Selection: Routes to appropriate TTS speaker (Russian/Kabardian).
Example Flow:
Georgian: "გამარჯობა"
↓ Transliterator
Cyrillic: "гамарджоба"
↓ Silero TTS (kbd_eduard)
Audio: gamardzhoba.wav
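The first two stages of this flow (script detection and phonetic mapping) can be sketched in a few lines. This is an illustration, not the contents of transliterator.py: the table covers only the letters of the example word plus three ejectives, and `detect_script` and `transliterate` are hypothetical names.

```python
import unicodedata

PALOCHKA = "\u04cf"

# Partial, illustrative table: the letters of the example word plus three
# ejectives. A real table needs every letter of the alphabet.
GEORGIAN_TO_CYRILLIC = {
    "გ": "г", "ა": "а", "მ": "м", "რ": "р", "ჯ": "дж", "ო": "о", "ბ": "б",
    "პ": "п" + PALOCHKA, "ტ": "т" + PALOCHKA, "წ": "ц" + PALOCHKA,
}


def detect_script(text: str) -> str:
    """Return the script of the first alphabetic character, e.g. 'GEORGIAN'."""
    for ch in text:
        if ch.isalpha():
            # Unicode character names start with the script name.
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def transliterate(text: str) -> str:
    """Map Georgian letters to Cyrillic; leave any other input unchanged."""
    if detect_script(text) != "GEORGIAN":
        return text
    return "".join(GEORGIAN_TO_CYRILLIC.get(ch, ch) for ch in text)


print(detect_script("გამარჯობა"))
print(transliterate("გამარჯობა"))
```

Detecting the script from Unicode character names avoids hard-coding code point ranges, at the cost of looking only at the first letter of the input.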
Transliteration Features
- Georgian → Kabardian Cyrillic: Preserves ejectives (პ→пӏ, ტ→тӏ, წ→цӏ)
- Armenian → Hybrid Cyrillic: Maps to Kazakh+Kabardian phonemes
- Turkish/Azerbaijani → Kazakh Cyrillic: Handles ğ contextually, maps ş→ш, ç→ч
- German → Hybrid Cyrillic: sp/st rules, umlauts (ä→э, ö→ө, ü→йу)
- Spanish → Hybrid Cyrillic: ch→ч, ll→й, rr→рр, silent h
- Latvian → Hybrid Cyrillic: Long vowels (ā→аа, ē→ээ), palatalization (ķ→кь, ļ→ль)
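Rules that involve digraphs have to be tried before single-letter rules, otherwise the silent h rule would consume the h of ch. The sketch below shows that ordering for the four Spanish rules listed above (ch→ч, ll→й, rr→рр, silent h). `SPANISH_RULES` and `apply_rules` are hypothetical names, and every other letter is passed through unchanged, which the real mapping would not do.

```python
# Only the four rules named above. Order matters: "ch" must come before "h".
SPANISH_RULES = (
    ("ch", "ч"),
    ("ll", "й"),
    ("rr", "рр"),
    ("h", ""),  # silent h, reached only when "ch" did not match
)


def apply_rules(word: str, rules=SPANISH_RULES) -> str:
    """Rewrite `word` left to right, taking the first rule that matches."""
    lowered = word.lower()
    out = []
    i = 0
    while i < len(lowered):
        for source, target in rules:
            if lowered.startswith(source, i):
                out.append(target)
                i += len(source)
                break
        else:
            # No rule applies: keep the letter as-is (sketch only).
            out.append(lowered[i])
            i += 1
    return "".join(out)


for word in ("hecho", "llama", "perro"):
    print(word, "->", apply_rules(word))
```

The output mixes Latin and Cyrillic letters because only the digraph rules are implemented; the point is the match order, not a complete transliteration.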
Quick Start
System Requirements
- Python: 3.11 or higher
- RAM: 4GB minimum (basic use), 8GB recommended (all languages)
- Storage: ~3GB for all AI models
- OS: Windows, macOS, Linux (any computer!)
Installation via PyPI (Recommended)
# Create virtual environment
python3.11 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install from PyPI
pip install kabardian-translator
# Download AI models (required, ~3GB)
kabardian-download-models
# Launch the application
kabardian-translator
# → Open http://localhost:5500 in your browser
Installation Modes
Minimal Installation (Kabardian ↔ Russian only):
kabardian-download-models --minimal # ~600MB
Full Installation (All 14 languages):
kabardian-download-models --full # ~3GB
Alternative Installation Methods
From GitHub (Development Version)
git clone https://github.com/kubataba/kabardian-translator.git
cd kabardian-translator
python3.11 -m venv venv
source venv/bin/activate
pip install -e .
kabardian-download-models
kabardian-translator
Manual Installation (Legacy)
git clone https://github.com/kubataba/kabardian-translator.git
cd kabardian-translator
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 download_models.py
python3 app.py
CLI Options
# Custom port
kabardian-translator --port 8080
# Localhost only (more secure)
kabardian-translator --host localhost --port 5500
# CPU-only mode (for 4GB RAM systems)
kabardian-translator --cpu-only
# Debug mode
kabardian-translator --debug
# Help
kabardian-translator --help
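For readers who want to script around these flags, the sketch below mirrors them with `argparse`. It is not the package's actual parser: `build_parser` is a hypothetical name, the default host is an assumption, and the default port is taken from the http://localhost:5500 address shown in Quick Start.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line flags."""
    parser = argparse.ArgumentParser(prog="kabardian-translator")
    # Default host is an assumption; the docs only say that "localhost"
    # is the more secure choice.
    parser.add_argument("--host", default="0.0.0.0", help="interface to bind")
    parser.add_argument("--port", type=int, default=5500, help="HTTP port")
    parser.add_argument("--cpu-only", action="store_true", help="disable GPU/MPS")
    parser.add_argument("--debug", action="store_true", help="enable debug mode")
    return parser


args = build_parser().parse_args(["--host", "localhost", "--port", "8080", "--cpu-only"])
print(args.host, args.port, args.cpu_only, args.debug)
```

`kabardian-translator --help` remains the authoritative list of options and defaults.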
Performance Optimizations
Technical Improvements
- Specialized MarianMT models: Fine-tuned specifically for Kabardian, achieving better results than multilingual models
- M2M100 418M: About 3x fewer parameters than the 1.2B model used in v1.0, still supports 100+ languages
- Float32 stability: No precision loss, more reliable inference
- Automatic memory cleanup: Stable long-term operation
- Lazy model loading: Only loads models when needed
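Lazy loading and memory cleanup can be illustrated with a small wrapper that defers the expensive load until first use. This is a sketch of the general pattern, not the project's implementation; `LazyModel` and `fake_loader` are hypothetical, and a string stands in for real model weights.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Hold a loader and call it only when the model is first requested."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._model: Optional[T] = None

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> T:
        if self._model is None:
            self._model = self._loader()  # expensive step, runs once
        return self._model

    def unload(self) -> None:
        """Drop the reference so the memory can be reclaimed."""
        self._model = None


load_calls = []


def fake_loader() -> str:
    load_calls.append("load")
    return "model-weights"


ru_kbd = LazyModel(fake_loader)
print(ru_kbd.loaded)    # nothing loaded at startup
ru_kbd.get()
ru_kbd.get()
print(len(load_calls))  # the loader ran only once
```

With one wrapper per model, a 4GB machine that only translates Russian ↔ Kabardian never pays for the multilingual model.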
Performance Comparison
On 4GB RAM Computer:
- Startup: ~5 seconds
- RU↔KBD translation: 200-600ms
- Memory usage: ~2GB peak
On 8GB RAM Computer:
- Startup: ~8 seconds
- Any translation: 300-800ms
- Memory usage: ~4GB peak
On 16GB RAM with Apple Silicon:
- Startup: ~10 seconds
- MPS-accelerated translation: 150-400ms
- Memory usage: ~6GB peak
Quality and Performance
Translation Quality by Direction
| Language Pair | BLEU Range | Quality | Model Type |
|---|---|---|---|
| Russian ↔ Kabardian | 9-13 | Good | Specialized Opus-MT |
| Any ↔ Kabardian (via Russian) | 9-15 | Acceptable | Cascade (2 models) |
| Low-resource pairs | 9-10 | Acceptable | M2M100 418M |
| Mid-resource pairs | 15-20 | Good | M2M100 418M |
| High-resource pairs | >30 | Excellent | M2M100 418M |
Note: BLEU scores for Kabardian are artificially low due to tokenization mismatch, not actual translation quality. M2M100 418M performance varies significantly based on language pair resource availability.
Voice Synthesis Quality
| Language | Accuracy | Method | Quality |
|---|---|---|---|
| Russian, Ukrainian, Belarusian | 95-98% | Direct (Silero V5 CIS) | Excellent |
| Kabardian, Kazakh | 92-95% | Direct (Silero V5 CIS) | Excellent |
| Georgian, Armenian | 88-92% | Transliteration → TTS | Good |
| Turkish, Azerbaijani | 85-88% | Transliteration → TTS | Good |
| German, Spanish, Latvian | 78-82% | Transliteration → TTS | Acceptable |
Practical Applications
- For Schools & Universities: Works even in computer labs with old PCs
- For Personal Use: Runs on any home computer
- For Field Research: Smaller size makes it easier to share and install
- For Developers: Easier to test and modify with reduced resource requirements
- For Language Learners: Accessible tool for practicing Kabardian and related languages
Known Limitations
Translation Limitations
- Tokenization challenges: The main barrier to higher BLEU scores (see technical explanation above)
- Kazakh/Georgian quality: M2M100 418M has known issues with these specific language pairs (inherent model limitation)
- Technical vocabulary: May struggle with modern technical terms not in training corpus
- Context length: Limited to 512 tokens per translation
- Low-resource reality: As a low-resource language tool, performance cannot match high-resource language pairs
TTS Limitations
- Max 200 characters per synthesis
- Imperfect pronunciation for transliterated languages
- No intonation control
- Stress marks not shown in transliteration
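Because synthesis is capped at 200 characters, longer text has to be split before it reaches the TTS model. A minimal sketch of one way to do that is shown below, assuming the caller synthesizes the chunks one by one. `chunk_for_tts` is a hypothetical helper, not part of the package.

```python
import re

MAX_TTS_CHARS = 200  # limit stated in the list above


def _pack(units: list[str], limit: int) -> list[str]:
    """Greedily join units with spaces without exceeding `limit`."""
    chunks, current = [], ""
    for unit in units:
        candidate = f"{current} {unit}" if current else unit
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = unit
    if current:
        chunks.append(current)
    return chunks


def chunk_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    units: list[str] = []
    for sentence in sentences:
        if len(sentence) <= limit:
            units.append(sentence)
            continue
        # Sentence too long: fall back to words, hard-cutting any word
        # that exceeds the limit on its own.
        for word in sentence.split():
            units.extend(word[i:i + limit] for i in range(0, len(word), limit))
    return _pack(units, limit)


sample = "Слово. " * 100
print([len(chunk) for chunk in chunk_for_tts(sample)])
```

Splitting at sentence boundaries first keeps each chunk a natural unit of speech, which matters because intonation cannot be controlled afterwards.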
The Tokenization Challenge
The biggest obstacle to improving Kabardian NMT models is creating a specialized tokenizer that properly handles:
- Polysynthetic morphology with complex affixation
- 50+ consonant phonemes including ejectives
- Digraphs and trigraphs (such as кхъ and лӏ) as single units
- Morphological negation and modality markers
- Ergative-absolutive case system
Read more: Tokenization as the Key to Language Models for Low-Resource Languages - detailed technical analysis of this challenge (in Russian).
Troubleshooting
Low RAM Systems (4GB)
# Minimal installation (Kabardian ↔ Russian only)
kabardian-download-models --minimal
# Force CPU mode
kabardian-translator --cpu-only
Insufficient Disk Space
# Check available space
df -h
# Use minimal installation
kabardian-download-models --minimal # Only 600MB
Models Won't Download
# Try mirror if Hugging Face is blocked
export HF_ENDPOINT=https://hf-mirror.com
kabardian-download-models
Quick System Check
# Test without downloading models
python -c "from kabardian_translator import check_models; check_models()"
# Check compatibility
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
Command Not Found
# Reinstall package
pip uninstall kabardian-translator
pip install kabardian-translator
# Or use Python module call
python -m kabardian_translator.cli --port 5500
License and Usage
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Allowed: Personal, educational, research, modifications, distribution with attribution
Prohibited: Commercial use, profit-driven services, integration into paid products
Full license: https://creativecommons.org/licenses/by-nc/4.0/
Acknowledgments
Special thanks for v1.0.3 optimization:
- anzorq - Created the Circassian-Russian parallel corpus and fine-tuned M2M100 baseline models
- Helsinki-NLP - OPUS-MT base models
- M2M100 - M2M100 418M framework
- Silero Team - High-quality TTS models
- Hugging Face - Infrastructure and Transformers library
- Kabardian language community - Testing, feedback, and support
Support and Contribution
- Found a bug? → GitHub Issues
- Want to help? → Fork → Branch → Commit → Pull Request
- Run benchmarks? → See benchmarks/README.md for reproducible tests
- Questions? → Check Troubleshooting section
- Technical discussion: Read our article on tokenization challenges
Migration from v1.0
If you had the old version installed:
# Remove old models (free up ~12GB!)
rm -rf models/
# Update to new version
pip install --upgrade kabardian-translator
# Download new optimized models
kabardian-download-models --full
Migration benefits:
- Save 12GB disk space
- Works on any computer (4GB+ RAM)
- Improved quality for Russian↔Kabardian
- More stable operation
- Faster installation
Roadmap
- v1.1 (Q1 2026): Expanding North Caucasian Languages Support
- v1.2 (Q2 2026): API, Redis caching, user history, batch translation
- v2.0 (Q3 2026): Mobile app, offline mode, Telegram Bot
- Future: Custom Kabardian tokenizer for improved translation quality
Additional Resources
- PyPI Package - Official package repository
- Benchmark Scripts - Reproducible performance tests
- Benchmark Results - Detailed test results (500 examples)
- M2M100 418M Documentation
- MarianMT Framework
- Specialized Models - RU↔KBD Opus-MT models
- Training Corpus - by anzorq
- Tokenization Article - Technical deep-dive
- PyTorch Optimization Guide
Technical Specifications
Model Details
| Model | Parameters | Size | Purpose |
|---|---|---|---|
| Opus-MT RU→KBD | 80M | 300MB | Russian → Kabardian (specialized) |
| Opus-MT KBD→RU | 80M | 300MB | Kabardian → Russian (specialized) |
| M2M100 418M | 418M | 1.6GB | 100+ languages (multilingual) |
| Silero TTS V5 CIS | - | ~50MB | Voice synthesis (Russian/Kabardian) |
Total: All models occupy ~2.3GB vs ~15GB in v1.0
System Components
Core Modules
Translation Engine (translation_service.py):
- Manages 3 translation models (2ร Opus-MT + M2M100)
- Lazy loading for memory efficiency
- Automatic cascade routing for unsupported pairs
- Preprocessing: Palochka (ӏ) handling for Kabardian
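Palochka handling usually means normalizing the look-alike characters that people type in its place. The sketch below shows one plausible version of that preprocessing step. It is an assumption, not the code in translation_service.py: which palochka form the models expect is not stated here, so mapping everything to the lowercase form (U+04CF) is a guess, and `normalize_palochka` is a hypothetical name.

```python
import re

PALOCHKA = "\u04cf"          # Cyrillic small letter palochka
CAPITAL_PALOCHKA = "\u04c0"  # Cyrillic letter palochka

# Latin "I", Latin "l" or digit "1" directly next to a Cyrillic letter.
_LOOKALIKE = re.compile(r"(?<=[\u0400-\u04FF])[Il1]|[Il1](?=[\u0400-\u04FF])")


def normalize_palochka(text: str) -> str:
    """Replace palochka look-alikes with a single canonical code point."""
    # Assumption: the lowercase form is the canonical target.
    text = text.replace(CAPITAL_PALOCHKA, PALOCHKA)
    return _LOOKALIKE.sub(PALOCHKA, text)


typed = "лIы"  # Latin capital I typed instead of palochka
print(normalize_palochka(typed) == "л" + PALOCHKA + "ы")
print(normalize_palochka("Ivan 1"))  # no Cyrillic neighbour, left unchanged
```

The rule is deliberately narrow: it also rewrites a digit 1 glued to a Cyrillic word, so a production version would need extra context checks.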
TTS Service (tts_service.py):
- Silero TTS V5 CIS model integration
- Lazy model loading (loads only when needed)
- Automatic transliteration routing
- 2 speakers: ru_eduard (Russian), kbd_eduard (Kabardian/Kazakh)
- Output: 48kHz WAV audio
Transliterator (transliterator.py):
- 7 script mappings (Georgian, Armenian, Turkish, Azerbaijani, German, Spanish, Latvian)
- Context-aware rules: word boundaries, digraphs, phonetic context
- 600+ character mappings + 50+ special rules
- Phonetically optimized for TTS clarity
Data Flow
User Input
  ↓
Flask Web Server (app.py)
  ├─ Translation Request
  │    ↓
  │  Translation Service (Model Router, Preprocessor)
  │    ↓
  │  Opus-MT / M2M100 Models
  │    ↓
  │  [Translated Text]
  └─ TTS Request
       ↓
     TTS Service (Script Detect, Transliterator)
       ↓
     Transliterator (if needed)
       ↓
     Silero TTS Model
       ↓
     [Audio WAV]
Language Support Matrix
| Language | Code | Script | Translation | TTS | Transliteration |
|---|---|---|---|---|---|
| Kabardian | kbd_Cyrl | Cyrillic | Specialized | Direct | - |
| Russian | rus_Cyrl | Cyrillic | Specialized | Direct | - |
| Ukrainian | ukr_Cyrl | Cyrillic | M2M100 | Direct | - |
| Belarusian | bel_Cyrl | Cyrillic | M2M100 | Direct | - |
| Kazakh | kaz_Cyrl | Cyrillic | M2M100 | Direct | - |
| Georgian | kat_Geor | Georgian | M2M100 | Via Kbd | 38 mappings |
| Armenian | hye_Armn | Armenian | M2M100 | Via Hybrid | 45 mappings |
| Turkish | tur_Latn | Latin | M2M100 | Via Kaz | 28 mappings |
| Azerbaijani | azj_Latn | Latin | M2M100 | Via Kaz | 32 mappings |
| German | deu_Latn | Latin | M2M100 | Via Hybrid | 35 mappings + rules |
| Spanish | spa_Latn | Latin | M2M100 | Via Hybrid | 30 mappings + rules |
| Latvian | lav_Latn | Latin | M2M100 | Via Hybrid | 32 mappings + rules |
Total: 14 languages, 7 scripts, 600+ transliteration rules
Made with ❤️ for preserving and studying the Kabardian language
Version 1.0.3 - Practical efficiency for real-world use