Multilingual translator for Kabardian and Caucasian languages with neural translation and speech synthesis
🌐 Kabardian Translator
Multilingual Translation and Speech Synthesis for Caucasian Languages
Educational tool for Kabardian and Caucasian language learning with neural machine translation and speech synthesis.
Overview
Kabardian Translator is a specialized translation system focused on low-resource Caucasian languages. The system combines fine-tuned MarianMT models for Russian↔Kabardian translation with NLLB-200 for multilingual support, covering 200+ languages, and adds text-to-speech for the languages listed under TTS Support.
Key capabilities:
- Specialized Russian↔Kabardian translation models
- 200+ language support via NLLB-200
- Text-to-speech with automatic stress marking
- Transliteration support for multiple scripts
- Web-based interface with bilingual UI (Russian/English)
Version 2.0.0 Changes
Translation Models
Enhanced MarianMT models for Russian↔Kabardian:
- Custom fine-tuned from Helsinki-NLP OPUS-MT base
- Improved BLEU scores: 28.13 (KBD→RU), 18.65 (RU→KBD)
- Performance: 27.1 examples/sec (KBD→RU), 6.9 examples/sec (RU→KBD)
- Models available at:
  - kubataba/kbd-ru-opus
  - kubataba/ru-kbd-opus
NLLB-200 integration:
- Replaces M2M100 for broader language coverage
- 200+ languages with improved quality for low-resource pairs
- ~2.3GB model size (600M distilled version)
- Better handling of morphologically complex languages
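The fine-tuned Russian↔Kabardian checkpoints listed above can probably also be used outside the web application. The following is a minimal sketch, assuming the two repositories load with the standard `MarianMTModel` and `MarianTokenizer` classes from Hugging Face `transformers` and need no custom preprocessing. The `MODEL_IDS` mapping and the `translate` helper are illustrative names, not part of this package.

```python
# Repository names come from the model list above; the rest is illustrative.
MODEL_IDS = {
    ("kbd", "ru"): "kubataba/kbd-ru-opus",
    ("ru", "kbd"): "kubataba/ru-kbd-opus",
}


def translate(text: str, source: str = "kbd", target: str = "ru") -> str:
    """Translate one sentence with a fine-tuned MarianMT checkpoint (sketch)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    model_id = MODEL_IDS[(source, target)]
    tokenizer = MarianTokenizer.from_pretrained(model_id)
    model = MarianMTModel.from_pretrained(model_id)

    # Truncate long inputs to the 512-token context limit.
    batch = tokenizer([text], return_tensors="pt", truncation=True, max_length=512)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Calling `translate(...)` downloads the checkpoint on first use, so it needs network access and the `transformers` and `torch` packages.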
Speech Synthesis
Accentuation system:
- Automatic stress marking for Cyrillic languages
- Russian: Silero Stress (98% accuracy)
- Ukrainian/Belarusian: Silero Stress (95% accuracy)
- Kabardian/Kazakh/Bashkir/Kyrgyz: SimpleAccentor (85-90% accuracy)
- Transliteration-based accents for Georgian, Armenian, Turkish, Azerbaijani
New language support:
- Bashkir (bak_Cyrl) with TTS
- Kyrgyz (kir_Cyrl) with TTS
- Enhanced transliteration rules for non-Cyrillic scripts
Resource Optimization
- Total size: ~2.9GB (all models)
- Memory requirements: 4GB minimum, 8GB recommended
- Lazy loading for efficient resource usage
- Improved sentence chunking for long texts
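Lazy loading, as mentioned above, means a model is read into memory only when a request first needs it. The sketch below illustrates the general idea with stand-in loaders. `LazyModels` and the model names are hypothetical and do not describe the package's actual classes.

```python
from typing import Any, Callable, Dict, List


class LazyModels:
    """Load each model on first use and cache it afterwards."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders
        self._cache: Dict[str, Any] = {}

    def get(self, name: str) -> Any:
        if name not in self._cache:
            # First request for this model: run its loader exactly once.
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

    def loaded(self) -> List[str]:
        """Names of the models currently held in memory."""
        return sorted(self._cache)


# Stand-in loaders; a real loader would read model weights from disk.
models = LazyModels({
    "marian_kbd_ru": lambda: "kbd-ru model",
    "nllb": lambda: "nllb model",
})
```

With this pattern, a session that only translates Kabardian to Russian never pays the memory cost of the larger NLLB-200 model.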
System Architecture
Translation Pipeline
Direct translation (Russian ↔ Kabardian):
Input → MarianMT Model → Output
- Fine-tuned 80M parameter models
- Latency: 37-144ms per example
Cascade translation (Other languages ↔ Kabardian):
Source Language → NLLB-200 → Russian → MarianMT → Kabardian
- Two-step process using Russian as pivot
- Latency: 100-300ms depending on source language
Multilingual translation (between other language pairs):
Source Language → NLLB-200 → Target Language
- Direct translation between any of the 200+ supported languages
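The three routes above can be expressed as a small planning function. This is a sketch of the routing logic only: `plan_route` is a hypothetical name, and the Kabardian-to-other direction is assumed to mirror the cascade shown for the other-to-Kabardian direction.

```python
from typing import List, Tuple

KBD = "kbd_Cyrl"
RUS = "rus_Cyrl"

Step = Tuple[str, str, str]  # (model, source language, target language)


def plan_route(source: str, target: str) -> List[Step]:
    """Return the translation steps for one request."""
    if source == target:
        return []
    if {source, target} == {KBD, RUS}:
        # Direct translation with the fine-tuned MarianMT model.
        return [("marian", source, target)]
    if target == KBD:
        # Cascade into Kabardian, using Russian as the pivot.
        return [("nllb", source, RUS), ("marian", RUS, KBD)]
    if source == KBD:
        # Assumed mirror image of the cascade above.
        return [("marian", KBD, RUS), ("nllb", RUS, target)]
    # Any other pair goes straight through NLLB-200.
    return [("nllb", source, target)]
```

For example, `plan_route("eng_Latn", "kbd_Cyrl")` yields two steps, which is why cascade latency is higher than direct latency.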
Speech Synthesis Pipeline
Cyrillic languages with stress marking:
Text → Accentuation → Silero TTS → Audio (48kHz WAV)
Non-Cyrillic languages:
Text → Transliteration → Cyrillic → Accentuation → Silero TTS → Audio
Supported transliterations:
- Georgian → Kabardian Cyrillic (preserves ejectives)
- Armenian → Hybrid Cyrillic (Kazakh+Kabardian phonemes)
- Turkish/Azerbaijani → Kazakh Cyrillic
- German/Spanish/Latvian → Hybrid Cyrillic with custom rules
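To make the two speech pipelines concrete, the sketch below maps a language code to the stages its text would pass through. The function name `tts_stages` is hypothetical, and the language codes other than `rus_Cyrl`, `kbd_Cyrl`, `bak_Cyrl` and `kir_Cyrl` are assumed to follow the NLLB-200 naming style. Languages outside the TTS table are treated as unsupported here.

```python
from typing import List

# Cyrillic languages, grouped by the accentuation tool named in this README.
SILERO_STRESS = {"rus_Cyrl", "ukr_Cyrl", "bel_Cyrl"}
SIMPLE_ACCENTOR = {"kbd_Cyrl", "kaz_Cyrl", "bak_Cyrl", "kir_Cyrl"}

# Non-Cyrillic languages and the Cyrillic variant they are transliterated into.
# The codes on the left are assumptions in NLLB-200 style.
TRANSLIT_TARGET = {
    "kat_Geor": "Kabardian Cyrillic",
    "hye_Armn": "hybrid Cyrillic",
    "tur_Latn": "Kazakh Cyrillic",
    "azj_Latn": "Kazakh Cyrillic",
    "deu_Latn": "hybrid Cyrillic",
    "spa_Latn": "hybrid Cyrillic",
    "lvs_Latn": "hybrid Cyrillic",
}


def tts_stages(lang_code: str) -> List[str]:
    """List the processing stages between input text and audio."""
    if lang_code in SILERO_STRESS:
        return ["accentuation (Silero Stress)", "Silero TTS"]
    if lang_code in SIMPLE_ACCENTOR:
        return ["accentuation (SimpleAccentor)", "Silero TTS"]
    if lang_code in TRANSLIT_TARGET:
        target = TRANSLIT_TARGET[lang_code]
        return [f"transliteration to {target}", "accentuation", "Silero TTS"]
    raise ValueError(f"no TTS pipeline for {lang_code}")
```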
Installation
System Requirements
- Python: 3.11 or higher (required for Silero Stress)
- RAM: 4GB minimum, 8GB recommended
- Storage: ~2.9GB for all models
- OS: Windows 10/11, macOS, Linux
Installation on Windows
1. Install Python 3.11+
Download from python.org:
- Download Python 3.11.9 or newer: https://www.python.org/downloads/
- Run the installer and check "Add Python to PATH"
- Restart the command prompt after installation
2. Verify installation
Open Command Prompt (CMD):
python --version
Should show Python 3.11.x or higher
3. Install package
# Create virtual environment (recommended)
python -m venv venv
venv\Scripts\activate.bat
# Install from PyPI
pip install kabardian-translator
# Download AI models (~2.9GB)
kabardian-download-models
# Launch application
kabardian-translator
Open a browser and navigate to http://localhost:5500
4. If the pip command is not found
Use python -m pip instead:
python -m pip install kabardian-translator
Installation on macOS/Linux
# Create virtual environment
python3.11 -m venv venv
source venv/bin/activate
# Install package
pip install kabardian-translator
# Download models
kabardian-download-models
# Launch
kabardian-translator
Installation from GitHub
git clone https://github.com/kubataba/kabardian-translator.git
cd kabardian-translator
# Windows
python -m venv venv
venv\Scripts\activate.bat
# macOS/Linux
python3.11 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
pip install -e .
# Download models
kabardian-download-models
# Launch
python -m kabardian_translator.cli
Model Installation Options
Full installation (all features, ~2.9GB):
kabardian-download-models --full
Minimal installation (Kabardian↔Russian only, ~600MB):
kabardian-download-models --minimal
Base NLLB-200 only (~2.3GB):
kabardian-download-models --base-only
CLI Usage
# Start web server
kabardian-translator
# Custom port
kabardian-translator --port 8080
# Localhost only
kabardian-translator --host localhost --port 5500
# CPU-only mode (disable GPU/MPS)
kabardian-translator --cpu-only
# Translation from command line
kabardian-translate --text "Hello" --source eng_Latn --target rus_Cyrl
# Help
kabardian-translator --help
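The command-line translator can also be driven from a script. The sketch below wraps the `kabardian-translate` command shown above using only the flags that appear in this README. It assumes the translation is printed to standard output, which is not confirmed here. `build_command` and `translate_cli` are hypothetical helper names.

```python
import subprocess
from typing import List


def build_command(text: str, source: str, target: str) -> List[str]:
    """Assemble the argument list for one kabardian-translate call."""
    return [
        "kabardian-translate",
        "--text", text,
        "--source", source,
        "--target", target,
    ]


def translate_cli(text: str, source: str, target: str) -> str:
    """Run the CLI and return whatever it prints (assumed to be the translation)."""
    result = subprocess.run(
        build_command(text, source, target),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems with non-ASCII text.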
Language Support
Translation Quality (BLEU scores)
| Language Pair | BLEU | Model | Throughput |
|---|---|---|---|
| Kabardian → Russian | 28.13 | MarianMT | 27.1 ex/sec |
| Russian → Kabardian | 18.65 | MarianMT | 6.9 ex/sec |
| Other low-resource | 10-15 | NLLB-200 | 5-10 ex/sec |
| Mid-resource languages | 20-25 | NLLB-200 | 5-10 ex/sec |
| High-resource languages | 30+ | NLLB-200 | 5-10 ex/sec |
TTS Support
| Language | Script | TTS Method | Stress Accuracy |
|---|---|---|---|
| Russian | Cyrillic | Direct + Silero Stress | 98% |
| Ukrainian | Cyrillic | Direct + Silero Stress | 95% |
| Belarusian | Cyrillic | Direct + Silero Stress | 95% |
| Kabardian | Cyrillic | Direct + SimpleAccentor | 90% |
| Kazakh | Cyrillic | Direct + SimpleAccentor | 90% |
| Bashkir | Cyrillic | Direct + SimpleAccentor | 88% |
| Kyrgyz | Cyrillic | Direct + SimpleAccentor | 87% |
| Georgian | Georgian | Transliteration + Accents | 85% |
| Armenian | Armenian | Transliteration + Accents | 85% |
| Turkish | Latin | Transliteration + Accents | 80% |
| Azerbaijani | Latin | Transliteration + Accents | 80% |
| German | Latin | Transliteration + Accents | 75% |
| Spanish | Latin | Transliteration + Accents | 76% |
| Latvian | Latin | Transliteration + Accents | 75% |
Supported Languages
Cyrillic script: Russian, Ukrainian, Belarusian, Kabardian, Kazakh, Bashkir, Kyrgyz
Other scripts: Georgian, Armenian, Turkish, Azerbaijani, English, German, French, Spanish, Latvian
Additional languages: 185+ languages via NLLB-200 (translation only, no TTS)
Full language codes: see NLLB-200 documentation
Performance Benchmarks
Hardware Requirements
4GB RAM system:
- Startup: 3-5 seconds
- Translation: 50-200ms per sentence
- Memory usage: ~2GB peak
- Minimal installation recommended
8GB RAM system:
- Startup: 5-8 seconds
- Translation: 50-150ms per sentence
- Memory usage: ~3GB peak
- Full installation supported
16GB RAM with GPU/MPS:
- Startup: 8-10 seconds
- Translation: 20-100ms per sentence
- Memory usage: ~4GB peak
- Hardware acceleration enabled
Translation Performance
Based on the official test set (500 examples):
Kabardian → Russian:
- BLEU: 28.13, chrF: 50.07, TER: 63.50
- Speed: 27.1 examples/second
- Latency: 37ms per example
Russian → Kabardian:
- BLEU: 18.65, chrF: 52.66, TER: 67.57
- Speed: 6.9 examples/second
- Latency: 144ms per example
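The latency figures above are consistent with the throughput figures: per-example latency is roughly the reciprocal of examples per second. A short check, assuming examples are processed one at a time:

```python
def latency_ms(examples_per_second: float) -> float:
    """Convert throughput into average per-example latency in milliseconds."""
    return 1000.0 / examples_per_second


for pair, speed in [("KBD->RU", 27.1), ("RU->KBD", 6.9)]:
    print(f"{pair}: {latency_ms(speed):.1f} ms per example")
```

The results land within about one millisecond of the quoted 37ms and 144ms.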
Known Limitations
Translation
- Complex morphology: Kabardian's polysynthetic structure remains challenging
- Context length: Limited to 512 tokens per translation
- Technical vocabulary: Limited coverage for modern/specialized terms
- Dialect variations: Standard dialects only
Speech Synthesis
- Character limit: 200 characters per synthesis request
- Stress accuracy: Not perfect for all words, especially rare forms
- Transliteration quality: Some non-Cyrillic pronunciations may sound unnatural
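Longer texts can still be synthesized by splitting them into pieces that respect the 200-character limit. The following is a minimal sketch of sentence-based chunking. It is not the package's own chunker: `chunk_text` is a hypothetical helper, and the sentence boundary rule is a simple punctuation heuristic.

```python
import re
from typing import List


def chunk_text(text: str, limit: int = 200) -> List[str]:
    """Pack whole sentences into chunks of at most `limit` characters."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as a separate synthesis request. Note that the fallback for over-long sentences may cut in the middle of a word.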
The Kabardian Challenge
Kabardian presents unique difficulties:
- 50+ consonant phonemes including ejectives
- Complex polysynthetic morphology
- Rich affixation system
- Limited parallel training data
Current models show significant improvement but are not suitable for production translation services.
Troubleshooting
Windows-Specific Issues
Python not found:
# Try alternative commands
py --version
python3 --version
# Or use full path
C:\Users\YourName\AppData\Local\Programs\Python\Python311\python.exe --version
pip not found:
# Use module form
python -m pip install kabardian-translator
SSL certificate errors:
# Windows may need certificates
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org kabardian-translator
General Issues
Low memory systems:
# Use minimal installation
kabardian-download-models --minimal
# Force CPU mode
kabardian-translator --cpu-only
# Set batch size (Windows CMD; on macOS/Linux: export KBD_TRANSLATE_BATCH_SIZE=1)
set KBD_TRANSLATE_BATCH_SIZE=1
kabardian-translator
Model download failures:
# Clear cache and retry (Windows CMD; on macOS/Linux: rm -rf ~/.cache/huggingface)
rmdir /s %USERPROFILE%\.cache\huggingface
kabardian-download-models
# Manual download
python -c "from kabardian_translator import ensure_models_downloaded; ensure_models_downloaded()"
Port already in use:
# Use different port
kabardian-translator --port 8080
Browser security warnings:
- This is normal for localhost servers.
- Click "Advanced" → "Proceed to localhost".
- Or type thisisunsafe on the warning page (Chrome)
System Check
# Verify installation
python -c "from kabardian_translator import check_models; check_models()"
# Check PyTorch
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
# Test translation
kabardian-translate --text "Hello" --source eng_Latn --target rus_Cyrl
Technical Documentation
Models
| Component | Size | Parameters | Source |
|---|---|---|---|
| MarianMT KBD→RU | 300MB | 80M | kubataba/kbd-ru-opus |
| MarianMT RU→KBD | 300MB | 80M | kubataba/ru-kbd-opus |
| NLLB-200 | 2.3GB | 600M | facebook/nllb-200-distilled-600M |
| Silero TTS | 50MB | - | [snakers4/silero-models](https://github.com/snakers4/silero-models) |
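The component sizes in the table add up to the totals quoted for the download options. The sketch below makes the arithmetic explicit. Which components each option actually downloads is an assumption inferred from the quoted sizes, and the dictionary keys are illustrative.

```python
# Sizes in MB, taken from the table above.
SIZES_MB = {
    "marian_kbd_ru": 300,
    "marian_ru_kbd": 300,
    "nllb_200": 2300,
    "silero_tts": 50,
}

# Assumed contents of each kabardian-download-models option.
INSTALL_OPTIONS = {
    "--full": ["marian_kbd_ru", "marian_ru_kbd", "nllb_200", "silero_tts"],
    "--minimal": ["marian_kbd_ru", "marian_ru_kbd"],
    "--base-only": ["nllb_200"],
}


def install_size_mb(option: str) -> int:
    """Sum the sizes of the components assumed to belong to an option."""
    return sum(SIZES_MB[name] for name in INSTALL_OPTIONS[option])


for option in INSTALL_OPTIONS:
    print(f"{option}: ~{install_size_mb(option)} MB")
```

The full set comes to 2950 MB, in line with the "~2.9GB" figure, and the two MarianMT models alone give the 600 MB quoted for the minimal installation.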
API Endpoints
POST /translate
{
  "text": "string",
  "source_lang": "kbd_Cyrl",
  "target_lang": "rus_Cyrl"
}
POST /synthesize
{
  "text": "string",
  "lang_code": "rus_Cyrl",
  "speaker": "ru_eduard"
}
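A client for these endpoints might look like the sketch below, which uses only the Python standard library. It assumes the server runs at the default address http://localhost:5500 and accepts the JSON bodies shown above. The response format is not documented here, so `post` returns the raw bytes. Both helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5500"  # default address from the installation steps


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request for one of the API endpoints."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}{endpoint}",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def post(endpoint: str, payload: dict) -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(build_request(endpoint, payload), timeout=30) as response:
        return response.read()


# Building a request does not contact the server; post() does.
translate_request = build_request(
    "/translate",
    {"text": "Hello", "source_lang": "eng_Latn", "target_lang": "rus_Cyrl"},
)
```

Encoding the body as UTF-8 with `ensure_ascii=False` keeps Cyrillic text readable in server logs.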
Configuration
Environment variables:
KBD_TRANSLATE_BATCH_SIZE=8 # Batch size for translation
KBD_MODELS_PATH=./models # Custom model directory
KBD_FORCE_CPU=1 # Force CPU mode
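The sketch below shows one way such variables could be read in Python. It is illustrative only: `Settings` and `load_settings` are hypothetical names, and the fallback values are assumptions taken from the example values above, not documented defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Settings:
    batch_size: int
    models_path: str
    force_cpu: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the KBD_* variables, falling back to assumed defaults."""
    return Settings(
        batch_size=int(env.get("KBD_TRANSLATE_BATCH_SIZE", "8")),
        models_path=env.get("KBD_MODELS_PATH", "./models"),
        force_cpu=env.get("KBD_FORCE_CPU", "0") == "1",
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to test, for example `load_settings({"KBD_FORCE_CPU": "1"})`.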
Use Cases
- Language learning: Study Kabardian and related languages
- Academic research: Low-resource NLP experiments
- Field linguistics: Language documentation tools
- Community use: Accessible for native speakers
- Comparative linguistics: Multi-language analysis
License
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Permitted:
- Personal and educational use
- Research purposes
- Modifications and derivatives (with attribution)
- Non-commercial distribution
Prohibited:
- Commercial use
- Integration into paid products or services
- Profit-driven applications
Full license: https://creativecommons.org/licenses/by-nc/4.0/
Credits
Development:
- anzorq - Circassian-Russian parallel corpus and benchmarks
- Helsinki-NLP - OPUS-MT framework
- Silero Team - TTS models and stress marking
- Hugging Face - Infrastructure and model hosting
Models:
- NLLB-200 multilingual translation model
- MarianMT framework for specialized models
- Silero TTS for speech synthesis
Community:
- Kabardian language speakers for testing and feedback
- Contributors for code and documentation
Development
Project Structure
kabardian-translator/
├── kabardian_translator/ # Python package
│ ├── __init__.py
│ ├── app.py # Main application
│ ├── translation_service.py
│ ├── tts_service.py
│ ├── transliterator.py
│ ├── download_models.py # Model downloader
│ ├── tokenizer_manager.py
│ └── cli.py # CLI entry point
├── setup.py # Package configuration
├── requirements.txt # Dependencies
└── models/ # AI models (created after download)
Contributing
- Fork the repository
- Create a feature branch
- Make changes with tests
- Submit pull request with description
Issues and suggestions: GitHub Issues
Resources
- PyPI Package
- GitHub Repository
- Fine-tuned Models - KBD↔RU MarianMT models
- Training Corpus
- NLLB-200 Model
- MarianMT Documentation
- Silero TTS
Changelog
Version 2.0.0
- Enhanced MarianMT models with improved BLEU scores
- NLLB-200 integration (200+ languages)
- Full accentuation system for TTS
- Added Bashkir and Kyrgyz support
- Optimized resource usage (~2.9GB total)
- Improved transliteration rules
Version 1.0.0
- Initial M2M100-based implementation
- Basic TTS support
- 14 language pairs
Version 2.0.0 | Educational tool for Kabardian language preservation