Russian stress accent prediction using Transformer model
Russian Stress Accent Predictor (Accentor) - ruaccent-predictor
Automatic stress accent placement in Russian text using a character-level Transformer model. Available on PyPI as ruaccent-predictor.
📋 Description
This project is a deep learning model for automatic stress accent placement in Russian text. The model is trained on a dataset of over 224,000 sentence pairs from literary works and achieves 99.7% accuracy on the validation dataset.
Key Features
- ✅ 99.7% accuracy on validation dataset
- 🚀 Two output formats: apostrophe (я') and synthesis (+я)
- ⚡ Batch processing support for speed optimization
- 💾 Built-in result caching
- 🔧 Support for CPU, CUDA, and Apple MPS (Metal)
- 📦 Easy pip installation:
pip install ruaccent-predictor
Technical Details
Character-Level Model: The model operates at the character level with an automatically extracted vocabulary of 224 characters from the training dataset. This approach allows for high accuracy while maintaining a compact model size (~12.5M parameters).
Vocabulary: Automatically extracted from the training corpus, includes:
- Cyrillic letters (uppercase and lowercase)
- Basic punctuation
- Latin letters
- Special tokens
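As a rough illustration of how a character vocabulary like this can be extracted from a corpus, here is a minimal sketch. The special token names (`<pad>`, `<sos>`, `<eos>`, `<unk>`) and the helper names are assumptions made for the example; the actual contents of vocab.json may differ.

```python
# Hypothetical special tokens; the names used in the real vocab.json are not documented here.
SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def build_char_vocab(sentences):
    """Map every character seen in the corpus to an integer id."""
    characters = {ch for sentence in sentences for ch in sentence}
    # Special tokens take the first ids; characters follow in sorted order
    # so the mapping is reproducible between runs.
    symbols = SPECIAL_TOKENS + sorted(characters)
    return {symbol: index for index, symbol in enumerate(symbols)}


def encode(text, vocab):
    """Turn text into ids, falling back to <unk> for unseen characters."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]


vocab = build_char_vocab(["В лесу родилась ёлочка", "мама мыла раму"])
ids = encode("мама", vocab)
```

Because the vocabulary is built from the training corpus alone, any character that never appeared there can only be represented by the fallback token.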
Output Formats
Apostrophe format: An apostrophe is placed after the stressed vowel.
Example: В лесу' родила'сь ёлочка (suited to reading with stress marks while learning)
Synthesis format: A plus sign is placed before the stressed vowel.
Example: В лес+у родил+ась ёлочка (suited to speech synthesis, e.g., Silero TTS)
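The two formats carry the same information, so one can be derived from the other with plain string handling. The sketch below is an illustration of that relationship, not the package's own code; the function names are hypothetical.

```python
VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")


def apostrophe_to_synthesis(text):
    """Move each stress mark from after the vowel (у') to before it (+у)."""
    out = []
    for ch in text:
        if ch == "'" and out and out[-1] in VOWELS:
            vowel = out.pop()
            out.append("+" + vowel)
        else:
            out.append(ch)  # apostrophes that do not follow a vowel are kept
    return "".join(out)


def synthesis_to_apostrophe(text):
    """Move each stress mark from before the vowel (+у) to after it (у')."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "+" and i + 1 < len(text) and text[i + 1] in VOWELS:
            out.append(text[i + 1] + "'")
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(apostrophe_to_synthesis("В лесу' родила'сь ёлочка"))  # В лес+у родил+ась ёлочка
```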
⚠️ Model Limitations
The model has the following known limitations:
- Does not restore the missing letter "ё": The model works with the input text as-is and does not replace "е" with "ё"
- Does not mark stress on "ё": Since "ё" is always stressed in Russian, the model does not place additional stress marks on it
- Single-vowel words: Words with only one vowel are not marked as they are inherently stressed
- No grammatical analysis: The model operates purely on character sequences without morphological or syntactic analysis
- Training data limitations: Accuracy may vary for texts outside the literary domain of the training data
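The "ё" and single-vowel rules above can be expressed as a small check that tells you whether a given word should be expected to carry a stress mark in the output. This is a sketch of the rules as described, useful for sanity-checking results; it is not how the model itself decides, and the helper names are hypothetical.

```python
VOWELS = set("аеёиоуыэюя")


def count_vowels(word):
    """Count Russian vowel letters in a word, ignoring case."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


def expects_stress_mark(word):
    """Return True if, under the rules listed above, the word should get a mark."""
    lowered = word.lower()
    if "ё" in lowered:
        return False  # "ё" already carries the stress
    return count_vowels(lowered) > 1  # single-vowel words are left unmarked


for word in ["привет", "мир", "ёлочка", "В"]:
    print(word, expects_stress_mark(word))
```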
📦 PyPI Installation
The package is available on PyPI as ruaccent-predictor:
pip install ruaccent-predictor
Usage as Python Package
from ruaccent import load_accentor
# Load the model
accentor = load_accentor()
# Predict stress accents
text = "привет мир"
result = accentor(text)
print(result) # приве'т мир
Usage as CLI Tool
After installation, use the ruaccent command:
# Process single text
ruaccent "привет как дела"
# Process file
ruaccent --input-file input.txt --output-file output.txt
# Synthesis format
ruaccent "привет" --format synthesis
# Both formats
ruaccent "текст" --format both
# Pipe input
echo "мама мыла раму" | ruaccent
Available Options:
- --format: Output format (apostrophe, synthesis, both)
- --batch-size: Batch size for processing (default: 8)
- --device: Device for inference (auto, cpu, cuda, mps)
- --input-file, -i: Input text file
- --output-file, -o: Output file
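To make the option set concrete, here is a sketch of how these flags could be declared with Python's argparse. It is a reconstruction from the list above, not the contents of the package's cli.py. The defaults for --format and --device are assumptions, since only the --batch-size default is documented.

```python
import argparse


def build_parser():
    """Declare the documented options; defaults marked below are assumptions."""
    parser = argparse.ArgumentParser(prog="ruaccent")
    parser.add_argument("text", nargs="?",
                        help="Text to process; omit to use a file or piped input")
    parser.add_argument("--format", choices=["apostrophe", "synthesis", "both"],
                        default="apostrophe")  # assumed default
    parser.add_argument("--batch-size", type=int, default=8)  # documented default
    parser.add_argument("--device", choices=["auto", "cpu", "cuda", "mps"],
                        default="auto")  # assumed default
    parser.add_argument("--input-file", "-i", help="Input text file")
    parser.add_argument("--output-file", "-o", help="Output file")
    return parser


args = build_parser().parse_args(["привет", "--format", "synthesis"])
print(args.format, args.batch_size)  # synthesis 8
```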
🎯 Performance
Benchmarks
- Accuracy: 99.7% on validation set (22,000 sentences)
- Speed: ~10 sentences/sec with batch_size=8 on Mac mini M4
- Model size: ~12.5M parameters
- Vocabulary: 224 characters (Cyrillic, punctuation, Latin)
Optimal Settings
# For maximum performance
accentor = load_accentor()
results = accentor(texts, batch_size=8, format='apostrophe')
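The batch_size argument controls how many texts are processed together. For very large inputs it can also help to feed the accentor in chunks so that results can be written out incrementally. The helper below is a hypothetical sketch of such chunking in plain Python; it does not call the package and makes no claim about how batching is done internally.

```python
def iter_batches(texts, batch_size=8):
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = [f"предложение {i}" for i in range(20)]
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=8)]
print(batch_lengths)  # [8, 8, 4]
```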
📁 Project Structure
Russian-Stress-Accent-Predictor/
├── ruaccent/ # Main package (PyPI)
│ ├── __init__.py
│ ├── accentor.py # Main module with model
│ └── cli.py # CLI interface
├── model/ # Trained model
│ ├── README.md # Model documentation
│ ├── acc_model.pt # Model weights (30MB, Git LFS)
│ └── vocab.json # Character vocabulary
├── data/ # Datasets
│ ├── train.csv # Training set (115MB, Git LFS)
│ └── val.csv # Validation set (13MB)
├── examples/ # Usage examples
│ ├── basic_usage.py # Basic examples
│ └── batch_processing.py # Batch processing and tests
├── train_scripts/ # Model training scripts
│ ├── model.py # Transformer architecture
│ ├── prepare_data.py # Data preparation
│ ├── train_model.py # Model training
│ └── README.md # Training instructions
├── .gitattributes # Git LFS configuration
├── .gitignore # Ignored files
├── LICENSE # MIT license
├── MANIFEST.in # Included files for PyPI
├── pyproject.toml # Package configuration
├── README.md # This documentation
├── requirements.txt # Python dependencies
├── setup.py # Package setup
└── run_training.sh # Training launch script
🧪 Usage Examples
Basic Example (examples/basic_usage.py)
from ruaccent import load_accentor
accentor = load_accentor()
texts = ["привет мир", "мама мыла раму", "солнце светит ярко"]
# Apostrophe format
results = accentor(texts, format='apostrophe')
for original, accented in zip(texts, results):
    print(f"{original} → {accented}")
Batch Processing and Tests (examples/batch_processing.py)
python examples/batch_processing.py
This script measures performance with different batch sizes and reports cache statistics and the best-performing settings.
🏗️ Training Scripts
The train_scripts/ folder contains the scripts that developers and researchers need to retrain the model:
- model.py: Transformer architecture definition
- prepare_data.py: Data preprocessing and preparation
- train_model.py: Main training script
Training from Scratch
# Install dependencies
pip install torch pandas tqdm
# Start training
cd train_scripts
python train_model.py
Note: Training requires significant resources (GPU recommended) and takes several hours.
🔤 Output Formats
1. Apostrophe Format (я')
An apostrophe is placed after the stressed vowel:
- Input: привет
- Output: приве'т
- Use case: Text display, reading
2. Synthesis Format (+я)
A plus sign is placed before the stressed vowel:
- Input: привет
- Output: прив+ет
- Use case: Speech synthesis, TTS systems
🚀 Quick Start
After pip installation:
# Verify installation
ruaccent "тестовая фраза"
# Run examples
python examples/basic_usage.py
From Source Code:
# Clone repository
git clone https://github.com/kubataba/Russian-Stress-Accent-Predictor.git
cd Russian-Stress-Accent-Predictor
# Install in development mode
pip install -e .
# Use as usual
ruaccent "ваш текст"
📊 Performance and Caching
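The accentor caches its results, so a text that has already been processed is returned without running the model again:
- Cache hits: effectively instant (~0.0000s per text)
- Cache misses: ~0.5s for the first call
- Optimal batch size: 8 (~10 sentences/sec on MPS)
- Cache size: Up to 10,000 items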
# View cache statistics
cache_info = accentor.cache_info()
print(f"Cache hits: {cache_info['hits']}, misses: {cache_info['misses']}")
# Clear cache
accentor.clear_cache()
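For readers curious how a bounded result cache with hit and miss counters can work, here is a minimal sketch. The class name, the get_or_compute method, and the least-recently-used eviction policy are assumptions made for illustration. The package documents only the 10,000-item limit and the cache_info() and clear_cache() calls shown above.

```python
from collections import OrderedDict


class ResultCache:
    """Bounded cache with hit/miss counters (LRU eviction is an assumption)."""

    def __init__(self, max_items=10_000):
        self.max_items = max_items
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, text, compute):
        """Return the cached result for text, computing and storing it on a miss."""
        if text in self._store:
            self.hits += 1
            self._store.move_to_end(text)  # mark as most recently used
            return self._store[text]
        self.misses += 1
        result = compute(text)
        self._store[text] = result
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict the least recently used entry
        return result

    def cache_info(self):
        return {"hits": self.hits, "misses": self.misses, "size": len(self._store)}

    def clear_cache(self):
        self._store.clear()
        self.hits = 0
        self.misses = 0


cache = ResultCache(max_items=2)
fake_accentor = str.upper  # stand-in for the expensive model call
cache.get_or_compute("привет", fake_accentor)
cache.get_or_compute("привет", fake_accentor)
info = cache.cache_info()
print(info)  # {'hits': 1, 'misses': 1, 'size': 1}
```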
🤝 Contributing
Contributions are welcome!
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
📄 License
The project is distributed under the MIT license. See the LICENSE file for details.
The dataset is also distributed under the MIT license:
- Source: nevmenandr/accentual-syllabic-verse-in-russian-prose
- License: MIT
🙏 Acknowledgments
- Dataset provided by nevmenandr
- Project uses the Transformer architecture from PyTorch
- Inspired by natural language processing tasks for the Russian language
🔗 Useful Links
- PyPI package: ruaccent-predictor
- Repository: https://github.com/kubataba/Russian-Stress-Accent-Predictor
- Dataset: https://huggingface.co/datasets/nevmenandr/accentual-syllabic-verse-in-russian-prose
- PyTorch Documentation: https://pytorch.org/docs/stable/index.html
Package Version: 1.1.0
Package Name: ruaccent-predictor
Last Updated: February 2026
File details
Details for the file ruaccent_predictor-1.1.0.tar.gz.
File metadata
- Download URL: ruaccent_predictor-1.1.0.tar.gz
- Upload date:
- Size: 28.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8857c97d1f75652be44b7e0319d5a6410a57a38120420ff9c3c8bfa2de4ea344 |
| MD5 | 8407607f27004fc0668a593d27c413cc |
| BLAKE2b-256 | 14fed037d3a8e02e9135935ff52623d87479daf33dcfa26325f7742ca356499d |
File details
Details for the file ruaccent_predictor-1.1.0-py3-none-any.whl.
File metadata
- Download URL: ruaccent_predictor-1.1.0-py3-none-any.whl
- Upload date:
- Size: 14.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | debf3c71b0ed6fc1ee6e249c6adfbcef6c1b44133dfe52434beb0e0573a69681 |
| MD5 | c6609c11031f708716c18a931bb2733f |
| BLAKE2b-256 | 1028f3c91ec512494b84b6736179f7674617044c037ad6c37eadb9a6b3c69c42 |