
Russian stress accent prediction using Transformer model


Russian Stress Accent Predictor (Accentor) - ruaccent-predictor

License: MIT Python 3.8+ PyTorch

Automatic stress accent placement in Russian text using a character-level Transformer model. Available on PyPI as ruaccent-predictor.

📋 Description

This project is a deep learning model for automatic stress accent placement in Russian text. The model is trained on a dataset of over 224,000 sentence pairs from literary works and achieves 99.7% accuracy on the validation dataset.

Key Features

  • ✅ 99.7% accuracy on validation dataset
  • 🚀 Two output formats: apostrophe (я') and synthesis (+я)
  • ⚡ Batch processing support for speed optimization
  • 💾 Built-in result caching
  • 🔧 Support for CPU, CUDA, and Apple MPS (Metal)
  • 📦 Easy pip installation: pip install ruaccent-predictor

Technical Details

Character-Level Model: The model operates at the character level, using a vocabulary of 224 characters extracted automatically from the training dataset. This keeps the model compact (~12.5M parameters) while preserving high accuracy.

Vocabulary: Extracted automatically from the training corpus, it includes:

  • Cyrillic letters (uppercase and lowercase)
  • Basic punctuation
  • Latin letters
  • Special tokens
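
As an illustration of the idea, a character-level vocabulary can be built by collecting every distinct character in the corpus and prepending a few special tokens. This is a minimal sketch, not the project's actual preprocessing: the special token names and the helper functions `build_vocab`, `encode`, and `decode` are hypothetical, and the real contents of vocab.json may differ.

```python
from typing import Dict, Iterable, List

# Hypothetical special tokens; the names used in the real vocab.json may differ.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<bos>", "<eos>"]


def build_vocab(sentences: Iterable[str]) -> Dict[str, int]:
    """Map the special tokens and every distinct character to an integer id."""
    chars = sorted({ch for sentence in sentences for ch in sentence})
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + chars)}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Convert text to ids; characters not in the vocabulary become <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]


def decode(ids: Iterable[int], vocab: Dict[str, int]) -> str:
    """Convert ids back to text using the inverted vocabulary."""
    inverse = {idx: token for token, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)


# Tiny stand-in corpus; the real vocabulary is extracted from the training set.
vocab = build_vocab(["привет мир", "мама мыла раму"])
ids = encode("мама", vocab)
print(decode(ids, vocab))
```

Because the vocabulary is a closed set of characters, any symbol that never appeared in the training corpus falls back to the unknown token in this sketch.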

Output Formats

Apostrophe format: An apostrophe is placed after the stressed vowel.
Example: В лесу' родила'сь ёлочка (suited to reading and language learning)

Synthesis format: A plus sign is placed before the stressed vowel.
Example: В лес+у родил+ась ёлочка (suited to speech synthesis, e.g., Silero TTS)
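
Since the two formats differ only in the marker character and its position relative to the stressed vowel, one can be derived from the other with a regular expression. The sketch below is a standalone illustration; the function names are hypothetical and are not part of the ruaccent package.

```python
import re

# Russian vowels in both cases; a stress marker always attaches to one of these.
VOWELS = "аеёиоуыэюяАЕЁИОУЫЭЮЯ"


def apostrophe_to_synthesis(text: str) -> str:
    """Rewrite each vowel followed by an apostrophe as a plus sign before the vowel."""
    return re.sub(rf"([{VOWELS}])'", r"+\1", text)


def synthesis_to_apostrophe(text: str) -> str:
    """Rewrite each plus sign before a vowel as an apostrophe after the vowel."""
    return re.sub(rf"\+([{VOWELS}])", r"\1'", text)


print(apostrophe_to_synthesis("В лесу' родила'сь ёлочка"))
```

Apostrophes and plus signs that are not adjacent to a vowel are left untouched, so ordinary punctuation survives the conversion.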

⚠️ Model Limitations

The model has the following known limitations:

  1. Does not restore missing letter "ё": The model works with the input text as-is and does not replace "е" with "ё"
  2. Does not mark stress on "ё": Since "ё" is almost always stressed in Russian, the model does not place additional stress marks on it
  3. Single-vowel words: Words with only one vowel are not marked, since the stress position is unambiguous
  4. No grammatical analysis: The model operates purely on character sequences without morphological or syntactic analysis
  5. Training data limitations: Accuracy may vary for texts outside the literary domain of the training data
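
Limitations 2 and 3 imply a simple rule of thumb for which words can be expected to carry a mark in the output. The helper below sketches that rule for checking results; it is a hypothetical function written for this illustration, not code from the package, and it assumes that a word containing "ё" receives no mark at all.

```python
VOWELS = set("аеёиоуыэюя")


def expects_stress_mark(word: str) -> bool:
    """Return True if, per the stated limitations, the word should receive a mark."""
    lower = word.lower()
    if "ё" in lower:
        # Limitation 2: "ё" is treated as already stressed (assumption: no other mark is added).
        return False
    # Limitation 3: words with fewer than two vowels are left unmarked.
    return sum(ch in VOWELS for ch in lower) >= 2


for word in ["привет", "мир", "ёлочка"]:
    print(word, expects_stress_mark(word))
```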

📦 PyPI Installation

The package is available on PyPI as ruaccent-predictor:

pip install ruaccent-predictor

Usage as Python Package

from ruaccent import load_accentor

# Load the model
accentor = load_accentor()

# Predict stress accents
text = "привет мир"
result = accentor(text)
print(result)  # приве'т мир

Usage as CLI Tool

After installation, use the ruaccent command:

# Process single text
ruaccent "привет как дела"

# Process file
ruaccent --input-file input.txt --output-file output.txt

# Synthesis format
ruaccent "привет" --format synthesis

# Both formats
ruaccent "текст" --format both

# Pipe input
echo "мама мыла раму" | ruaccent 

Available Options:

  • --format: Output format (apostrophe, synthesis, both)
  • --batch-size: Batch size for processing (default: 8)
  • --device: Device for inference (auto, cpu, cuda, mps)
  • --input-file, -i: Input text file
  • --output-file, -o: Output file

🎯 Performance

Benchmarks

  • Accuracy: 99.7% on validation set (22,000 sentences)
  • Speed: ~10 sentences/sec with batch_size=8 on Mac Mini M4
  • Model size: ~12.5M parameters
  • Vocabulary: 224 characters (Cyrillic, punctuation, Latin)

Optimal Settings

# For maximum performance, pass a list of texts and process it in batches of 8
accentor = load_accentor()
texts = ["привет мир", "мама мыла раму"]
results = accentor(texts, batch_size=8, format='apostrophe')
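
For inputs too large to hold in one list, the same call can be applied chunk by chunk. The generator below is a sketch under the assumption that the accentor accepts a list of texts together with the batch_size and format keyword arguments shown above and returns a list of the same length; `accent_in_chunks` and `chunk_size` are hypothetical names introduced here.

```python
from typing import Callable, Iterable, Iterator, List


def accent_in_chunks(
    accentor: Callable[..., List[str]],
    lines: Iterable[str],
    chunk_size: int = 256,
    batch_size: int = 8,
    fmt: str = "apostrophe",
) -> Iterator[str]:
    """Feed lines to the accentor in fixed-size chunks and yield results in order."""
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield from accentor(chunk, batch_size=batch_size, format=fmt)
            chunk = []
    if chunk:
        # Flush the final, possibly shorter, chunk.
        yield from accentor(chunk, batch_size=batch_size, format=fmt)


# Intended usage (requires the package to be installed):
#   accentor = load_accentor()
#   with open("input.txt", encoding="utf-8") as f:
#       for accented in accent_in_chunks(accentor, (line.rstrip("\n") for line in f)):
#           print(accented)
```

Because the function is a generator, only one chunk of input is held in memory at a time.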

📁 Project Structure

Russian-Stress-Accent-Predictor/
├── ruaccent/                    # Main package (PyPI)
│   ├── __init__.py
│   ├── accentor.py             # Main module with model
│   └── cli.py                  # CLI interface
├── model/                      # Trained model
│   ├── README.md              # Model documentation
│   ├── acc_model.pt           # Model weights (30MB, Git LFS)
│   └── vocab.json             # Character vocabulary
├── data/                       # Datasets
│   ├── train.csv              # Training set (115MB, Git LFS)
│   └── val.csv                # Validation set (13MB)
├── examples/                   # Usage examples
│   ├── basic_usage.py         # Basic examples
│   └── batch_processing.py    # Batch processing and tests
├── train_scripts/              # Model training scripts
│   ├── model.py               # Transformer architecture
│   ├── prepare_data.py        # Data preparation
│   ├── train_model.py         # Model training
│   └── README.md              # Training instructions
├── .gitattributes             # Git LFS configuration
├── .gitignore                 # Ignored files
├── LICENSE                    # MIT license
├── MANIFEST.in                # Included files for PyPI
├── pyproject.toml             # Package configuration
├── README.md                  # This documentation
├── requirements.txt           # Python dependencies
├── setup.py                   # Package setup
└── run_training.sh            # Training launch script

🧪 Usage Examples

Basic Example (examples/basic_usage.py)

from ruaccent import load_accentor

accentor = load_accentor()
texts = ["привет мир", "мама мыла раму", "солнце светит ярко"]

# Apostrophe format
results = accentor(texts, format='apostrophe')
for original, accented in zip(texts, results):
    print(f"{original} → {accented}")

Batch Processing and Tests (examples/batch_processing.py)

python examples/batch_processing.py

This script benchmarks different batch sizes and prints cache statistics and the best-performing settings.

🏗️ Training Scripts

The train_scripts/ folder contains the scripts that developers and researchers need to retrain the model:

  • model.py - Transformer architecture definition
  • prepare_data.py - Data preprocessing and preparation
  • train_model.py - Main training script

Training from Scratch

# Install dependencies
pip install torch pandas tqdm

# Start training
cd train_scripts
python train_model.py

Note: Training requires significant resources (GPU recommended) and takes several hours.

🔤 Output Formats

1. Apostrophe Format (я')

An apostrophe is placed after the stressed vowel:

  • Input: привет
  • Output: приве'т
  • Use case: Text display, reading

2. Synthesis Format (+я)

A plus sign is placed before the stressed vowel:

  • Input: привет
  • Output: прив+ет
  • Use case: Speech synthesis, TTS systems

🚀 Quick Start

After pip installation:

# Verify installation
ruaccent "тестовая фраза"

# Run examples  
python examples/basic_usage.py  

From Source Code:

# Clone repository
git clone https://github.com/kubataba/Russian-Stress-Accent-Predictor.git
cd Russian-Stress-Accent-Predictor

# Install in development mode
pip install -e .

# Use as usual
ruaccent "ваш текст"

📊 Performance and Caching

The accentor caches results so that repeated texts are not run through the model again:

  • Cache hits: effectively instantaneous (~0.0000s per text)
  • Cache misses: ~0.5s for the first call
  • Optimal batch size: 8 (~10 sentences/sec on MPS)
  • Cache size: Up to 10,000 items

# View cache statistics
cache_info = accentor.cache_info()
print(f"Cache hits: {cache_info['hits']}, misses: {cache_info['misses']}")

# Clear cache
accentor.clear_cache()
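
To make the hit and miss counters concrete, here is a minimal sketch of a bounded result cache wrapped around a prediction function. It is not the package's implementation: the class `CachedPredictor`, the least-recently-used eviction policy, and the extra "size" key are assumptions. Only the 10,000-item limit, the method names `cache_info` and `clear_cache`, and the 'hits' and 'misses' keys come from the text above.

```python
from collections import OrderedDict
from typing import Callable, Dict


class CachedPredictor:
    """Wrap a text -> text function with a bounded cache (LRU eviction assumed)."""

    def __init__(self, predict: Callable[[str], str], max_size: int = 10_000) -> None:
        self._predict = predict
        self._max_size = max_size
        self._cache: "OrderedDict[str, str]" = OrderedDict()
        self._hits = 0
        self._misses = 0

    def __call__(self, text: str) -> str:
        if text in self._cache:
            self._hits += 1
            self._cache.move_to_end(text)  # mark as most recently used
            return self._cache[text]
        self._misses += 1
        result = self._predict(text)
        self._cache[text] = result
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return result

    def cache_info(self) -> Dict[str, int]:
        return {"hits": self._hits, "misses": self._misses, "size": len(self._cache)}

    def clear_cache(self) -> None:
        self._cache.clear()
        self._hits = self._misses = 0


# Stand-in for the model: str.upper plays the role of the expensive prediction.
predictor = CachedPredictor(str.upper, max_size=2)
predictor("привет")
predictor("привет")
print(predictor.cache_info())
```

A cache like this pays off when the same sentences recur, which is why a repeated call costs almost nothing compared with the first one.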

🤝 Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

The project is distributed under the MIT license. See the LICENSE file for details.

The dataset is also distributed under the MIT license.

🙏 Acknowledgments

  • Dataset provided by nevmenandr
  • Project uses the Transformer architecture from PyTorch
  • Inspired by natural language processing tasks for Russian language

🔗 Useful Links


Package Version: 1.1.0
Package Name: ruaccent-predictor
Last Updated: February 2026
