Russian stress accent prediction using Transformer model

Russian Stress Accent Predictor (Accentor) - ruaccent-predictor

License: MIT · Python 3.8+ · PyTorch

Automatic stress accent placement in Russian text using a character-level Transformer model. Available on PyPI as ruaccent-predictor.

📋 Description

This project is a deep learning model for automatic stress accent placement in Russian text. The model is trained on a dataset of over 224,000 sentence pairs from literary works and achieves 99.7% accuracy on the validation dataset.

Key Features

  • ✅ 99.7% accuracy on validation dataset
  • 🚀 Two output formats: apostrophe (я') and synthesis (+я)
  • ⚡ Batch processing support for speed optimization
  • 💾 Built-in result caching
  • 🔧 Support for CPU, CUDA, and Apple MPS (Metal)
  • 📦 Easy pip installation: pip install ruaccent-predictor

Technical Details

Character-Level Model: The model operates at the character level with a vocabulary of 224 characters automatically extracted from the training dataset. This approach achieves high accuracy while keeping the model compact (~12.5M parameters).

Vocabulary: Automatically extracted from the training corpus, includes:

  • Cyrillic letters (uppercase and lowercase)
  • Basic punctuation
  • Latin letters
  • Special tokens
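A character vocabulary of this kind can be built in a few lines. The sketch below is illustrative only; the project's actual prepare_data.py may differ, and the special-token names here are assumptions:

```python
def build_vocab(corpus_lines, specials=("<pad>", "<bos>", "<eos>", "<unk>")):
    """Collect every distinct character in the corpus, plus special tokens."""
    chars = sorted({ch for line in corpus_lines for ch in line})
    # Special tokens come first so their indices are stable across runs.
    itos = list(specials) + chars                   # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}   # token -> index
    return stoi, itos

stoi, itos = build_vocab(["Мама мыла раму.", "привет, мир!"])
print(len(itos))  # 4 special tokens + all distinct characters seen
```

Sorting the character set makes the mapping deterministic, so the same corpus always yields the same vocabulary file.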

Output Formats

Apostrophe format: Stress mark is placed after the stressed vowel
Example: В лесу' родила'сь ёлочка (best suited for reading texts with stress marks, e.g. when learning Russian)

Synthesis format: Plus sign is placed before the stressed vowel
Example: В лес+у родил+ась ёлочка (Optimal for speech synthesis, e.g., Silero TTS)

⚠️ Model Limitations

The model has the following known limitations:

  1. Does not restore missing letter "ё": The model works with the input text as-is and does not replace "е" with "ё"
  2. Does not mark stress on "ё": Since "ё" is always stressed in Russian, the model does not place additional stress marks on it
  3. Single-vowel words: Words with only one vowel are not marked as they are inherently stressed
  4. No grammatical analysis: The model operates purely on character sequences without morphological or syntactic analysis
  5. Training data limitations: Accuracy may vary for texts outside the literary domain of the training data
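Limitations 2 and 3 follow from the same rule: a word needs an explicit mark only when its stress position is ambiguous. A small hypothetical helper illustrating that rule (the vowel set and function are this sketch's assumptions, not part of the package API):

```python
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")

def needs_stress_mark(word):
    """A word needs an explicit mark only if it has two or more vowels
    and does not contain 'ё' (which is always the stressed vowel)."""
    vowel_count = sum(ch in RUSSIAN_VOWELS for ch in word)
    return vowel_count >= 2 and "ё" not in word.lower()

print(needs_stress_mark("мир"))     # False: single vowel, stress is trivial
print(needs_stress_mark("привет"))  # True: two candidate vowels
print(needs_stress_mark("ёлочка"))  # False: 'ё' fixes the stress
```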

📦 PyPI Installation

The package is available on PyPI as ruaccent-predictor:

pip install ruaccent-predictor

Usage as Python Package

from ruaccent import load_accentor

# Load the model
accentor = load_accentor()

# Predict stress accents
text = "привет мир"
result = accentor(text)
print(result)  # приве'т мир

Usage as CLI Tool

After installation, use the ruaccent command:

# Process single text
ruaccent "привет как дела"

# Process file
ruaccent --input-file input.txt --output-file output.txt

# Synthesis format
ruaccent "привет" --format synthesis

# Both formats
ruaccent "текст" --format both

# Pipe input
echo "мама мыла раму" | ruaccent 

Available Options:

  • --format: Output format (apostrophe, synthesis, both)
  • --batch-size: Batch size for processing (default: 8)
  • --device: Device for inference (auto, cpu, cuda, mps)
  • --input-file, -i: Input text file
  • --output-file, -o: Output file

🎯 Performance

Benchmarks

  • Accuracy: 99.7% on validation set (22,000 sentences)
  • Speed: ~10 sentences/sec with batch_size=8 on a Mac Mini M4
  • Model size: ~12.5M parameters
  • Vocabulary: 224 characters (Cyrillic, punctuation, Latin)

Optimal Settings

# For maximum performance
accentor = load_accentor()
results = accentor(texts, batch_size=8, format='apostrophe')

📁 Project Structure

Russian-Stress-Accent-Predictor/
├── ruaccent/                    # Main package (PyPI)
│   ├── __init__.py
│   ├── accentor.py             # Main module with model
│   └── cli.py                  # CLI interface
├── model/                      # Trained model
│   ├── README.md              # Model documentation
│   ├── acc_model.pt           # Model weights (30MB, Git LFS)
│   └── vocab.json             # Character vocabulary
├── data/                       # Datasets
│   ├── train.csv              # Training set (115MB, Git LFS)
│   └── val.csv                # Validation set (13MB)
├── examples/                   # Usage examples
│   ├── basic_usage.py         # Basic examples
│   └── batch_processing.py    # Batch processing and tests
├── train_scripts/              # Model training scripts
│   ├── model.py               # Transformer architecture
│   ├── prepare_data.py        # Data preparation
│   ├── train_model.py         # Model training
│   └── README.md              # Training instructions
├── .gitattributes             # Git LFS configuration
├── .gitignore                 # Ignored files
├── LICENSE                    # MIT license
├── MANIFEST.in                # Included files for PyPI
├── pyproject.toml             # Package configuration
├── README.md                  # This documentation
├── requirements.txt           # Python dependencies
├── setup.py                   # Package setup
└── run_training.sh            # Training launch script

🧪 Usage Examples

Basic Example (examples/basic_usage.py)

from ruaccent import load_accentor

accentor = load_accentor()
texts = ["привет мир", "мама мыла раму", "солнце светит ярко"]

# Apostrophe format
results = accentor(texts, format='apostrophe')
for original, accented in zip(texts, results):
    print(f"{original} → {accented}")

Batch Processing and Tests (examples/batch_processing.py)

python examples/batch_processing.py

Tests performance with different batch sizes, shows cache statistics and optimal settings.

🏗️ Training Scripts

For developers and researchers in the train_scripts/ folder:

Training Scripts

  • model.py - Transformer architecture definition
  • prepare_data.py - Data preprocessing and preparation
  • train_model.py - Main training script

Training from Scratch

# Install dependencies
pip install torch pandas tqdm

# Start training
cd train_scripts
python train_model.py

Note: Training requires significant resources (GPU recommended) and takes several hours.

🔤 Output Formats

1. Apostrophe Format (я')

Apostrophe is placed after the stressed vowel:

  • Input: привет
  • Output: приве'т
  • Use case: Text display, reading

2. Synthesis Format (+я)

Plus sign is placed before the stressed vowel:

  • Input: привет
  • Output: прив+ет
  • Use case: Speech synthesis, TTS systems
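Both formats encode the same stress position, so converting between them is a pure string transformation. A sketch of such a converter (the released package may or may not expose an equivalent helper):

```python
import re

RUS_VOWELS = "аеёиоуыэюяАЕЁИОУЫЭЮЯ"

def apostrophe_to_synthesis(text):
    """приве'т -> прив+ет: move the mark from after the vowel to before it."""
    return re.sub(rf"([{RUS_VOWELS}])'", r"+\1", text)

def synthesis_to_apostrophe(text):
    """прив+ет -> приве'т: the inverse transformation."""
    return re.sub(rf"\+([{RUS_VOWELS}])", r"\1'", text)

print(apostrophe_to_synthesis("приве'т"))  # прив+ет
print(synthesis_to_apostrophe("прив+ет"))  # приве'т
```

Restricting the pattern to vowels keeps genuine apostrophes and plus signs elsewhere in the text untouched.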

🚀 Quick Start

After pip installation:

# Verify installation
ruaccent "тестовая фраза"

# Run examples  
python examples/basic_usage.py  

From Source Code:

# Clone repository
git clone https://github.com/kubataba/Russian-Stress-Accent-Predictor.git
cd Russian-Stress-Accent-Predictor

# Install in development mode
pip install -e .

# Use as usual
ruaccent "ваш текст"

📊 Performance and Caching

The model uses intelligent caching:

  • Cache hits: ~0.0000s per text
  • Cache misses: ~0.5s for first call
  • Optimal batch size: 8 (~10 sentences/sec on MPS)
  • Cache size: up to 10,000 items

# View cache statistics
cache_info = accentor.cache_info()
print(f"Cache hits: {cache_info['hits']}, misses: {cache_info['misses']}")

# Clear cache
accentor.clear_cache()
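A bounded cache of this kind can be approximated with an OrderedDict used as an LRU map. This is a sketch of the idea under assumed eviction semantics, not the package's actual implementation:

```python
from collections import OrderedDict

class PredictionCache:
    """Bounded LRU cache: repeated texts skip the model entirely."""
    def __init__(self, maxsize=10_000):
        self.maxsize = maxsize
        self._store = OrderedDict()
        self.hits = self.misses = 0

    def get_or_compute(self, text, compute):
        if text in self._store:
            self.hits += 1
            self._store.move_to_end(text)  # keep recently used items alive
            return self._store[text]
        self.misses += 1
        result = compute(text)
        self._store[text] = result
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict least recently used
        return result

cache = PredictionCache(maxsize=2)
fake_model = lambda t: t.upper()  # stand-in for the real accentor
cache.get_or_compute("привет", fake_model)
cache.get_or_compute("привет", fake_model)
print(cache.hits, cache.misses)  # 1 1
```

The hit/miss counters mirror what `cache_info()` reports; the second lookup of the same text never touches the model.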

🤝 Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

The project is distributed under the MIT license. See the LICENSE file for details.

The dataset is also distributed under the MIT license.

🙏 Acknowledgments

  • Dataset provided by nevmenandr
  • Project uses the Transformer architecture from PyTorch
  • Inspired by natural language processing tasks for Russian language

Package Name: ruaccent-predictor
Last Updated: February 2026

Changelog

1.2.0 (2026-02-05)

Improvements:

  • Fixed critical packaging issue where model files were missing from wheel distributions
  • Restructured package layout for better organization (model files now inside package)
  • Improved file loading logic to use relative paths
  • Updated all build configurations for reliability
  • Enhanced installation experience

1.1.0 (2025-02-04)

  • Initial release
  • 99.7% accuracy on validation set
  • Support for apostrophe and synthesis formats
