Russian stress accent prediction using Transformer model

Russian Stress Accent Predictor (Accentor) - ruaccent-predictor

License: MIT · Python 3.8+ · PyTorch

Automatic stress accent placement in Russian text using a character-level Transformer model. Available on PyPI as ruaccent-predictor.

📋 Description

This project is a deep learning model for automatic stress accent placement in Russian text. The model is trained on a dataset of over 224,000 sentence pairs from literary works and achieves 99.7% accuracy on the validation dataset.

Key Features

  • ✅ 99.7% accuracy on validation dataset
  • 🚀 Two output formats: apostrophe (я') and synthesis (+я)
  • ⚡ Batch processing support for speed optimization
  • 💾 Built-in result caching
  • 🔧 Support for CPU, CUDA, and Apple MPS (Metal)
  • 📦 Easy pip installation: pip install ruaccent-predictor

Technical Details

Character-Level Model: The model operates at the character level with a vocabulary of 224 characters automatically extracted from the training dataset. This approach achieves high accuracy while keeping the model compact (~12.5M parameters).

Vocabulary: Automatically extracted from the training corpus, includes:

  • Cyrillic letters (uppercase and lowercase)
  • Basic punctuation
  • Latin letters
  • Special tokens
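A character vocabulary of this kind can be built in a few lines. The sketch below is illustrative only; the project's actual prepare_data.py may differ, and the special-token names here are assumptions:

```python
def build_vocab(corpus_lines, specials=("<pad>", "<bos>", "<eos>", "<unk>")):
    """Collect every distinct character in the corpus, plus special tokens."""
    chars = sorted({ch for line in corpus_lines for ch in line})
    # Special tokens come first so their indices are stable across runs.
    itos = list(specials) + chars                   # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}   # token -> index
    return stoi, itos

stoi, itos = build_vocab(["Мама мыла раму.", "привет, мир!"])
print(len(itos))  # 4 special tokens + all distinct characters seen
```

Sorting the character set makes the mapping deterministic, so the same corpus always yields the same vocabulary file.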

Output Formats

Apostrophe format: Stress mark is placed after the stressed vowel
Example: В лесу' родила'сь ёлочка (best suited for reading texts with stress marks, e.g. when learning Russian)

Synthesis format: Plus sign is placed before the stressed vowel
Example: В лес+у родил+ась ёлочка (Optimal for speech synthesis, e.g., Silero TTS)

⚠️ Model Limitations

The model has the following known limitations:

  1. Does not restore missing letter "ё": The model works with the input text as-is and does not replace "е" with "ё"
  2. Does not mark stress on "ё": Since "ё" is always stressed in Russian, the model does not place additional stress marks on it
  3. Single-vowel words: Words with only one vowel are not marked as they are inherently stressed
  4. No grammatical analysis: The model operates purely on character sequences without morphological or syntactic analysis
  5. Training data limitations: Accuracy may vary for texts outside the literary domain of the training data
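Limitations 2 and 3 follow from the same rule: a word needs an explicit mark only when its stress position is ambiguous. A small hypothetical helper illustrating that rule (the vowel set and function are this sketch's assumptions, not part of the package API):

```python
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")

def needs_stress_mark(word):
    """A word needs an explicit mark only if it has two or more vowels
    and does not contain 'ё' (which is always the stressed vowel)."""
    vowel_count = sum(ch in RUSSIAN_VOWELS for ch in word)
    return vowel_count >= 2 and "ё" not in word.lower()

print(needs_stress_mark("мир"))     # False: single vowel, stress is trivial
print(needs_stress_mark("привет"))  # True: two candidate vowels
print(needs_stress_mark("ёлочка"))  # False: 'ё' fixes the stress
```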

📦 PyPI Installation

The package is available on PyPI as ruaccent-predictor:

pip install ruaccent-predictor

Usage as Python Package

from ruaccent import load_accentor

# Load the model
accentor = load_accentor()

# Predict stress accents
text = "привет мир"
result = accentor(text)
print(result)  # приве'т мир

Usage as CLI Tool

After installation, use the ruaccent command:

# Process single text
ruaccent "привет как дела"

# Process file
ruaccent --input-file input.txt --output-file output.txt

# Synthesis format
ruaccent "привет" --format synthesis

# Both formats
ruaccent "текст" --format both

# Pipe input
echo "мама мыла раму" | ruaccent 

Available Options:

  • --format: Output format (apostrophe, synthesis, both)
  • --batch-size: Batch size for processing (default: 8)
  • --device: Device for inference (auto, cpu, cuda, mps)
  • --input-file, -i: Input text file
  • --output-file, -o: Output file

🎯 Performance

Benchmarks

  • Accuracy: 99.7% on validation set (22,000 sentences)
  • Speed: ~10 sentences/sec with batch_size=8 on a Mac Mini M4
  • Model size: ~12.5M parameters
  • Vocabulary: 224 characters (Cyrillic, punctuation, Latin)

Optimal Settings

# For maximum performance
accentor = load_accentor()
results = accentor(texts, batch_size=8, format='apostrophe')

📁 Project Structure

Russian-Stress-Accent-Predictor/
├── ruaccent/                    # Main package (PyPI)
│   ├── __init__.py
│   ├── accentor.py             # Main module with model
│   └── cli.py                  # CLI interface
├── model/                      # Trained model
│   ├── README.md              # Model documentation
│   ├── acc_model.pt           # Model weights (30MB, Git LFS)
│   └── vocab.json             # Character vocabulary
├── data/                       # Datasets
│   ├── train.csv              # Training set (115MB, Git LFS)
│   └── val.csv                # Validation set (13MB)
├── examples/                   # Usage examples
│   ├── basic_usage.py         # Basic examples
│   └── batch_processing.py    # Batch processing and tests
├── train_scripts/              # Model training scripts
│   ├── model.py               # Transformer architecture
│   ├── prepare_data.py        # Data preparation
│   ├── train_model.py         # Model training
│   └── README.md              # Training instructions
├── .gitattributes             # Git LFS configuration
├── .gitignore                 # Ignored files
├── LICENSE                    # MIT license
├── MANIFEST.in                # Included files for PyPI
├── pyproject.toml             # Package configuration
├── README.md                  # This documentation
├── requirements.txt           # Python dependencies
├── setup.py                   # Package setup
└── run_training.sh            # Training launch script

🧪 Usage Examples

Basic Example (examples/basic_usage.py)

from ruaccent import load_accentor

accentor = load_accentor()
texts = ["привет мир", "мама мыла раму", "солнце светит ярко"]

# Apostrophe format
results = accentor(texts, format='apostrophe')
for original, accented in zip(texts, results):
    print(f"{original} → {accented}")

Batch Processing and Tests (examples/batch_processing.py)

python examples/batch_processing.py

Tests performance with different batch sizes, shows cache statistics and optimal settings.

🏗️ Training Scripts

For developers and researchers in the train_scripts/ folder:

Training Scripts

  • model.py - Transformer architecture definition
  • prepare_data.py - Data preprocessing and preparation
  • train_model.py - Main training script

Training from Scratch

# Install dependencies
pip install torch pandas tqdm

# Start training
cd train_scripts
python train_model.py

Note: Training requires significant resources (GPU recommended) and takes several hours.

🔤 Output Formats

1. Apostrophe Format (я')

Apostrophe is placed after the stressed vowel:

  • Input: привет
  • Output: приве'т
  • Use case: Text display, reading

2. Synthesis Format (+я)

Plus sign is placed before the stressed vowel:

  • Input: привет
  • Output: прив+ет
  • Use case: Speech synthesis, TTS systems
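Both formats encode the same stress position, so converting between them is a pure string transformation. A sketch of such a converter (the released package may or may not expose an equivalent helper):

```python
import re

RUS_VOWELS = "аеёиоуыэюяАЕЁИОУЫЭЮЯ"

def apostrophe_to_synthesis(text):
    """приве'т -> прив+ет: move the mark from after the vowel to before it."""
    return re.sub(rf"([{RUS_VOWELS}])'", r"+\1", text)

def synthesis_to_apostrophe(text):
    """прив+ет -> приве'т: the inverse transformation."""
    return re.sub(rf"\+([{RUS_VOWELS}])", r"\1'", text)

print(apostrophe_to_synthesis("приве'т"))  # прив+ет
print(synthesis_to_apostrophe("прив+ет"))  # приве'т
```

Restricting the pattern to vowels keeps genuine apostrophes and plus signs elsewhere in the text untouched.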

🚀 Quick Start

After pip installation:

# Verify installation
ruaccent "тестовая фраза"

# Run examples  
python examples/basic_usage.py  

From Source Code:

# Clone repository
git clone https://github.com/kubataba/Russian-Stress-Accent-Predictor.git
cd Russian-Stress-Accent-Predictor

# Install in development mode
pip install -e .

# Use as usual
ruaccent "ваш текст"

📊 Performance and Caching

The model uses intelligent caching:

  • Cache hits: ~0.0000s per text
  • Cache misses: ~0.5s for first call
  • Optimal batch size: 8 (~10 sentences/sec on MPS)
  • Cache size: up to 10,000 items

# View cache statistics
cache_info = accentor.cache_info()
print(f"Cache hits: {cache_info['hits']}, misses: {cache_info['misses']}")

# Clear cache
accentor.clear_cache()
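A bounded cache of this kind can be approximated with an OrderedDict used as an LRU map. This is a sketch of the idea under assumed eviction semantics, not the package's actual implementation:

```python
from collections import OrderedDict

class PredictionCache:
    """Bounded LRU cache: repeated texts skip the model entirely."""
    def __init__(self, maxsize=10_000):
        self.maxsize = maxsize
        self._store = OrderedDict()
        self.hits = self.misses = 0

    def get_or_compute(self, text, compute):
        if text in self._store:
            self.hits += 1
            self._store.move_to_end(text)  # keep recently used items alive
            return self._store[text]
        self.misses += 1
        result = compute(text)
        self._store[text] = result
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict least recently used
        return result

cache = PredictionCache(maxsize=2)
fake_model = lambda t: t.upper()  # stand-in for the real accentor
cache.get_or_compute("привет", fake_model)
cache.get_or_compute("привет", fake_model)
print(cache.hits, cache.misses)  # 1 1
```

The hit/miss counters mirror what `cache_info()` reports; the second lookup of the same text never touches the model.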

🤝 Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

The project is distributed under the MIT license. See the LICENSE file for details.

The dataset is also distributed under the MIT license.

🙏 Acknowledgments

  • Dataset provided by nevmenandr
  • Project uses the Transformer architecture from PyTorch
  • Inspired by natural language processing tasks for Russian language

Package Name: ruaccent-predictor
Last Updated: February 2026

Changelog

1.2.0 (2026-02-05)

Improvements:

  • Fixed critical packaging issue where model files were missing from wheel distributions
  • Restructured package layout for better organization (model files now inside package)
  • Improved file loading logic to use relative paths
  • Updated all build configurations for reliability
  • Enhanced installation experience

1.1.0 (2025-02-04)

  • Initial release
  • 99.7% accuracy on validation set
  • Support for apostrophe and synthesis formats
