Luscinia: LightGBM/ONNX stress predictor for out-of-vocabulary Ukrainian words
Luscinia
Luscinia is a machine learning model for predicting stress (accent) positions in out-of-vocabulary Ukrainian words. It uses a LightGBM model exported to ONNX format for efficient inference.
Features
- 99.44% accuracy on held-out validation data
- 132 linguistic features including character n-grams, vowel patterns, suffix analysis, and POS tags
- ONNX Runtime for fast, cross-platform inference
- Batch prediction support for efficient processing
- Zero external model downloads — model bundled in the package (~30 MB)
- Type hints and comprehensive documentation
Installation
pip install luscinia
Quick Start
from luscinia import LusciniaPredictor
# Initialize predictor (loads bundled ONNX model)
predictor = LusciniaPredictor()
# Predict stress position (returns 0-based vowel index)
stress_idx = predictor.predict("університет")
print(stress_idx) # 4 (5th vowel is stressed: універси<те>т)
# Provide POS tag for better accuracy
stress_idx = predictor.predict("виходити", pos="VERB")
print(stress_idx) # 0 (<ви>ходити)
# Batch prediction
words = ["мама", "тато", "дитина"]
indices = predictor.predict_batch(words)
print(indices) # [0, 0, 1]
# Get probability distributions
probs = predictor.predict_proba("університет")
print(probs[4]) # High probability for position 4
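The 0-based vowel index returned above can be turned into an accented spelling. The following is a minimal sketch and not part of the Luscinia package: the helper name `mark_stress` and the vowel set are assumptions made for illustration. It places a combining acute accent (U+0301) directly after the stressed vowel.

```python
# Hypothetical helper, not part of the luscinia package.
UKRAINIAN_VOWELS = set("аеєиіїоуюя")
COMBINING_ACUTE = "\u0301"


def mark_stress(word: str, vowel_index: int) -> str:
    """Insert a combining acute accent after the vowel at `vowel_index` (0-based)."""
    seen = -1
    for pos, ch in enumerate(word):
        if ch.lower() in UKRAINIAN_VOWELS:
            seen += 1
            if seen == vowel_index:
                return word[: pos + 1] + COMBINING_ACUTE + word[pos + 1 :]
    raise ValueError(f"{word!r} has no vowel with index {vowel_index}")


print(mark_stress("мама", 0))  # the accent lands on the first "а"
```

Raising `ValueError` for an out-of-range index makes it easier to notice when a predicted class does not correspond to any vowel in the word.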
Advanced Usage
Custom Model Path
# Load a custom ONNX model
predictor = LusciniaPredictor(model_path="path/to/custom_model.onnx.gz")
Performance Tuning
import onnxruntime as ort
# Custom session options
session_options = ort.SessionOptions()
session_options.intra_op_num_threads = 4 # Multi-threading
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
predictor = LusciniaPredictor(session_options=session_options)
Batch Processing with POS Tags
words = ["читати", "читання", "читач"]
pos_tags = ["VERB", "NOUN", "NOUN"]
indices = predictor.predict_batch(words, pos_tags=pos_tags)
Model Information
print(predictor.model_info)
# {
# 'num_features': 132,
# 'num_classes': 11,
# 'version': 'luscinia-lgbm-str-ua-univ-v1.0',
# 'opset': 15,
# ...
# }
Use Case
Luscinia is designed for out-of-vocabulary (OOV) stress prediction. For maximum accuracy:
- Check dictionary first — Use ua-stress-engine for dictionary lookup (2.7M word forms with full morphology)
- Fall back to Luscinia — For unknown words, use Luscinia to predict stress
# Recommended workflow
from ukrainian_stress import lookup # from ua-stress-engine
from luscinia import LusciniaPredictor
predictor = LusciniaPredictor()
def get_stress_index(word, pos=None):
    # Try dictionary first
    result = lookup(word)
    if result['readings']:
        return result['readings'][0]['syllable_index']
    # Fall back to ML prediction
    return predictor.predict(word, pos=pos)
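A variation on the workflow above may be useful when the same words recur often and the caller wants to know where each answer came from. This is only a sketch: `make_resolver`, the cache size, and the `"dictionary"`/`"model"` labels are assumptions, and the two `fake_*` functions stand in for `lookup` and `predictor.predict` so the example runs without either package installed. It assumes the lookup result has the `readings` shape shown in the snippet above.

```python
from functools import lru_cache
from typing import Callable, Optional


# Hypothetical wrapper: `lookup_fn` stands in for ua-stress-engine's lookup,
# `predict_fn` for LusciniaPredictor.predict.
def make_resolver(
    lookup_fn: Callable[[str], dict],
    predict_fn: Callable[..., int],
    cache_size: int = 10_000,
):
    @lru_cache(maxsize=cache_size)
    def resolve(word: str, pos: Optional[str] = None) -> tuple:
        result = lookup_fn(word)
        readings = result.get("readings") or []
        if readings:
            # Dictionary hit: use the first reading, as in the workflow above.
            return readings[0]["syllable_index"], "dictionary"
        # Dictionary miss: fall back to the ML prediction.
        return predict_fn(word, pos=pos), "model"

    return resolve


# Stand-in implementations for illustration only.
def fake_lookup(word):
    known = {"мама": 0}
    if word in known:
        return {"readings": [{"syllable_index": known[word]}]}
    return {"readings": []}


def fake_predict(word, pos=None):
    return 1


resolve = make_resolver(fake_lookup, fake_predict)
print(resolve("мама"))    # (0, 'dictionary')
print(resolve("ґаджет"))  # (1, 'model')
```

Returning the source alongside the index lets downstream code treat model predictions with more caution than dictionary entries.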
Supported POS Tags
The model supports the following part-of-speech tags for improved accuracy:
- NOUN: noun
- VERB: verb
- ADJ: adjective
- ADV: adverb
- NUM: numeral
- PRON: pronoun
- DET: determiner
- PART: particle
- CONJ: conjunction
- ADP: adposition
- INTJ: interjection
- X: other
POS tags are optional but recommended when available.
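Taggers that emit Universal Dependencies tags produce a few labels that are not in the list above (for example `PROPN`, `AUX`, `CCONJ`, `SCONJ`). One way to handle this, sketched below, is to normalize tags before calling the predictor. The mapping is an assumption, not documented Luscinia behavior, and `normalize_pos` is a hypothetical helper.

```python
from typing import Optional

# Tags listed in the section above.
SUPPORTED_POS = {
    "NOUN", "VERB", "ADJ", "ADV", "NUM", "PRON",
    "DET", "PART", "CONJ", "ADP", "INTJ", "X",
}

# Assumed fallbacks for Universal Dependencies tags missing from that list.
UD_FALLBACKS = {
    "PROPN": "NOUN",
    "AUX": "VERB",
    "CCONJ": "CONJ",
    "SCONJ": "CONJ",
}


def normalize_pos(tag: Optional[str]) -> Optional[str]:
    """Return a supported tag, or None so the predictor runs without a POS hint."""
    if not tag:
        return None
    tag = tag.strip().upper()
    if tag in SUPPORTED_POS:
        return tag
    return UD_FALLBACKS.get(tag)


print(normalize_pos("propn"))  # NOUN
print(normalize_pos("PUNCT"))  # None
```

Since POS tags are optional, returning `None` for unrecognized tags avoids passing a value the model was not described as supporting.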
Model Details
- Model: luscinia-lgbm-str-ua-univ-v1
- Algorithm: LightGBM multiclass (11 classes for up to 11-syllable words)
- Training Data: 2.7M Ukrainian word forms
- Features: 132 (100 base + 32 universal extensions)
- Accuracy: 99.44% on validation set
- Export Format: ONNX (opset 15)
- File Size: ~30 MB (gzip compressed)
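Because the classifier has a fixed set of 11 classes, a shorter word has classes that cannot be valid stress positions. The sketch below restricts the choice to vowels that actually exist in the word. It assumes `predict_proba` returns a sequence of per-position probabilities, which the Quick Start's `probs[4]` suggests but does not fully specify; whether the predictor already does this internally is not stated. `count_vowels` and `best_valid_position` are hypothetical names, and the probabilities are made up.

```python
from typing import Sequence

UKRAINIAN_VOWELS = set("аеєиіїоуюя")


def count_vowels(word: str) -> int:
    """Count Ukrainian vowel letters, i.e. candidate stress positions."""
    return sum(ch in UKRAINIAN_VOWELS for ch in word.lower())


def best_valid_position(word: str, probs: Sequence[float]) -> int:
    """Pick the most probable class among positions that exist in `word`."""
    n = min(count_vowels(word), len(probs))
    if n == 0:
        raise ValueError(f"{word!r} has no vowels to stress")
    return max(range(n), key=lambda i: probs[i])


# Made-up probabilities for illustration, not real model output.
example_probs = [0.05, 0.10, 0.02, 0.03, 0.70, 0.10, 0, 0, 0, 0, 0]
print(best_valid_position("університет", example_probs))  # 4
print(best_valid_position("мама", example_probs))         # 1
```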
Performance
- Single prediction: ~1-2 ms (CPU)
- Batch prediction (100 words): ~10-20 ms (CPU)
- Model loading: ~100-200 ms (one-time cost)
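Taken at face value, these figures put the per-word cost of batch prediction (100 words in ~10-20 ms) at roughly a tenth of the single-call cost (~1-2 ms each). For large inputs, one option is to feed the predictor fixed-size chunks. The sketch below is an illustration only: `chunked`, `predict_in_chunks`, and the chunk size of 100 are assumptions, and `fake_batch` stands in for `predictor.predict_batch` so the code runs on its own.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def predict_in_chunks(
    words: Iterable[str],
    batch_fn: Callable[[List[str]], List[int]],
    size: int = 100,
) -> List[int]:
    """Run `batch_fn` over `words` chunk by chunk and concatenate the results."""
    results: List[int] = []
    for chunk in chunked(words, size):
        results.extend(batch_fn(chunk))
    return results


# Stand-in for predictor.predict_batch, for illustration only.
def fake_batch(batch: List[str]) -> List[int]:
    return [0 for _ in batch]


print(predict_in_chunks(["мама", "тато", "дитина"], fake_batch, size=2))  # [0, 0, 0]
```

The best chunk size depends on hardware and thread settings, so it is worth measuring rather than assuming.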
Requirements
- Python 3.8+
- numpy >= 1.24.0
- onnxruntime >= 1.16.0
License
AGPL-3.0-or-later
Citation
If you use Luscinia in academic work, please cite:
Lukan, R. (2026). Luscinia: A LightGBM-based stress predictor for Ukrainian.
https://github.com/Rostyslav-Lukan/ua-stress-engine
Related Projects
- ua-stress-engine — Ukrainian stress dictionary with Rust extension
- ua-word-stress — JavaScript/TypeScript dictionary (npm)
- ua-word-stress-wasm — WebAssembly version (npm)
Contributing
Contributions are welcome. See the GitHub repository for details.