Ukrainian NLP backend for word stress prediction: the Luscinia LightGBM model (99.44 % accuracy) with an ONNX export for the browser.
ua-stress-engine
Ukrainian word stress engine — dictionary lookup with full IPA transcription, ML stress prediction, and published packages for Python, Node.js, and the browser.
The centrepiece is Luscinia — a LightGBM model that predicts the stressed vowel in
any Ukrainian word with 99.44 % accuracy across all syllable counts.
The model is also exported to ONNX for browser-side inference via onnxruntime-web.
Published packages
| Package | Registry | Source | Description |
|---|---|---|---|
| ua-word-stress | npm | packages/ua-stress-web/ | Zero-dependency TypeScript trie (~9 MB, browser + Node) |
| ua-word-stress-wasm | npm | crates/wasm/ | Rust/WASM: full IPA, morphology, batch API |
| ua-stress-engine | PyPI (planned) | crates/python/ | PyO3 extension with the same API as WASM, for Python |
Highlights
| Property | Value |
|---|---|
| Model | luscinia-lgbm-str-ua-univ-v1 |
| Task | Ukrainian word stress prediction (multiclass, vowel-ordinal) |
| Accuracy | 99.44 % (sanity sample) · 192 / 197 hand-checked |
| Syllable coverage | words of 2 to 10+ syllables, single universal model |
| Features | 132 linguistic / hash features |
| Runtimes | lightgbm (Python) · ONNX (browser via onnxruntime-web) |
| Training data | 2.7 M word forms |
| License | AGPL-3.0 |
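To make the "vowel-ordinal" task concrete, here is a minimal sketch of how a class label could be derived from a word that already carries a stress mark. It assumes stress is written as a combining acute accent (U+0301) after the stressed vowel, as in the examples in this README; the function name `stress_label` is hypothetical and not part of the project.

```python
VOWELS = set("аеєиіїоуюя")
ACUTE = "\u0301"  # combining acute accent used as the stress mark


def stress_label(stressed_word: str) -> int:
    """Return the 0-based ordinal of the stressed vowel, or -1 if none is marked."""
    ordinal = -1
    for i, ch in enumerate(stressed_word):
        if ch.lower() in VOWELS:
            ordinal += 1
            # The stress mark, if present, directly follows the stressed vowel.
            if i + 1 < len(stressed_word) and stressed_word[i + 1] == ACUTE:
                return ordinal
    return -1


print(stress_label("ма\u0301ма"))    # 0: the first vowel is stressed
print(stress_label("чита\u0301ти"))  # 1: the second vowel is stressed
```

Under this reading, the model is a classifier over vowel ordinals, so one model can cover words of any length.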
Installation
JavaScript / TypeScript
# Trie-based lookup (browser + Node, no WASM)
npm install ua-word-stress
# Full engine — IPA, morphology, batch API (WASM, bundler required)
npm install ua-word-stress-wasm
Python
The package is a compiled Rust extension (PyO3 + maturin). Runtime Python dependencies:
| Extra | Packages | Purpose |
|---|---|---|
| (core) | none | Dictionary lookup + IPA via Rust extension |
| ml | lightgbm>=4.0, numpy>=1.24 | Luscinia LightGBM resolver |
| nlp | spacy>=3.7 | spaCy tokenization pipeline |
| full | all of the above | Everything |
For local development (requires Rust toolchain and maturin):
pip install maturin
pip install -e '.[full]' # builds the Rust extension in-place
Quick start — JavaScript / TypeScript
Trie-based lookup (ua-word-stress)
Pure TypeScript, no WASM, works in any environment:
import { UaStressTrie } from "ua-word-stress";
const trie = new UaStressTrie();
trie.mark("університет"); // → 'університе́т'
trie.lookup("замок"); // → 0 (first syllable — замок-lock)
trie.markBatch(["мама", "тато"]); // → ['ма́ма', 'та́то']
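For readers who want the idea behind a trie-based stress lookup, the following is a toy sketch in Python. It is not the package's binary format or its API: the class `ToyStressTrie` and its storage layout are assumptions made for illustration. It only mirrors the observable behaviour above, where `lookup` returns a 0-based syllable index and `mark` inserts a combining acute accent after the stressed vowel.

```python
ACUTE = "\u0301"  # combining acute accent
VOWELS = set("аеєиіїоуюя")


class Node:
    def __init__(self):
        self.children = {}
        self.stress = None  # 0-based stressed syllable index if a word ends here


class ToyStressTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, word, stress):
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, Node())
        node.stress = stress

    def lookup(self, word):
        node = self.root
        for ch in word.lower():
            node = node.children.get(ch)
            if node is None:
                return None  # word is not in the trie
        return node.stress

    def mark(self, word):
        stress = self.lookup(word)
        vowels = [i for i, ch in enumerate(word) if ch.lower() in VOWELS]
        if stress is None or stress >= len(vowels):
            return word  # unknown word: return it unchanged
        cut = vowels[stress] + 1
        return word[:cut] + ACUTE + word[cut:]


trie = ToyStressTrie()
trie.insert("мама", 0)
trie.insert("тато", 0)
print(trie.mark("мама"))
```

Words sharing a prefix share nodes, which is what keeps a trie compact for a heavily inflected language.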
Full WASM engine (ua-word-stress-wasm)
Rust/WASM with IPA transcription, morphology, and batch API. No init() call needed — the dictionary loads automatically at module import (bundler target):
import { mark, lookup, stressIndex, transcribe } from "ua-word-stress-wasm";
mark("університет"); // → 'університе́т'
stressIndex("мама"); // → 0 (0-based syllable index)
const r = lookup("замок");
r.readings[0].stressedForm; // → 'за́мок'
r.readings[0].ipa; // → 'zɑmɔk'
r.readings[0].syllableIndex; // → 0
r.readings[1].stressedForm; // → 'замо́к' (heteronym)
transcribe("слово", 0); // → { ipa: 'slɔwɔ', ipaSyllables: ['ˈslɔ', 'wɔ'], … }
See the WASM package README for the full API reference.
Quick start — Python
Dictionary + IPA (Rust extension)
import ukrainian_stress
ukrainian_stress.mark('університет') # → 'університе́т'
r = ukrainian_stress.lookup('замок')
r['readings'][0]['stressed_form'] # → 'за́мок'
r['readings'][0]['ipa'] # → 'zɑmɔk'
r['readings'][0]['syllable_index'] # → 0
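A word such as замок has more than one reading, so callers may want every stressed variant rather than only the first. The sketch below works on a plain dict shaped like the result shown above (`readings`, `stressed_form`, `syllable_index`); the helper names and the hand-built `sample` are illustrative assumptions, not output captured from the extension.

```python
def stressed_variants(result: dict) -> list:
    """Collect the stressed form of every reading, in the order returned."""
    return [reading["stressed_form"] for reading in result.get("readings", [])]


def is_heteronym(result: dict) -> bool:
    """True when the readings disagree about which syllable is stressed."""
    indices = {reading["syllable_index"] for reading in result.get("readings", [])}
    return len(indices) > 1


# Hand-built stand-in for a lookup result (assumed shape).
sample = {
    "readings": [
        {"stressed_form": "за\u0301мок", "syllable_index": 0},
        {"stressed_form": "замо\u0301к", "syllable_index": 1},
    ]
}

print(stressed_variants(sample))
print(is_heteronym(sample))
```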
Full pipeline (Rust dict + LightGBM ML fallback)
Requires pip install -e '.[full]':
from src.stress_resolver.resolver_factory import create_pipeline_kwargs
from src.stress_resolver.pipeline import UkrainianPipeline
pipeline = UkrainianPipeline(**create_pipeline_kwargs())
doc = pipeline.process("Мама варила борщ на кухні.")
for sentence in doc.sentences:
    for token in sentence.tokens:
        print(f"{token.text:15} {token.stress_pattern}")
Raw Luscinia prediction (LightGBM)
import lightgbm as lgb
import numpy as np
from src.stress_prediction.lightgbm.services.feature_service_universal import (
    build_features_universal,
)
MODEL_PATH = (
    "src/stress_prediction/lightgbm/artifacts/"
    "luscinia-lgbm-str-ua-univ-v1/P3_0017_FINAL_FULLDATA/P3_0017_full.lgb"
)
bst = lgb.Booster(model_file=MODEL_PATH)
VOWELS = set("аеєиіїоуюя")
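def predict_stress(word: str, pos: str = "NOUN") -> str:
    """Return `word` with a combining acute accent after the predicted stressed vowel."""
    feat = build_features_universal(word, pos)
    # Assumes `feat` preserves the feature order the model was trained with.
    X = np.array(list(feat.values()), dtype=np.float32).reshape(1, -1)
    # Multiclass output has shape (1, n_classes); the argmax is the vowel ordinal.
    vowel_idx = int(bst.predict(X).argmax(axis=1)[0])
    vpos = [i for i, c in enumerate(word.lower()) if c in VOWELS]
    if vowel_idx >= len(vpos):
        return word  # predicted ordinal exceeds the vowel count; leave unmarked
    cp = vpos[vowel_idx]
    return word[: cp + 1] + "\u0301" + word[cp + 1 :]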
print(predict_stress("університет", "NOUN")) # → університе́т
print(predict_stress("читати", "VERB")) # → чита́ти
POS tags: use Universal Dependencies tags:
NOUN, VERB, ADJ, ADV, PRON, DET, NUM, PART, CCONJ, X. Pass "X" when the POS is unknown.
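When tags come from an external tagger, a small guard can keep them inside the set listed above. This is a sketch under the assumption that any tag outside that list should fall back to "X"; the helper `normalize_pos` is hypothetical, and the project may map some tags differently.

```python
SUPPORTED_POS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON",
    "DET", "NUM", "PART", "CCONJ", "X",
}


def normalize_pos(tag):
    """Upper-case a tag and fall back to "X" when it is missing or unsupported."""
    if not tag:
        return "X"
    tag = tag.strip().upper()
    return tag if tag in SUPPORTED_POS else "X"


print(normalize_pos("verb"))   # VERB
print(normalize_pos("PROPN"))  # X (not in the list above)
```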
Quick start — browser (ONNX)
The 30 MB gzip-compressed ONNX artifact (P3_0017_full.onnx.gz) is stored in
Git LFS. Serve it with Content-Encoding: gzip so browsers decompress it
transparently.
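For local testing, the header requirement can be met with Python's standard `http.server`. The sketch below is an assumption-laden stand-in for the nginx / Express setups: the handler name, the `.onnx.gz` suffix check, and the port are illustrative choices, and it is not meant for production use.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def extra_headers(path: str) -> dict:
    """Headers to add for a request path; only the gzipped model needs any."""
    if path.split("?", 1)[0].endswith(".onnx.gz"):
        return {"Content-Encoding": "gzip"}
    return {}


class GzipModelHandler(SimpleHTTPRequestHandler):
    def guess_type(self, path):
        # Report the decompressed payload as opaque bytes, not as a gzip archive.
        if str(path).endswith(".onnx.gz"):
            return "application/octet-stream"
        return super().guess_type(path)

    def end_headers(self):
        for name, value in extra_headers(self.path).items():
            self.send_header(name, value)
        super().end_headers()


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    """Create (but do not start) a server for the current directory."""
    return ThreadingHTTPServer(("127.0.0.1", port), GzipModelHandler)
```

Calling `make_server().serve_forever()` from the directory that contains `models/` would then serve the artifact with the gzip encoding header set.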
import * as ort from "onnxruntime-web";
const session = await ort.InferenceSession.create(
  "/models/P3_0017_full.onnx.gz",
);
// Build a Float32Array of 132 features (see manifest.json for order)
const tensor = new ort.Tensor("float32", featureArray, [1, 132]);
const results = await session.run({ float_input: tensor });
const vowelIndex = Number(results["label"].data[0]);
See src/stress_prediction/lightgbm/documentation/LUSCINIA_LGBM_V1_DEPLOYMENT.md for the full deployment guide (nginx / Express serving, batch inference, feature order).
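Because the model consumes a positional vector, features must be laid out in exactly the order the manifest prescribes. The sketch below shows one way to enforce that, assuming the order is available as a plain list of feature names; the layout of `manifest.json` is not shown here, and the feature names in the example are made up.

```python
def to_feature_row(features: dict, feature_order: list) -> list:
    """Arrange named features into a flat float list following `feature_order`."""
    missing = [name for name in feature_order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # Extra keys in `features` are ignored; only the listed names are emitted.
    return [float(features[name]) for name in feature_order]


# Hypothetical names, for illustration only.
order = ["n_vowels", "word_len", "suffix_hash"]
row = to_feature_row({"word_len": 4, "suffix_hash": 17, "n_vowels": 2}, order)
print(row)  # [2.0, 4.0, 17.0]
```

Failing loudly on a missing name is deliberate: a silently shifted vector would still produce a prediction, just a wrong one.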
Modules
| Module | Path | What it does |
|---|---|---|
| ua-word-stress (npm) | packages/ua-stress-web/ | Zero-dependency TypeScript trie: mark, lookup, batch API |
| ua-word-stress-wasm (npm) | crates/wasm/ | Rust/WASM: IPA, morphology, batch API, no init() required |
| ukrainian_stress (Python) | crates/python/ | PyO3 extension with the same API as WASM, for Python |
| Rust core | crates/core/ | Dictionary embed, phonetic pipeline, syllabifier (shared by WASM + Python) |
| ML resolver (LightGBM) | src/stress_prediction/lightgbm/ | Luscinia model: 99.44 % accuracy, 132 features, ONNX export |
| NLP pipeline | src/stress_resolver/ | spaCy tokenization → Rust dict lookup → ML fallback |
| Data management | src/data_management/ | Source parsers, master SQLite DB builder, binary trie exporter |
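The "dict lookup → ML fallback" flow in the NLP pipeline row can be pictured as a chain of resolvers tried in order. The sketch below is a simplified illustration with hypothetical names (`resolve_stress`, `dict_resolver`, `ml_stub`); the real chain is configured in `src/stress_resolver/` and may differ.

```python
from typing import Callable, Optional, Sequence, Tuple

Resolver = Callable[[str], Optional[int]]


def resolve_stress(
    word: str, resolvers: Sequence[Tuple[str, Resolver]]
) -> Optional[Tuple[str, int]]:
    """Return (resolver name, stressed syllable index) from the first resolver that answers."""
    for name, resolver in resolvers:
        index = resolver(word)
        if index is not None:
            return name, index
    return None


TOY_DICT = {"мама": 0}  # stand-in for the embedded dictionary


def dict_resolver(word: str) -> Optional[int]:
    return TOY_DICT.get(word.lower())


def ml_stub(word: str) -> Optional[int]:
    return 0  # placeholder for a model call; always answers


chain = [("dict", dict_resolver), ("ml", ml_stub)]
print(resolve_stress("мама", chain))  # ('dict', 0)
print(resolve_stress("борщ", chain))  # ('ml', 0)
```

Returning the resolver name alongside the index makes it easy to tell dictionary hits from model guesses downstream.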
Project structure
ua-stress-engine/
├── crates/
│   ├── core/                     # Rust core library (dict embed, phonetics, syllabifier)
│   ├── wasm/                     # ua-word-stress-wasm (wasm-pack, bundler target)
│   │   ├── src/lib.rs
│   │   └── pkg/                  # built npm package (gitignored except README + package.json)
│   ├── python/                   # ukrainian_stress PyO3 extension (maturin)
│   │   └── src/lib.rs
│   └── builder/                  # CLI tool to compile the embedded binary dictionary
├── packages/
│   └── ua-stress-web/            # ua-word-stress npm package (TypeScript, zero deps)
│       ├── src/                  # UaStressTrie.ts, types.ts, utils.ts
│       ├── tests/
│       └── package.json
├── src/
│   ├── stress_resolver/          # Python NLP pipeline + resolver chain
│   │   ├── pipeline.py           # UkrainianPipeline
│   │   ├── stress_resolver.py    # Rust-extension-based resolver
│   │   ├── ml_stress_resolver.py # LightGBM-based resolver
│   │   └── resolver_factory.py   # Auto-configure resolver chain
│   ├── nlp/
│   │   ├── stress_service/       # Stress lookup wrapper
│   │   ├── phonetic/             # IPA transcription (Python side)
│   │   └── tokenization_service/ # spaCy tokenizer wrapper
│   ├── stress_prediction/
│   │   └── lightgbm/             # Luscinia model, training scripts, services, artifacts
│   └── data_management/
│       ├── sources/              # Source parsers (kaikki, trie, txt, variative)
│       ├── transform/            # Master DB builder (SQLite)
│       └── export/
│           └── web_stress_db/    # Binary .ctrie builder → packages/ua-stress-web/data/
├── build_master_db.py            # Build master SQLite from all sources
├── build_web_stress_db.py        # Build + export binary trie
├── pyproject.toml                # maturin build config (points to crates/python/)
└── tests/
    └── src/
        ├── stress_resolver/      # Pipeline + resolver tests
        ├── stress_prediction/    # LightGBM model tests
        ├── data_management/      # Source parser + DB tests
        └── nlp/                  # Stress service tests
Data sources
The embedded dictionary is compiled from four open Ukrainian stress resources:
| Source | License | Entries | Notes |
|---|---|---|---|
| kaikki.org Ukrainian (Wiktionary extract) | CC BY-SA 4.0 | ~2 M inflected forms | POS + full morphology |
| lang-uk/ukrainian-word-stress (marisa-trie) | MIT | ~2.9 M word forms | compact trie with morph tags |
| lang-uk/ukrainian-word-stress-dictionary (text dict) | see upstream | ~2.9 M word forms | based on ULIF / NASU corpora |
| ua_variative_stressed_words (curated free-variant list) | original work | ~150 lemmas | marks freely variable stress |
All four sources are merged into a single master SQLite database (~680 MB), which is then compiled into the embedded binary (ua_stress.bin.bz2) shipped inside the Rust crates.
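The merge step can be pictured with a small in-memory SQLite example. The table layout, column names, and the `merge_sources` helper below are assumptions for illustration; the actual schema is defined by `build_master_db.py` and is not reproduced here.

```python
import sqlite3


def merge_sources(sources: dict) -> sqlite3.Connection:
    """Merge {source name: [(word, stress_index), ...]} into one deduplicated table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stress ("
        " word TEXT NOT NULL,"
        " stress_index INTEGER NOT NULL,"
        " source TEXT NOT NULL,"
        " UNIQUE (word, stress_index))"
    )
    for source_name, entries in sources.items():
        # A (word, stress_index) pair already seen in an earlier source is skipped.
        conn.executemany(
            "INSERT OR IGNORE INTO stress VALUES (?, ?, ?)",
            [(word, index, source_name) for word, index in entries],
        )
    conn.commit()
    return conn


conn = merge_sources({
    "source_a": [("мама", 0), ("замок", 0)],
    "source_b": [("мама", 0), ("замок", 1)],
})
rows = conn.execute(
    "SELECT stress_index FROM stress WHERE word = ? ORDER BY stress_index",
    ("замок",),
).fetchall()
print(rows)  # [(0,), (1,)]
```

Keying uniqueness on the (word, stress index) pair keeps both readings of a heteronym while dropping entries that several sources agree on.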
Running tests
# Python tests (requires ml + nlp extras installed)
python -m pytest tests/ -q
# TypeScript trie package
cd packages/ua-stress-web && pnpm test
# WASM package
cd crates/wasm && wasm-pack test --node
Large files (Git LFS)
The following binary artifacts are stored in Git LFS:
| File | Size |
|---|---|
| P3_0017_full.lgb | 259 MB |
| P3_0017_full.onnx | 185 MB |
| P3_0017_full.onnx.gz | 30 MB |
| stress.lmdb | varies |
License
AGPL-3.0