Ukrainian NLP backend for word stress prediction: the Luscinia LightGBM model (99.44 % accuracy) with ONNX export for the browser.

ua-stress-engine

Ukrainian word stress engine — dictionary lookup with full IPA transcription, ML stress prediction, and published packages for Python, Node.js, and the browser.

The centrepiece is Luscinia, a single LightGBM model that predicts the stressed vowel of a Ukrainian word for all syllable counts (2 to 10+), with a reported accuracy of 99.44 % on the project's sanity sample. The model is also exported to ONNX for browser-side inference via onnxruntime-web.

Published packages

| Package | Registry | Source | Description |
|---|---|---|---|
| ua-word-stress | npm | packages/ua-stress-web/ | Zero-dependency TypeScript trie (~9 MB, browser + Node) |
| ua-word-stress-wasm | npm | crates/wasm/ | Rust/WASM: full IPA, morphology, batch API |
| ua-stress-ml | npm | packages/ua-stress-ml/ | ONNX Luscinia predictor for OOV words (browser/worker) |
| ua-stress-engine | PyPI | crates/python/ | PyO3 extension (ukrainian_stress), same API as WASM |
| luscinia | PyPI | packages/luscinia/ | Python ONNX Luscinia predictor (OOV fallback) |

Highlights

Model: luscinia-lgbm-str-ua-univ-v1
Task: Ukrainian word stress prediction (multiclass, vowel-ordinal)
Accuracy: 99.44 % (sanity sample); 192 / 197 hand-checked (97.5 %)
Syllable coverage: 2 to 10+ syllables, single universal model
Features: 132 linguistic / hash features
Runtimes: lightgbm (Python), ONNX (browser via onnxruntime-web)
Training data: 2.875 M word forms
License: AGPL-3.0

Installation

JavaScript / TypeScript

# Trie-based lookup (browser + Node, no WASM)
npm install ua-word-stress

# Full engine — IPA, morphology, batch API (WASM, bundler required)
npm install ua-word-stress-wasm

Python

Published packages:

pip install ua-stress-engine
pip install luscinia

ua-stress-engine is a compiled Rust extension (PyO3 + maturin). Its runtime Python dependencies are grouped into optional extras:

| Extra | Packages | Purpose |
|---|---|---|
| (core) | none | Dictionary lookup + IPA via Rust extension |
| ml | lightgbm>=4.0, numpy>=1.24 | Luscinia LightGBM resolver |
| nlp | spacy>=3.7 | spaCy tokenization pipeline |
| full | all of the above | Everything |
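
Before running the ML or NLP examples, it can help to check which extras are importable in the current environment. The sketch below is not part of the package: the `EXTRAS` mapping is copied by hand from the table above, and `missing_modules` is a hypothetical helper name. It only tests importability, not the minimum versions.

```python
from importlib.util import find_spec

# Import names behind each extra, copied from the extras table (assumption:
# the import name equals the distribution name for all three packages).
EXTRAS = {
    "ml": ("lightgbm", "numpy"),
    "nlp": ("spacy",),
}


def missing_modules(extra: str) -> list:
    """Return the modules required by `extra` that cannot be imported."""
    return [name for name in EXTRAS[extra] if find_spec(name) is None]


for extra in EXTRAS:
    missing = missing_modules(extra)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{extra:4} {status}")
```

If an extra is reported missing, installing the matching packages from the table (or the `full` extra) should resolve it.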

For local development (requires Rust toolchain and maturin):

pip install maturin
pip install -e '.[full]'   # builds the Rust extension in-place

Quick start — JavaScript / TypeScript

Trie-based lookup (ua-word-stress)

Pure TypeScript, no WASM, works in browsers and Node:

import { UaStressTrie } from "ua-word-stress";

const trie = new UaStressTrie();
trie.mark("університет"); // → 'університе́т'
trie.lookup("замок"); // → 0 (first syllable: за́мок, "castle"; замо́к with stress on the second syllable means "lock")
trie.markBatch(["мама", "тато"]); // → ['ма́ма', 'та́то']
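
To illustrate the data structure behind this kind of lookup, here is a minimal Python sketch of a character trie that maps a word to a 0-based stressed-syllable index. It is a teaching example only: `StressTrie` and `TrieNode` are hypothetical names, and the published package ships a compact binary trie whose format is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    stress: Optional[int] = None  # set only on nodes that end a known word


class StressTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str, syllable_index: int) -> None:
        """Store the stressed-syllable index at the node ending `word`."""
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.stress = syllable_index

    def lookup(self, word: str) -> Optional[int]:
        """Return the stored index, or None for unknown words and bare prefixes."""
        node = self.root
        for ch in word.lower():
            node = node.children.get(ch)
            if node is None:
                return None
        return node.stress


trie = StressTrie()
trie.insert("мама", 0)
trie.insert("університет", 4)
print(trie.lookup("мама"), trie.lookup("xyz"))
```

Words that share a prefix share nodes, which is why a trie stays compact for a dictionary of inflected forms.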

Full WASM engine (ua-word-stress-wasm)

Rust/WASM with IPA transcription, morphology, and batch API. No init() call needed — the dictionary loads automatically at module import (bundler target):

import { mark, lookup, stressIndex, lookupMany, markMany, transcribe } from "ua-word-stress-wasm";

mark("університет"); // → 'університе́т'
stressIndex("мама"); // → 0  (0-based syllable index)

const r = lookup("замок");
r.readings[0].stressedForm; // → 'за́мок'
r.readings[0].ipa; // → 'zɑmɔk'
r.readings[0].syllableIndex; // → 0
r.readings[1].stressedForm; // → 'замо́к'  (heteronym)

transcribe("слово", 0); // → { ipa: 'slɔwɔ', ipaSyllables: ['ˈslɔ', 'wɔ'], … }

const batch = lookupMany(["мама", "замок", "xyz"]);
const marked = markMany(["мама", "тато", "xyz"]);

See the WASM package README for the full API reference.

Quick start — Python

Dictionary + IPA (Rust extension)

import ukrainian_stress

ukrainian_stress.mark('університет')   # → 'університе́т'

r = ukrainian_stress.lookup('замок')
r['readings'][0]['stressed_form']      # → 'за́мок'
r['readings'][0]['ipa']               # → 'zɑmɔk'
r['readings'][0]['syllable_index']    # → 0

batch = ukrainian_stress.lookup_many(['мама', 'університет', 'xyz'])
marks = ukrainian_stress.mark_many(['мама', 'тато', 'xyz'])
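
A lookup can return more than one reading (замок is a heteronym), so callers usually want all distinct stressed forms rather than only `readings[0]`. The sketch below works on a hand-written stand-in for the result of `ukrainian_stress.lookup('замок')`. The shape of the second reading and the handling of a result without a `readings` key are assumptions, and `stressed_variants` / `is_heteronym` are hypothetical helper names.

```python
def stressed_variants(result: dict) -> list:
    """Collect distinct stressed forms in the order the readings appear."""
    seen = []
    for reading in result.get("readings", []):
        form = reading["stressed_form"]
        if form not in seen:
            seen.append(form)
    return seen


def is_heteronym(result: dict) -> bool:
    """True when the readings disagree on where the stress falls."""
    return len(stressed_variants(result)) > 1


# Stand-in data mirroring the documented fields; not produced by the library.
sample = {
    "readings": [
        {"stressed_form": "за\u0301мок", "syllable_index": 0},
        {"stressed_form": "замо\u0301к", "syllable_index": 1},
    ]
}

print(stressed_variants(sample))
print(is_heteronym(sample))
```

With a real result, the same helpers can be applied to each item returned by `lookup_many`.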

Full pipeline (Rust dict + LightGBM ML fallback)

Requires pip install -e '.[full]':

from src.stress_resolver.resolver_factory import create_pipeline_kwargs
from src.stress_resolver.pipeline import UkrainianPipeline

pipeline = UkrainianPipeline(**create_pipeline_kwargs())

doc = pipeline.process("Мама варила борщ на кухні.")
for sentence in doc.sentences:
    for token in sentence.tokens:
        print(f"{token.text:15} {token.stress_pattern}")
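
The pipeline tries the dictionary first and falls back to the ML model for out-of-vocabulary words. The general pattern can be sketched as a resolver chain. Everything below is illustrative: `resolve`, `dict_resolver`, and `toy_ml_resolver` are hypothetical names, and the toy "model" simply stresses the first vowel, which is not what Luscinia does.

```python
from typing import Callable, Optional, Sequence

Resolver = Callable[[str], Optional[str]]

VOWELS = set("аеєиіїоуюя")
TOY_DICT = {"мама": "ма\u0301ма"}  # stand-in for the embedded dictionary


def dict_resolver(word: str) -> Optional[str]:
    """Return the stressed form for known words, None otherwise."""
    return TOY_DICT.get(word.lower())


def toy_ml_resolver(word: str) -> Optional[str]:
    """Placeholder for the ML fallback: stress the first vowel, if any."""
    for i, ch in enumerate(word):
        if ch.lower() in VOWELS:
            return word[: i + 1] + "\u0301" + word[i + 1 :]
    return None


def resolve(word: str, resolvers: Sequence[Resolver]) -> str:
    """Return the first non-None answer; leave the word unmarked otherwise."""
    for resolver in resolvers:
        marked = resolver(word)
        if marked is not None:
            return marked
    return word


chain = [dict_resolver, toy_ml_resolver]
for w in ["мама", "тато", "xyz"]:
    print(w, "->", resolve(w, chain))
```

Ordering the chain from most to least reliable keeps dictionary answers authoritative and limits the model to words the dictionary does not cover.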

Raw Luscinia prediction (LightGBM)

import lightgbm as lgb
import numpy as np
from src.stress_prediction.lightgbm.services.feature_service_universal import (
    build_features_universal,
)

MODEL_PATH = (
    "src/stress_prediction/lightgbm/artifacts/"
    "luscinia-lgbm-str-ua-univ-v1/P3_0017_FINAL_FULLDATA/P3_0017_full.lgb"
)
bst = lgb.Booster(model_file=MODEL_PATH)

VOWELS = set("аеєиіїоуюя")

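def predict_stress(word: str, pos: str = "NOUN") -> str:
    # The vector below relies on the insertion order of the feature dict.
    feat = build_features_universal(word, pos)
    X = np.array(list(feat.values()), dtype=np.float32).reshape(1, -1)
    # Multiclass output: one score per vowel ordinal (0 = first vowel).
    vowel_idx = int(bst.predict(X).argmax(axis=1)[0])
    vpos = [i for i, c in enumerate(word.lower()) if c in VOWELS]
    if vowel_idx >= len(vpos):
        return word  # predicted ordinal exceeds the vowel count: leave unmarked
    cp = vpos[vowel_idx]
    # U+0301 is a combining acute accent, so it goes right after the vowel.
    return word[: cp + 1] + "\u0301" + word[cp + 1 :]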

print(predict_stress("університет", "NOUN"))  # → університе́т
print(predict_stress("читати",      "VERB"))  # → чита́ти

POS tags: use Universal Dependencies tags (NOUN, VERB, ADJ, ADV, PRON, DET, NUM, PART, CCONJ, X). Pass "X" when the POS is unknown.
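
When tags come from another tagger, a small normalisation step avoids passing values outside this list. The sketch below is an assumption about sensible caller-side behaviour, not part of the project: `normalize_pos` is a hypothetical helper, and mapping every unlisted tag (for example PROPN) to "X" is a choice made here for illustration.

```python
from typing import Optional

# The tags named in the list above.
SUPPORTED_POS = frozenset(
    {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "NUM", "PART", "CCONJ", "X"}
)


def normalize_pos(tag: Optional[str]) -> str:
    """Upper-case the tag and fall back to "X" for empty or unlisted values."""
    if not tag:
        return "X"
    tag = tag.strip().upper()
    return tag if tag in SUPPORTED_POS else "X"


print(normalize_pos("noun"), normalize_pos(None), normalize_pos("PROPN"))
```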

Quick start — browser (ONNX)

The 30 MB gzip-compressed ONNX artifact (P3_0017_full.onnx.gz) is stored in Git LFS. Serve it with Content-Encoding: gzip so browsers decompress it transparently.

import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create(
  "/models/P3_0017_full.onnx.gz",
);

// Build a Float32Array of 132 features (see manifest.json for order)
const tensor = new ort.Tensor("float32", featureArray, [1, 132]);
const results = await session.run({ float_input: tensor });
const vowelIndex = Number(results["label"].data[0]);
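
The model expects the 132 features in a fixed order, so it is safer to build the input from an explicit name list than from whatever order a dict happens to have. The following Python sketch shows one way to do that. The structure of manifest.json is not documented here, so `feature_order` is assumed to be a plain list of feature names loaded from it, and `to_feature_vector` is a hypothetical helper.

```python
from array import array
from typing import Mapping, Sequence


def to_feature_vector(
    features: Mapping[str, float],
    feature_order: Sequence[str],
    expected: int = 132,
) -> array:
    """Return float32 values arranged in `feature_order`."""
    if len(feature_order) != expected:
        raise ValueError(f"expected {expected} names, got {len(feature_order)}")
    missing = [name for name in feature_order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    # Typecode "f" stores C floats, matching the model's float32 input.
    return array("f", (float(features[name]) for name in feature_order))


# Toy example with three made-up feature names instead of the real 132.
order = ["b", "a", "c"]
vec = to_feature_vector({"a": 1, "b": 2, "c": 3, "extra": 9}, order, expected=3)
print(list(vec))
```

Extra keys are ignored and missing ones fail loudly, which makes a mismatch between the feature builder and the exported model easier to spot.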

See src/stress_prediction/lightgbm/documentation/LUSCINIA_LGBM_V1_DEPLOYMENT.md for the full deployment guide (nginx / Express serving, batch inference, feature order).
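
For local experiments, the gzip-serving requirement can also be met with a few lines of Python. This is a minimal sketch using only the standard library, not the deployment described in the guide: `make_model_app`, `serve`, the URL path, the host, and the port are all assumptions chosen for illustration.

```python
from pathlib import Path
from wsgiref.simple_server import make_server

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def make_model_app(gz_path: str, url_path: str = "/models/P3_0017_full.onnx.gz"):
    """Build a WSGI app that serves one pre-compressed file."""
    payload = Path(gz_path).read_bytes()
    if not payload.startswith(GZIP_MAGIC):
        raise ValueError(f"{gz_path} does not look like a gzip file")

    def app(environ, start_response):
        if environ.get("PATH_INFO") != url_path:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not found"]
        headers = [
            ("Content-Type", "application/octet-stream"),
            # Tells the browser to decompress before handing bytes to the page.
            ("Content-Encoding", "gzip"),
            ("Content-Length", str(len(payload))),
        ]
        start_response("200 OK", headers)
        return [payload]

    return app


def serve(gz_path: str, host: str = "127.0.0.1", port: int = 8000) -> None:
    """Blocking development server; not intended for production use."""
    with make_server(host, port, make_model_app(gz_path)) as server:
        server.serve_forever()
```

Calling `serve("P3_0017_full.onnx.gz")` from the directory that holds the artifact would expose it at the path used in the browser example, assuming the page is served from the same origin or CORS is handled separately.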

API status

The canonical API contract is documented in documentation/API_DESIGN.md. The packages currently map to it as follows:

  • ua-word-stress (npm): trie lookup API (lookup, lookupFull, mark, batch methods)
  • ua-word-stress-wasm (npm): full Rust API (lookup, lookupBatch/lookupMany, mark, markBatch/markMany, stressIndex, transcribe)
  • ua-stress-engine (PyPI): Python binding module ukrainian_stress, including batch functions (lookup_many, mark_many)
  • ua-stress-ml (npm) and luscinia (PyPI): ML OOV fallback predictors (132-feature Luscinia ONNX model)

Modules

| Module | Path | What it does |
|---|---|---|
| ua-word-stress (npm) | packages/ua-stress-web/ | Zero-dependency TypeScript trie: mark, lookup, batch API |
| ua-word-stress-wasm (npm) | crates/wasm/ | Rust/WASM: IPA, morphology, batch API, no init() required |
| ua-stress-ml (npm) | packages/ua-stress-ml/ | Browser/worker ONNX Luscinia predictor for OOV stress |
| ukrainian_stress (Python) | crates/python/ | PyO3 extension with the same API as WASM, for Python |
| luscinia (Python) | packages/luscinia/ | Python ONNX Luscinia predictor package (PyPI) |
| Rust core | crates/core/ | Dictionary embed, phonetic pipeline, syllabifier (shared by WASM + Python) |
| ML resolver (LightGBM) | src/stress_prediction/lightgbm/ | Luscinia model: 99.44 % accuracy, 132 features, ONNX export |
| NLP pipeline | src/stress_resolver/ | spaCy tokenization → Rust dict lookup → ML fallback |
| Data management | src/data_management/ | Source parsers, master SQLite DB builder, binary trie exporter |

Project structure

ua-stress-engine/
├── crates/
│   ├── core/                      # Rust core library (dict embed, phonetics, syllabifier)
│   ├── wasm/                      # ua-word-stress-wasm (wasm-pack, bundler target)
│   │   ├── src/lib.rs
│   │   └── pkg/                   # built npm package (gitignored except README + package.json)
│   ├── python/                    # ukrainian_stress PyO3 extension (maturin)
│   │   └── src/lib.rs
│   └── builder/                   # CLI tool to compile the embedded binary dictionary
├── packages/
│   └── ua-stress-web/             # ua-word-stress npm package (TypeScript, zero deps)
│       ├── src/                   # UaStressTrie.ts, types.ts, utils.ts
│       ├── tests/
│       └── package.json
├── src/
│   ├── stress_resolver/           # Python NLP pipeline + resolver chain
│   │   ├── pipeline.py            # UkrainianPipeline
│   │   ├── stress_resolver.py     # Rust-extension-based resolver
│   │   ├── ml_stress_resolver.py  # LightGBM-based resolver
│   │   └── resolver_factory.py    # Auto-configure resolver chain
│   ├── nlp/
│   │   ├── stress_service/        # Stress lookup wrapper
│   │   ├── phonetic/              # IPA transcription (Python side)
│   │   └── tokenization_service/  # spaCy tokenizer wrapper
│   ├── stress_prediction/
│   │   └── lightgbm/              # Luscinia model, training scripts, services, artifacts
│   └── data_management/
│       ├── sources/               # Source parsers (kaikki, trie, txt, variative)
│       ├── transform/             # Master DB builder (SQLite)
│       └── export/
│           └── web_stress_db/     # Binary .ctrie builder → packages/ua-stress-web/data/
├── build_master_db.py             # Build master SQLite from all sources
├── build_web_stress_db.py         # Build + export binary trie
├── pyproject.toml                 # maturin build config (points to crates/python/)
└── tests/
    └── src/
        ├── stress_resolver/       # Pipeline + resolver tests
        ├── stress_prediction/     # LightGBM model tests
        ├── data_management/       # Source parser + DB tests
        └── nlp/                   # Stress service tests

Data sources

The embedded dictionary is compiled from five open Ukrainian stress resources:

| Source | License | Entries | Notes |
|---|---|---|---|
| kaikki.org Ukrainian (Wiktionary extract) | CC BY-SA 4.0 | ~2 M inflected forms | POS + full morphology |
| lang-uk/ukrainian-word-stress (marisa-trie) | MIT | ~2.9 M word forms | compact trie with morph tags |
| lang-uk/ukrainian-word-stress-dictionary (text dict) | see upstream | ~2.9 M word forms | based on ULIF / NASU corpora |
| bakustarver/ukr-dictionaries-list-opensource (SUM11 DiktJson, ukr-ukr_SUM-11_or_1) | public domain (original SUM-11), digitised JSON | ~127 K lemmas | classic 11-volume explanatory dictionary |
| ua_variative_stressed_words (curated free-variant list) | original work | ~150 lemmas | marks freely variable stress |

All five sources are merged into a single master SQLite (~680 MB) and then compiled into the embedded binary (ua_stress.bin.bz2) shipped inside the Rust crates.
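
The merge step can be pictured with a small in-memory SQLite example. The schema below is hypothetical and far simpler than the project's master database (no POS, morphology, or IPA columns); `build_master` and `readings` are illustrative names. It shows how rows from several sources can be combined so that agreeing sources reinforce a reading and disagreeing ones surface as variants.

```python
import sqlite3


def build_master(sources: dict) -> sqlite3.Connection:
    """Merge {source_name: [(form, syllable_index), ...]} into one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE stress (
               form TEXT NOT NULL,
               syllable_index INTEGER NOT NULL,
               source TEXT NOT NULL,
               PRIMARY KEY (form, syllable_index, source)
           )"""
    )
    for source, rows in sources.items():
        # OR IGNORE drops exact duplicates coming from the same source.
        conn.executemany(
            "INSERT OR IGNORE INTO stress VALUES (?, ?, ?)",
            [(form, idx, source) for form, idx in rows],
        )
    conn.commit()
    return conn


def readings(conn: sqlite3.Connection, form: str) -> list:
    """Return (syllable_index, number_of_sources) pairs for a word form."""
    cur = conn.execute(
        "SELECT syllable_index, COUNT(*) FROM stress "
        "WHERE form = ? GROUP BY syllable_index ORDER BY syllable_index",
        (form,),
    )
    return cur.fetchall()


conn = build_master({
    "source_a": [("замок", 0), ("мама", 0)],
    "source_b": [("замок", 1), ("мама", 0)],
})
print(readings(conn, "замок"))
print(readings(conn, "мама"))
```

The later compilation into the compressed binary is a separate step and is not shown here.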

Running tests

# Python tests (requires ml + nlp extras installed)
python -m pytest tests/ -q

# TypeScript trie package
cd packages/ua-stress-web && pnpm test

# WASM package
cd crates/wasm && wasm-pack test --node

Large files (Git LFS)

The following binary artifacts are stored in Git LFS:

| File | Size |
|---|---|
| P3_0017_full.lgb | 259 MB |
| P3_0017_full.onnx | 185 MB |
| P3_0017_full.onnx.gz | 30 MB |
| stress.lmdb | varies |

License

AGPL-3.0
