Ukrainian NLP backend for word stress prediction: the Luscinia LightGBM model (99.44 % accuracy) with ONNX browser export.

ua-stress-engine

Ukrainian word stress engine — dictionary lookup with full IPA transcription, ML stress prediction, and published packages for Python, Node.js, and the browser.

The centrepiece is Luscinia, a LightGBM model that predicts the stressed vowel of a Ukrainian word. It reaches 99.44 % accuracy on a sanity sample and uses one universal model for all syllable counts. The model is also exported to ONNX for browser-side inference via onnxruntime-web.

Published packages

| Package | Registry | Source | Description |
|---|---|---|---|
| ua-word-stress | npm | packages/ua-stress-web/ | Zero-dependency TypeScript trie (~9 MB, browser + Node) |
| ua-word-stress-wasm | npm | crates/wasm/ | Rust/WASM: full IPA, morphology, batch API |
| ua-stress-engine | PyPI (planned) | crates/python/ | PyO3 extension: same API as WASM, for Python |

Highlights

| | |
|---|---|
| Model | luscinia-lgbm-str-ua-univ-v1 |
| Task | Ukrainian word stress prediction (multiclass, vowel-ordinal) |
| Accuracy | 99.44 % (sanity sample) · 192 / 197 hand-checked |
| Syllable coverage | 2–10+ syllable words, single universal model |
| Features | 132 linguistic / hash features |
| Runtimes | lightgbm (Python) · ONNX (browser via onnxruntime-web) |
| Training data | 2.7 M word forms |
| License | AGPL-3.0 |
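
The "vowel-ordinal" task above can be read as: the class label is the 0-based position of the stressed vowel among the word's vowels. The following is a minimal sketch of that encoding, not the project's training code. The helper names `vowel_ordinal` and `place_stress` are hypothetical, and the sketch assumes precomposed (NFC) letters with U+0301 as the only combining mark.

```python
VOWELS = set("аеєиіїоуюя")
ACUTE = "\u0301"  # combining acute accent used as the stress mark


def vowel_ordinal(stressed: str) -> int:
    """Return the 0-based ordinal of the vowel that carries the stress mark."""
    ordinal = -1
    for i, ch in enumerate(stressed):
        if ch.lower() in VOWELS:
            ordinal += 1
            # The stress mark, if present, directly follows the vowel.
            if stressed[i + 1 : i + 2] == ACUTE:
                return ordinal
    raise ValueError("no stressed vowel found")


def place_stress(word: str, ordinal: int) -> str:
    """Insert the stress mark after the vowel with the given 0-based ordinal."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in VOWELS]
    cut = positions[ordinal] + 1
    return word[:cut] + ACUTE + word[cut:]


label = vowel_ordinal("університе\u0301т")  # vowels: у і е и е
restored = place_stress("університет", label)
```

The two functions are inverses of each other for words with a single stress mark, which makes the label easy to round-trip when checking predictions by hand.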

Installation

JavaScript / TypeScript

# Trie-based lookup (browser + Node, no WASM)
npm install ua-word-stress

# Full engine — IPA, morphology, batch API (WASM, bundler required)
npm install ua-word-stress-wasm

Python

The package is a compiled Rust extension (PyO3 + maturin). Runtime Python dependencies:

| Extra | Packages | Purpose |
|---|---|---|
| (core) | none | Dictionary lookup + IPA via Rust extension |
| ml | lightgbm>=4.0, numpy>=1.24 | Luscinia LightGBM resolver |
| nlp | spacy>=3.7 | spaCy tokenization pipeline |
| full | all of the above | Everything |

For local development (requires Rust toolchain and maturin):

pip install maturin
pip install -e '.[full]'   # builds the Rust extension in-place

Quick start — JavaScript / TypeScript

Trie-based lookup (ua-word-stress)

Pure TypeScript, no WASM, works in any environment:

import { UaStressTrie } from "ua-word-stress";

const trie = new UaStressTrie();
trie.mark("університет"); // → 'університе́т'
trie.lookup("замок"); // → 0 (first syllable: за́мок, "castle"; замо́к is "lock")
trie.markBatch(["мама", "тато"]); // → ['ма́ма', 'та́то']

Full WASM engine (ua-word-stress-wasm)

Rust/WASM with IPA transcription, morphology, and batch API. No init() call needed — the dictionary loads automatically at module import (bundler target):

import { mark, lookup, stressIndex, transcribe } from "ua-word-stress-wasm";

mark("університет"); // → 'університе́т'
stressIndex("мама"); // → 0  (0-based syllable index)

const r = lookup("замок");
r.readings[0].stressedForm; // → 'за́мок'
r.readings[0].ipa; // → 'zɑmɔk'
r.readings[0].syllableIndex; // → 0
r.readings[1].stressedForm; // → 'замо́к'  (heteronym)

transcribe("слово", 0); // → { ipa: 'slɔwɔ', ipaSyllables: ['ˈslɔ', 'wɔ'], … }

See the WASM package README for the full API reference.

Quick start — Python

Dictionary + IPA (Rust extension)

import ukrainian_stress

ukrainian_stress.mark('університет')   # → 'університе́т'

r = ukrainian_stress.lookup('замок')
r['readings'][0]['stressed_form']      # → 'за́мок'
r['readings'][0]['ipa']               # → 'zɑmɔk'
r['readings'][0]['syllable_index']    # → 0
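
A word such as замок has more than one reading, so callers may want every stressed form rather than only `readings[0]`. The sketch below works on a plain dict shaped like the lookup result shown above. The helpers `stressed_variants` and `is_heteronym` are hypothetical, and `sample` is a hand-written stand-in so the block runs without the compiled extension; it is not real library output.

```python
def stressed_variants(result: dict) -> list:
    """Collect the distinct stressed forms of a lookup result, in order."""
    seen = []
    for reading in result.get("readings", []):
        form = reading["stressed_form"]
        if form not in seen:
            seen.append(form)
    return seen


def is_heteronym(result: dict) -> bool:
    """True when the readings disagree on where the stress falls."""
    return len(stressed_variants(result)) > 1


# Hand-written stand-in for a lookup result (assumed shape, not real output).
sample = {
    "readings": [
        {"stressed_form": "за\u0301мок", "syllable_index": 0},
        {"stressed_form": "замо\u0301к", "syllable_index": 1},
    ]
}

variants = stressed_variants(sample)
```

With the real extension installed, the same helpers could be applied to the value returned by `ukrainian_stress.lookup(...)`, assuming it keeps the `readings` / `stressed_form` keys shown above.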

Full pipeline (Rust dict + LightGBM ML fallback)

Requires pip install -e '.[full]':

from src.stress_resolver.resolver_factory import create_pipeline_kwargs
from src.stress_resolver.pipeline import UkrainianPipeline

pipeline = UkrainianPipeline(**create_pipeline_kwargs())

doc = pipeline.process("Мама варила борщ на кухні.")
for sentence in doc.sentences:
    for token in sentence.tokens:
        print(f"{token.text:15} {token.stress_pattern}")
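
The "dictionary first, ML fallback" order of the pipeline can be pictured as a chain of resolvers that is tried in sequence. This is only an illustration of the pattern, not the code in `resolver_factory.py`: the `Resolver` type, `resolve_stress`, the toy dictionary, and the fake ML resolver are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional

# A resolver returns a 0-based stressed-vowel index, or None if it cannot decide.
Resolver = Callable[[str], Optional[int]]


def resolve_stress(word: str, resolvers: List[Resolver]) -> Optional[int]:
    """Return the answer of the first resolver that produces one."""
    for resolver in resolvers:
        index = resolver(word)
        if index is not None:
            return index
    return None


TOY_DICT = {"мама": 0}  # stand-in for the embedded dictionary


def dict_resolver(word: str) -> Optional[int]:
    return TOY_DICT.get(word.lower())


def fake_ml_resolver(word: str) -> Optional[int]:
    return 1  # stand-in for a model prediction; always "second vowel"


chain = [dict_resolver, fake_ml_resolver]
known = resolve_stress("Мама", chain)      # answered by the dictionary
unknown = resolve_stress("борщик", chain)  # falls through to the fake model
```

Keeping each resolver behind the same signature means the dictionary and the model can be tested separately and reordered without touching the caller.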

Raw Luscinia prediction (LightGBM)

import lightgbm as lgb
import numpy as np
from src.stress_prediction.lightgbm.services.feature_service_universal import (
    build_features_universal,
)

MODEL_PATH = (
    "src/stress_prediction/lightgbm/artifacts/"
    "luscinia-lgbm-str-ua-univ-v1/P3_0017_FINAL_FULLDATA/P3_0017_full.lgb"
)
bst = lgb.Booster(model_file=MODEL_PATH)

VOWELS = set("аеєиіїоуюя")

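def predict_stress(word: str, pos: str = "NOUN") -> str:
    # Feature order follows the dict returned by build_features_universal.
    feat = build_features_universal(word, pos)
    X = np.array(list(feat.values()), dtype=np.float32).reshape(1, -1)
    # Multiclass output: the most probable class is the 0-based vowel ordinal.
    vowel_idx = int(bst.predict(X).argmax(axis=1)[0])
    vpos = [i for i, c in enumerate(word.lower()) if c in VOWELS]
    if vowel_idx >= len(vpos):
        return word  # predicted ordinal exceeds the vowel count; leave unmarked
    cp = vpos[vowel_idx]
    # Insert the combining acute accent (U+0301) after the stressed vowel.
    return word[: cp + 1] + "\u0301" + word[cp + 1 :]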

print(predict_stress("університет", "NOUN"))  # → університе́т
print(predict_stress("читати",      "VERB"))  # → чита́ти

POS tags — use Universal Dependencies tags: NOUN VERB ADJ ADV PRON DET NUM PART CCONJ X. Pass "X" when POS is unknown.
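
Tags coming from a tagger may be lowercase, missing, or outside the list above. A small guard can map such input to `"X"` before it reaches the model. This is a sketch under an assumption: the README only says to pass `"X"` when the POS is unknown, so treating unlisted tags (for example `PROPN`) as unknown is a choice made here, and `normalize_pos` is a hypothetical helper.

```python
from typing import Optional

# Tags listed in the README for the Luscinia model.
SUPPORTED_POS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON",
    "DET", "NUM", "PART", "CCONJ", "X",
}


def normalize_pos(tag: Optional[str]) -> str:
    """Return a supported POS tag, falling back to "X" for anything else."""
    if not tag:
        return "X"
    cleaned = tag.strip().upper()
    return cleaned if cleaned in SUPPORTED_POS else "X"


examples = [normalize_pos(t) for t in ("noun", " VERB ", None, "PROPN")]
```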

Quick start — browser (ONNX)

The 30 MB gzip-compressed ONNX artifact (P3_0017_full.onnx.gz) is stored in Git LFS. Serve it with Content-Encoding: gzip so browsers decompress it transparently.

import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create(
  "/models/P3_0017_full.onnx.gz",
);

// Build a Float32Array of 132 features (see manifest.json for order)
const tensor = new ort.Tensor("float32", featureArray, [1, 132]);
const results = await session.run({ float_input: tensor });
const vowelIndex = Number(results["label"].data[0]);

See src/stress_prediction/lightgbm/documentation/LUSCINIA_LGBM_V1_DEPLOYMENT.md for the full deployment guide (nginx / Express serving, batch inference, feature order).
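
For local experiments, the `Content-Encoding: gzip` requirement can be met with the Python standard library alone. The sketch below is an assumption-laden stand-in for the nginx / Express setups in the deployment guide, intended for development only. The names `extra_headers`, `GzipModelHandler`, and `serve` are hypothetical, and the header is added only to successful (200) responses.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def extra_headers(path: str) -> dict:
    """Headers to add for a request path; only .onnx.gz files get gzip encoding."""
    clean = path.split("?", 1)[0]
    if clean.endswith(".onnx.gz"):
        return {"Content-Encoding": "gzip"}
    return {}


class GzipModelHandler(SimpleHTTPRequestHandler):
    """Static file handler that labels pre-compressed model files as gzip."""

    def send_response(self, code, message=None):
        self._status = code  # remember the status for end_headers
        super().send_response(code, message)

    def end_headers(self):
        if getattr(self, "_status", None) == 200:
            for name, value in extra_headers(self.path).items():
                self.send_header(name, value)
        super().end_headers()


def serve(port: int = 8000) -> None:
    """Serve the current directory on localhost (blocks until interrupted)."""
    ThreadingHTTPServer(("127.0.0.1", port), GzipModelHandler).serve_forever()
```

Calling `serve()` from the directory that contains `models/` would expose the artifact at the path used in the snippet above; the port is arbitrary.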

Modules

| Module | Path | What it does |
|---|---|---|
| ua-word-stress (npm) | packages/ua-stress-web/ | Zero-dependency TypeScript trie: mark, lookup, batch API |
| ua-word-stress-wasm (npm) | crates/wasm/ | Rust/WASM: IPA, morphology, batch API, no init() required |
| ukrainian_stress (Python) | crates/python/ | PyO3 extension: same API as WASM, for Python |
| Rust core | crates/core/ | Dictionary embed, phonetic pipeline, syllabifier (shared by WASM + Python) |
| ML resolver (LightGBM) | src/stress_prediction/lightgbm/ | Luscinia model: 99.44 % accuracy, 132 features, ONNX export |
| NLP pipeline | src/stress_resolver/ | spaCy tokenization → Rust dict lookup → ML fallback |
| Data management | src/data_management/ | Source parsers, master SQLite DB builder, binary trie exporter |

Project structure

ua-stress-engine/
├── crates/
│   ├── core/                      # Rust core library (dict embed, phonetics, syllabifier)
│   ├── wasm/                      # ua-word-stress-wasm (wasm-pack, bundler target)
│   │   ├── src/lib.rs
│   │   └── pkg/                   # built npm package (gitignored except README + package.json)
│   ├── python/                    # ukrainian_stress PyO3 extension (maturin)
│   │   └── src/lib.rs
│   └── builder/                   # CLI tool to compile the embedded binary dictionary
├── packages/
│   └── ua-stress-web/             # ua-word-stress npm package (TypeScript, zero deps)
│       ├── src/                   # UaStressTrie.ts, types.ts, utils.ts
│       ├── tests/
│       └── package.json
├── src/
│   ├── stress_resolver/           # Python NLP pipeline + resolver chain
│   │   ├── pipeline.py            # UkrainianPipeline
│   │   ├── stress_resolver.py     # Rust-extension-based resolver
│   │   ├── ml_stress_resolver.py  # LightGBM-based resolver
│   │   └── resolver_factory.py    # Auto-configure resolver chain
│   ├── nlp/
│   │   ├── stress_service/        # Stress lookup wrapper
│   │   ├── phonetic/              # IPA transcription (Python side)
│   │   └── tokenization_service/  # spaCy tokenizer wrapper
│   ├── stress_prediction/
│   │   └── lightgbm/              # Luscinia model, training scripts, services, artifacts
│   └── data_management/
│       ├── sources/               # Source parsers (kaikki, trie, txt, variative)
│       ├── transform/             # Master DB builder (SQLite)
│       └── export/
│           └── web_stress_db/     # Binary .ctrie builder → packages/ua-stress-web/data/
├── build_master_db.py             # Build master SQLite from all sources
├── build_web_stress_db.py         # Build + export binary trie
├── pyproject.toml                 # maturin build config (points to crates/python/)
└── tests/
    └── src/
        ├── stress_resolver/       # Pipeline + resolver tests
        ├── stress_prediction/     # LightGBM model tests
        ├── data_management/       # Source parser + DB tests
        └── nlp/                   # Stress service tests

Data sources

The embedded dictionary is compiled from four open Ukrainian stress resources:

| Source | License | Entries | Notes |
|---|---|---|---|
| kaikki.org Ukrainian (Wiktionary extract) | CC BY-SA 4.0 | ~2 M inflected forms | POS + full morphology |
| lang-uk/ukrainian-word-stress (marisa-trie) | MIT | ~2.9 M word forms | compact trie with morph tags |
| lang-uk/ukrainian-word-stress-dictionary (text dict) | see upstream | ~2.9 M word forms | based on ULIF / NASU corpora |
| ua_variative_stressed_words (curated free-variant list) | original work | ~150 lemmas | marks freely variable stress |

All four sources are merged into a single master SQLite database (~680 MB), which is then compiled into the embedded binary (ua_stress.bin.bz2) shipped inside the Rust crates.
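
The merge step can be pictured with a toy in-memory database. The README does not describe the real schema, so the table name `stress`, its columns, and the helpers `build_master` and `readings` below are hypothetical and only illustrate how several sources might agree or disagree on a form.

```python
import sqlite3


def build_master(sources: dict) -> sqlite3.Connection:
    """Merge {source_name: [(form, vowel_index), ...]} into one SQLite table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        """CREATE TABLE stress (
               form TEXT NOT NULL,
               vowel_index INTEGER NOT NULL,
               source TEXT NOT NULL,
               UNIQUE (form, vowel_index, source)
           )"""
    )
    for name, rows in sources.items():
        # INSERT OR IGNORE drops exact duplicates within one source.
        con.executemany(
            "INSERT OR IGNORE INTO stress VALUES (?, ?, ?)",
            [(form, index, name) for form, index in rows],
        )
    con.commit()
    return con


def readings(con: sqlite3.Connection, form: str) -> list:
    """Return (vowel_index, number_of_sources) pairs for a form."""
    cur = con.execute(
        "SELECT vowel_index, COUNT(*) FROM stress "
        "WHERE form = ? GROUP BY vowel_index ORDER BY vowel_index",
        (form,),
    )
    return cur.fetchall()


toy_sources = {
    "kaikki": [("замок", 0), ("замок", 1)],
    "trie": [("замок", 0), ("замок", 0)],
}
db = build_master(toy_sources)
merged = readings(db, "замок")
```

Counting how many sources support each reading is one simple way to surface heteronyms and conflicts before the binary dictionary is compiled.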

Running tests

# Python tests (requires ml + nlp extras installed)
python -m pytest tests/ -q

# TypeScript trie package
cd packages/ua-stress-web && pnpm test

# WASM package
cd crates/wasm && wasm-pack test --node

Large files (Git LFS)

The following binary artifacts are stored in Git LFS:

| File | Size |
|---|---|
| P3_0017_full.lgb | 259 MB |
| P3_0017_full.onnx | 185 MB |
| P3_0017_full.onnx.gz | 30 MB |
| stress.lmdb | varies |

License

AGPL-3.0
