Ukrainian NLP backend for word stress prediction: Luscinia LightGBM model (99.44 % accuracy) with ONNX browser export.

ua-stress-engine

Ukrainian word stress engine — dictionary lookup with full IPA transcription, ML stress prediction, and published packages for Python, Node.js, and the browser.

The centrepiece is Luscinia, a LightGBM model that predicts which vowel carries the stress in a Ukrainian word. A single model covers all syllable counts and reaches 99.44 % accuracy on the sanity sample. The model is also exported to ONNX for browser-side inference via onnxruntime-web.

Published packages

Package	Registry	Source	Description
`ua-word-stress`	npm	`packages/ua-stress-web/`	Zero-dependency TypeScript trie (~9 MB, browser + Node)
`ua-word-stress-wasm`	npm	`crates/wasm/`	Rust/WASM — full IPA, morphology, batch API
`ua-stress-ml`	npm	`packages/ua-stress-ml/`	ONNX Luscinia predictor for OOV words (browser/worker)
`ua-stress-engine`	PyPI	`crates/python/`	PyO3 extension (`ukrainian_stress`) — same API as WASM
`luscinia`	PyPI	`packages/luscinia/`	Python ONNX Luscinia predictor (OOV fallback)

Highlights


Property	Value
Model	`luscinia-lgbm-str-ua-univ-v1`
Task	Ukrainian word stress prediction (multiclass, vowel-ordinal)
Accuracy	99.44 % (sanity sample) · 192 / 197 hand-checked (≈ 97.5 %)
Syllable coverage	Words of 2 to 10+ syllables, single universal model
Features	132 linguistic / hash features
Runtimes	lightgbm (Python) · ONNX (browser via `onnxruntime-web`)
Training data	2.875 M word forms
License	AGPL-3.0
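The "vowel-ordinal" task means the predicted class is the 0-based position of the stressed vowel among the word's vowels. The sketch below shows how such a label could be turned into a stress-marked word. The helper names `vowel_positions` and `apply_stress` are illustrative and not part of any published package; the vowel set is the one used in the raw prediction example further down.

```python
VOWELS = set("аеєиіїоуюя")   # lowercase Ukrainian vowel letters
COMBINING_ACUTE = "\u0301"   # combining acute accent used as the stress mark


def vowel_positions(word: str) -> list[int]:
    """Return the character offsets of every vowel in the word."""
    return [i for i, ch in enumerate(word.lower()) if ch in VOWELS]


def apply_stress(word: str, vowel_ordinal: int) -> str:
    """Insert the stress mark after the vowel with the given 0-based ordinal."""
    positions = vowel_positions(word)
    if not 0 <= vowel_ordinal < len(positions):
        raise ValueError(f"{word!r} has no vowel with ordinal {vowel_ordinal}")
    cut = positions[vowel_ordinal] + 1
    return word[:cut] + COMBINING_ACUTE + word[cut:]


print(vowel_positions("університет"))   # → [0, 2, 4, 7, 9]
print(apply_stress("університет", 4))   # → університе́т
```

Because every vowel forms one syllable nucleus, the vowel ordinal is also the 0-based index of the stressed syllable.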

Installation

JavaScript / TypeScript

# Trie-based lookup (browser + Node, no WASM)
npm install ua-word-stress

# Full engine — IPA, morphology, batch API (WASM, bundler required)
npm install ua-word-stress-wasm

Python

Published packages:

pip install ua-stress-engine
pip install luscinia

`ua-stress-engine` is a compiled Rust extension (PyO3 + maturin). The core install has no runtime Python dependencies; the optional extras add the following:

Extra	Packages	Purpose
(core)	none	Dictionary lookup + IPA via Rust extension
`ml`	`lightgbm>=4.0`, `numpy>=1.24`	Luscinia LightGBM resolver
`nlp`	`spacy>=3.7`	spaCy tokenization pipeline
`full`	all of the above	Everything
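To see which extras are usable in the current environment, one option is to probe for the modules each extra pulls in. This is a minimal sketch that assumes the import names match the package names in the table (`lightgbm`, `numpy`, `spacy`); it does not check the minimum versions.

```python
from importlib.util import find_spec

# Extra name -> modules it is expected to provide (taken from the table above).
EXTRAS = {
    "ml": ["lightgbm", "numpy"],
    "nlp": ["spacy"],
}


def available_extras() -> dict[str, bool]:
    """Report, per extra, whether all of its modules can be imported."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in EXTRAS.items()
    }


status = available_extras()
status["full"] = all(status.values())  # "full" is simply every extra together
print(status)
```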

For local development (requires Rust toolchain and maturin):

pip install maturin
pip install -e '.[full]'   # builds the Rust extension in-place

Quick start — JavaScript / TypeScript

Trie-based lookup (`ua-word-stress`)

Pure TypeScript, no WASM, works in any environment:

import { UaStressTrie } from "ua-word-stress";

const trie = new UaStressTrie();
trie.mark("університет"); // → 'університе́т'
trie.lookup("замок"); // → 0 (first syllable: за́мок, "castle")
trie.markBatch(["мама", "тато"]); // → ['ма́ма', 'та́то']
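For readers unfamiliar with the data structure, the following Python sketch shows the general idea of a trie that maps a word to its stressed syllable indexes. It is purely illustrative: the class `StressTrie` is hypothetical, and the published package uses its own compact binary format rather than nested dictionaries.

```python
_END = ""  # terminal key; never collides with a character taken from a word


class StressTrie:
    """Toy character trie storing 0-based stressed syllable indexes per word."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, word: str, syllable_index: int) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        readings = node.setdefault(_END, [])
        if syllable_index not in readings:
            readings.append(syllable_index)

    def lookup(self, word: str) -> list[int]:
        """Return all known stress positions, or an empty list if unknown."""
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return []
        return list(node.get(_END, []))


trie = StressTrie()
trie.insert("замок", 0)   # за́мок
trie.insert("замок", 1)   # замо́к (heteronym)
trie.insert("мама", 0)

print(trie.lookup("замок"))  # → [0, 1]
print(trie.lookup("xyz"))    # → []
```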

Full WASM engine (`ua-word-stress-wasm`)

Rust/WASM with IPA transcription, morphology, and batch API. No init() call needed — the dictionary loads automatically at module import (bundler target):

import { mark, lookup, stressIndex, lookupMany, markMany, transcribe } from "ua-word-stress-wasm";

mark("університет"); // → 'університе́т'
stressIndex("мама"); // → 0  (0-based syllable index)

const r = lookup("замок");
r.readings[0].stressedForm; // → 'за́мок'
r.readings[0].ipa; // → 'zɑmɔk'
r.readings[0].syllableIndex; // → 0
r.readings[1].stressedForm; // → 'замо́к'  (heteronym)

transcribe("слово", 0); // → { ipa: 'slɔwɔ', ipaSyllables: ['ˈslɔ', 'wɔ'], … }

const batch = lookupMany(["мама", "замок", "xyz"]);
const marked = markMany(["мама", "тато", "xyz"]);

See the WASM package README for the full API reference.

Quick start — Python

Dictionary + IPA (Rust extension)

import ukrainian_stress

ukrainian_stress.mark('університет')   # → 'університе́т'

r = ukrainian_stress.lookup('замок')
r['readings'][0]['stressed_form']      # → 'за́мок'
r['readings'][0]['ipa']               # → 'zɑmɔk'
r['readings'][0]['syllable_index']    # → 0

batch = ukrainian_stress.lookup_many(['мама', 'університет', 'xyz'])
marks = ukrainian_stress.mark_many(['мама', 'тато', 'xyz'])
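A lookup result can carry more than one reading, as with the heteronym замок. The sketch below works on a hand-built dictionary shaped like the result shown above, so it runs without the extension installed. The helper names are hypothetical, the second reading's `syllable_index` is inferred from its stressed form, and treating an unknown word as a result with no readings is an assumption, not documented behaviour.

```python
from typing import Any


def stressed_forms(result: dict[str, Any]) -> list[str]:
    """Collect the stressed form of every reading in a lookup result."""
    return [reading["stressed_form"] for reading in result.get("readings", [])]


def is_heteronym(result: dict[str, Any]) -> bool:
    """True when the readings disagree on where the stress falls."""
    return len(set(stressed_forms(result))) > 1


# Stand-in for ukrainian_stress.lookup('замок'); only the keys used in the
# quick start are reproduced here.
sample = {
    "readings": [
        {"stressed_form": "за\u0301мок", "ipa": "zɑmɔk", "syllable_index": 0},
        {"stressed_form": "замо\u0301к", "syllable_index": 1},
    ]
}

print(stressed_forms(sample))            # → ['за́мок', 'замо́к']
print(is_heteronym(sample))              # → True
print(stressed_forms({"readings": []}))  # → []
```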

Full pipeline (Rust dict + LightGBM ML fallback)

Requires pip install -e '.[full]':

from src.stress_resolver.resolver_factory import create_pipeline_kwargs
from src.stress_resolver.pipeline import UkrainianPipeline

pipeline = UkrainianPipeline(**create_pipeline_kwargs())

doc = pipeline.process("Мама варила борщ на кухні.")
for sentence in doc.sentences:
    for token in sentence.tokens:
        print(f"{token.text:15} {token.stress_pattern}")
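The pipeline's "dictionary first, ML fallback" behaviour can be pictured as a chain of resolvers tried in order. The sketch below is not the project's resolver code: `resolve`, `dict_resolver`, and `toy_ml_resolver` are hypothetical names, the dictionary is a one-entry stand-in, and the "ML" step just stresses the first vowel so the example stays self-contained.

```python
from typing import Callable, Optional

Resolver = Callable[[str], Optional[str]]

TOY_DICT = {"мама": "ма\u0301ма"}  # stand-in for the embedded dictionary
VOWELS = "аеєиіїоуюя"


def dict_resolver(word: str) -> Optional[str]:
    """Return the stressed form if the word is in the dictionary."""
    return TOY_DICT.get(word.lower())


def toy_ml_resolver(word: str) -> Optional[str]:
    """Placeholder for the model: stress the first vowel, if there is one."""
    for i, ch in enumerate(word):
        if ch.lower() in VOWELS:
            return word[: i + 1] + "\u0301" + word[i + 1 :]
    return None


def resolve(word: str, resolvers: list[Resolver]) -> Optional[str]:
    """Try each resolver in order and return the first non-None answer."""
    for resolver in resolvers:
        marked = resolver(word)
        if marked is not None:
            return marked
    return None


chain = [dict_resolver, toy_ml_resolver]
print(resolve("мама", chain))  # → ма́ма  (dictionary hit)
print(resolve("тато", chain))  # → та́то  (fallback)
print(resolve("xyz", chain))   # → None  (no vowel, nothing to mark)
```

Keeping each resolver as a plain callable makes the order easy to change and lets the fallback be swapped out in tests.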

Raw Luscinia prediction (LightGBM)

import lightgbm as lgb
import numpy as np
from src.stress_prediction.lightgbm.services.feature_service_universal import (
    build_features_universal,
)

MODEL_PATH = (
    "src/stress_prediction/lightgbm/artifacts/"
    "luscinia-lgbm-str-ua-univ-v1/P3_0017_FINAL_FULLDATA/P3_0017_full.lgb"
)
bst = lgb.Booster(model_file=MODEL_PATH)

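VOWELS = set("аеєиіїоуюя")  # lowercase Ukrainian vowel letters

def predict_stress(word: str, pos: str = "NOUN") -> str:
    """Return `word` with a combining acute accent after the predicted stressed vowel."""
    # The feature vector follows the dict's insertion order, which is assumed
    # to match the feature order used in training.
    feat = build_features_universal(word, pos)
    X = np.array(list(feat.values()), dtype=np.float32).reshape(1, -1)
    # Multiclass output has shape (1, n_classes); the arg-max is the vowel ordinal.
    vowel_idx = int(bst.predict(X).argmax(axis=1)[0])
    vpos = [i for i, c in enumerate(word.lower()) if c in VOWELS]
    if vowel_idx >= len(vpos):
        return word  # no vowel at the predicted ordinal: leave the word unmarked
    cp = vpos[vowel_idx]
    return word[: cp + 1] + "\u0301" + word[cp + 1 :]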

print(predict_stress("університет", "NOUN"))  # → університе́т
print(predict_stress("читати",      "VERB"))  # → чита́ти

POS tags: use Universal Dependencies tags (NOUN, VERB, ADJ, ADV, PRON, DET, NUM, PART, CCONJ, X). Pass "X" when the POS is unknown.
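When POS tags come from an external tagger, it can help to normalise them before calling the predictor. The sketch below maps anything outside the listed tag set to "X". The helper `normalize_pos` is hypothetical, and sending unlisted tags such as PROPN to "X" is an assumption rather than documented behaviour.

```python
from typing import Optional

# Tag set listed in the note above.
SUPPORTED_POS = frozenset(
    {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "NUM", "PART", "CCONJ", "X"}
)


def normalize_pos(tag: Optional[str]) -> str:
    """Return a supported tag, falling back to "X" for unknown or missing input."""
    if tag is None:
        return "X"
    cleaned = tag.strip().upper()
    return cleaned if cleaned in SUPPORTED_POS else "X"


print(normalize_pos("noun"))   # → NOUN
print(normalize_pos("PROPN"))  # → X
print(normalize_pos(None))     # → X
```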

Quick start — browser (ONNX)

The 30 MB gzip-compressed ONNX artifact (P3_0017_full.onnx.gz) is stored in Git LFS. Serve it with Content-Encoding: gzip so browsers decompress it transparently.

import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create(
  "/models/P3_0017_full.onnx.gz",
);

// Build a Float32Array of 132 features (see manifest.json for order)
const tensor = new ort.Tensor("float32", featureArray, [1, 132]);
const results = await session.run({ float_input: tensor });
const vowelIndex = Number(results["label"].data[0]);

See src/stress_prediction/lightgbm/documentation/LUSCINIA_LGBM_V1_DEPLOYMENT.md for the full deployment guide (nginx / Express serving, batch inference, feature order).
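For local experiments, the gzip header requirement can also be met with Python's built-in HTTP server instead of nginx or Express. This is a development-only sketch, not taken from the deployment guide: `extra_headers`, `GzipModelHandler`, and `serve` are hypothetical names, and the handler serves files from the current working directory.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def extra_headers(path: str) -> dict[str, str]:
    """Headers to add for a request path; only pre-compressed ONNX files get one."""
    clean = path.split("?", 1)[0]  # ignore any query string
    if clean.endswith(".onnx.gz"):
        return {"Content-Encoding": "gzip"}
    return {}


class GzipModelHandler(SimpleHTTPRequestHandler):
    """Static file handler that labels .onnx.gz responses as gzip-encoded."""

    def end_headers(self) -> None:
        # self.path may be unset if the request line could not be parsed.
        for name, value in extra_headers(getattr(self, "path", "")).items():
            self.send_header(name, value)
        super().end_headers()


def serve(port: int = 8000) -> None:
    """Serve the current directory on localhost until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), GzipModelHandler).serve_forever()


print(extra_headers("/models/P3_0017_full.onnx.gz"))  # → {'Content-Encoding': 'gzip'}
print(extra_headers("/index.html"))                   # → {}
```

Calling `serve()` from the directory that contains `models/` makes the artifact available at the path used in the snippet above. The header is added by path, so error responses for a missing `.onnx.gz` file would carry it too, which is one reason to keep this to local testing.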

API status

The canonical API contract is documented in documentation/API_DESIGN.md. The current mapping of packages to API surface is:

ua-word-stress (npm): trie lookup API (lookup, lookupFull, mark, batch methods)
ua-word-stress-wasm (npm): full Rust API (lookup, lookupBatch/lookupMany, mark, markBatch/markMany, stressIndex, transcribe)
ua-stress-engine (PyPI): Python binding module ukrainian_stress incl. batch (lookup_many, mark_many)
ua-stress-ml (npm) and luscinia (PyPI): ML OOV fallback predictors (132-feature Luscinia ONNX model)

Modules

Module	Path	What it does
`ua-word-stress` (npm)	`packages/ua-stress-web/`	Zero-dependency TypeScript trie — `mark`, `lookup`, batch API
`ua-word-stress-wasm` (npm)	`crates/wasm/`	Rust/WASM — IPA, morphology, batch API, no init() required
`ua-stress-ml` (npm)	`packages/ua-stress-ml/`	Browser/worker ONNX Luscinia predictor for OOV stress
`ukrainian_stress` (Python)	`crates/python/`	PyO3 extension — same API as WASM, for Python
`luscinia` (Python)	`packages/luscinia/`	Python ONNX Luscinia predictor package (PyPI)
Rust core	`crates/core/`	Dictionary embed, phonetic pipeline, syllabifier (shared by WASM + Python)
ML resolver (LightGBM)	`src/stress_prediction/lightgbm/`	Luscinia model — 99.44 % accuracy, 132 features, ONNX export
NLP pipeline	`src/stress_resolver/`	spaCy tokenization → Rust dict lookup → ML fallback
Data management	`src/data_management/`	Source parsers, master SQLite DB builder, binary trie exporter

Project structure

ua-stress-engine/
├── crates/
│   ├── core/                      # Rust core library (dict embed, phonetics, syllabifier)
│   ├── wasm/                      # ua-word-stress-wasm (wasm-pack, bundler target)
│   │   ├── src/lib.rs
│   │   └── pkg/                   # built npm package (gitignored except README + package.json)
│   ├── python/                    # ukrainian_stress PyO3 extension (maturin)
│   │   └── src/lib.rs
│   └── builder/                   # CLI tool to compile the embedded binary dictionary
├── packages/
│   └── ua-stress-web/             # ua-word-stress npm package (TypeScript, zero deps)
│       ├── src/                   # UaStressTrie.ts, types.ts, utils.ts
│       ├── tests/
│       └── package.json
├── src/
│   ├── stress_resolver/           # Python NLP pipeline + resolver chain
│   │   ├── pipeline.py            # UkrainianPipeline
│   │   ├── stress_resolver.py     # Rust-extension-based resolver
│   │   ├── ml_stress_resolver.py  # LightGBM-based resolver
│   │   └── resolver_factory.py    # Auto-configure resolver chain
│   ├── nlp/
│   │   ├── stress_service/        # Stress lookup wrapper
│   │   ├── phonetic/              # IPA transcription (Python side)
│   │   └── tokenization_service/  # spaCy tokenizer wrapper
│   ├── stress_prediction/
│   │   └── lightgbm/              # Luscinia model, training scripts, services, artifacts
│   └── data_management/
│       ├── sources/               # Source parsers (kaikki, trie, txt, variative)
│       ├── transform/             # Master DB builder (SQLite)
│       └── export/
│           └── web_stress_db/     # Binary .ctrie builder → packages/ua-stress-web/data/
├── build_master_db.py             # Build master SQLite from all sources
├── build_web_stress_db.py         # Build + export binary trie
├── pyproject.toml                 # maturin build config (points to crates/python/)
└── tests/
    └── src/
        ├── stress_resolver/       # Pipeline + resolver tests
        ├── stress_prediction/     # LightGBM model tests
        ├── data_management/       # Source parser + DB tests
        └── nlp/                   # Stress service tests

Data sources

The embedded dictionary is compiled from five open Ukrainian stress resources:

Source	License	Entries	Notes
kaikki.org Ukrainian — Wiktionary extract	CC BY-SA 4.0	~2 M inflected forms	POS + full morphology
lang-uk/ukrainian-word-stress — marisa-trie	MIT	~2.9 M word forms	compact trie with morph tags
lang-uk/ukrainian-word-stress-dictionary — text dict	see upstream	~2.9 M word forms	based on ULIF / NASU corpora
bakustarver/ukr-dictionaries-list-opensource — SUM11 DiktJson (`ukr-ukr_SUM-11_or_1`)	public domain (original SUM-11), digitised JSON	~127 K lemmas	classic 11-volume explanatory dictionary
`ua_variative_stressed_words` — curated free-variant list	original work	~150 lemmas	marks freely variable stress

All five sources are merged into a single master SQLite database (~680 MB), which is then compiled into the embedded binary (ua_stress.bin.bz2) shipped inside the Rust crates.
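The merge step can be pictured with a small in-memory SQLite example. The schema below (one `stress` table with `form`, `stress_index`, and `source` columns) is hypothetical, as are the sample rows; the real master database layout is not described here. The point is only to show how entries from several sources could be de-duplicated and how heteronyms surface as multiple stress positions for one form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE stress (
        form TEXT NOT NULL,
        stress_index INTEGER NOT NULL,   -- 0-based stressed syllable
        source TEXT NOT NULL,
        UNIQUE (form, stress_index, source)
    )
    """
)

rows = [
    ("замок", 0, "kaikki"),
    ("замок", 1, "kaikki"),
    ("замок", 0, "trie"),
    ("мама", 0, "trie"),
    ("мама", 0, "txt"),
    ("мама", 0, "txt"),  # exact duplicate, dropped by INSERT OR IGNORE
]
conn.executemany("INSERT OR IGNORE INTO stress VALUES (?, ?, ?)", rows)


def readings(form: str) -> list[tuple[int, int]]:
    """Return (stress_index, number of agreeing sources) pairs for a form."""
    cursor = conn.execute(
        "SELECT stress_index, COUNT(*) FROM stress "
        "WHERE form = ? GROUP BY stress_index ORDER BY stress_index",
        (form,),
    )
    return cursor.fetchall()


print(readings("замок"))  # → [(0, 2), (1, 1)]
print(readings("мама"))   # → [(0, 2)]
```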

Running tests

# Python tests (requires ml + nlp extras installed)
python -m pytest tests/ -q

# TypeScript trie package
cd packages/ua-stress-web && pnpm test

# WASM package
cd crates/wasm && wasm-pack test --node

Large files (Git LFS)

The following binary artifacts are stored in Git LFS:

File	Size
`P3_0017_full.lgb`	259 MB
`P3_0017_full.onnx`	185 MB
`P3_0017_full.onnx.gz`	30 MB
`stress.lmdb`	varies

License

AGPL-3.0

Release history

Version	Released
1.0.2 (this version)	Jun 1, 2026
1.0.1	Jun 1, 2026

Download files

File	Type	Size	Uploaded	Tags
`ua_stress_engine-1.0.2.tar.gz`	Source distribution	15.5 MB	Jun 1, 2026	Source
`ua_stress_engine-1.0.2-cp313-cp313-win_amd64.whl`	Built distribution	31.1 MB	Jun 1, 2026	CPython 3.13, Windows x86-64

File details

Details for the file ua_stress_engine-1.0.2.tar.gz.

File metadata

Download URL: ua_stress_engine-1.0.2.tar.gz
Upload date: Jun 1, 2026
Size: 15.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.13.1

File hashes

Hashes for ua_stress_engine-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`c1a16f359878fb07b45a6a9c592dfa5d0e1e4d7e166c3334317ccfb10725d191`
MD5	`bec6962790d570b20c4cafce07dce96a`
BLAKE2b-256	`6150f9197e7fc23b072d9dbc91aba09ac03686d75638ae5b6265c60a26cad61a`


File details

Details for the file ua_stress_engine-1.0.2-cp313-cp313-win_amd64.whl.

File metadata

Download URL: ua_stress_engine-1.0.2-cp313-cp313-win_amd64.whl
Upload date: Jun 1, 2026
Size: 31.1 MB
Tags: CPython 3.13, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.13.1

File hashes

Hashes for ua_stress_engine-1.0.2-cp313-cp313-win_amd64.whl
Algorithm	Hash digest
SHA256	`ff258910b235e74c3a2ff0ce41a1e2250255067ea8c596ca9816eb086f926ab8`
MD5	`e82d2c8f818af3831a61209cb2d08ee9`
BLAKE2b-256	`d6445dc741cb71852d0db79acab38d9d186d85e009b7a096ed31bc54b2647c39`

