Yakut language text normalizer using Word2Vec embeddings
Yakit - Yakut Language Text Normalizer
A Python library for normalizing Yakut (Sakha) language text using Word2Vec embeddings.
Installation
pip install yakit
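To enable automatic model downloading from the Hugging Face Hub, install the download extra (the quotes keep shells such as zsh from treating the brackets as a glob pattern):
pip install "yakit[download]"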
Quick Start
from yakit.normalizers import Word2VecNormalizer
# Initialize normalizer (auto-downloads model on first use)
normalizer = Word2VecNormalizer()
# Normalize text
text = "Мин сахалыы билэбин"
normalized = normalizer.normalize(text)
print(normalized)
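When many strings need normalizing, one normalizer instance can be created up front and reused. The helper below is a minimal sketch of that pattern. `normalize_all`, the `Normalizer` protocol, and `UpperStub` are hypothetical names introduced for illustration and are not part of yakit; the only yakit API assumed is the `normalize(text)` method shown above.

```python
from typing import Iterable, Protocol


class Normalizer(Protocol):
    """Anything exposing normalize(text) -> str, e.g. a Word2VecNormalizer."""

    def normalize(self, text: str) -> str: ...


def normalize_all(normalizer: Normalizer, texts: Iterable[str]) -> list[str]:
    """Normalize each string, passing blank strings through untouched."""
    return [normalizer.normalize(t) if t.strip() else t for t in texts]


class UpperStub:
    """Stand-in (not yakit) so the helper can be tried without a model."""

    def normalize(self, text: str) -> str:
        return text.upper()


if __name__ == "__main__":
    # With yakit installed, pass Word2VecNormalizer() instead of the stub.
    print(normalize_all(UpperStub(), ["abc", "", "def"]))
```

Because the helper only relies on a `normalize` method, the stub can be swapped for a real normalizer without other changes.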
Custom Model Path
If you have your own Word2Vec model:
from yakit.normalizers import Word2VecNormalizer
normalizer = Word2VecNormalizer(
    word2vec_path="/path/to/your/model.bin",
    training_data_path="/path/to/train_pairs.txt",  # optional
)
Command Line Interface
# Normalize text directly
yakit normalize "Мин сахалыы билэбин"
# Normalize a file
yakit normalize -i input.txt -o output.txt
# Download models manually
yakit download
# Show cache info
yakit info
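The file-to-file command can be approximated from Python when the CLI is not convenient. This is a sketch only: `normalize_file` is a hypothetical helper, and whether `yakit normalize -i ... -o ...` processes its input line by line is an assumption, not something documented here. The `normalize` argument is any callable mapping a string to a string, such as the `normalize` method of a `Word2VecNormalizer` instance.

```python
from pathlib import Path
from typing import Callable


def normalize_file(normalize: Callable[[str], str], src: Path, dst: Path) -> int:
    """Write a normalized copy of src to dst line by line; return the line count."""
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            # Strip the newline before normalizing, then add it back.
            # A final line without a newline therefore gains one in the output.
            fout.write(normalize(line.rstrip("\n")) + "\n")
            count += 1
    return count
```

Reading and writing with an explicit UTF-8 encoding avoids platform defaults mangling Cyrillic text.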
What is Normalization?
Normalization converts text written without the special Yakut characters into text with the proper Yakut characters. It applies these substitutions:
| Input | Output |
|---|---|
| h | һ |
| г | ҕ (in certain positions) |
| н | ҥ (in certain positions) |
| о | ө (in certain positions) |
| у | ү (in certain positions) |
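The "in certain positions" qualifier is what makes the task non-trivial: each ambiguous letter may or may not need replacing, so one plain word can correspond to several spellings. The sketch below enumerates those spellings for a lowercase word. It is an illustration of the search space only, not yakit's algorithm; `ALWAYS`, `AMBIGUOUS`, and `candidates` are hypothetical names, and treating `h` as always replaced is an assumption read off the table.

```python
from itertools import product

# Latin "h" has no "in certain positions" note, so it is always replaced.
ALWAYS = {"h": "\u04bb"}  # U+04BB is the Cyrillic letter һ

# Each plain letter may stay as it is or become the Yakut letter.
AMBIGUOUS = {"г": "гҕ", "н": "нҥ", "о": "оө", "у": "уү"}


def candidates(word: str) -> list[str]:
    """Enumerate every spelling reachable through the table's substitutions."""
    choices = []
    for ch in word:
        if ch in ALWAYS:
            choices.append(ALWAYS[ch])  # single forced replacement
        else:
            choices.append(AMBIGUOUS.get(ch, ch))  # one or two options
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for spelling in candidates("уол"):
        print(spelling)
```

A word with k ambiguous letters has 2^k candidate spellings, which is why some model of context or vocabulary is needed to pick the right one.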
Performance
With optimized hyperparameters:
- Character Accuracy: 97.15%
- Word Accuracy: 92.09%
- Exact Match: 61.77%
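The three figures are not defined further here. One plausible reading, sketched below, is that character and word accuracy are position-wise agreement rates and exact match is the share of fully correct sentences. These definitions and the function names are assumptions and may differ from how the reported numbers were computed.

```python
from typing import Sequence


def char_accuracy(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Share of character positions where prediction and reference agree."""
    correct = total = 0
    for pred, ref in zip(preds, refs):
        total += max(len(pred), len(ref))  # length mismatches count as errors
        correct += sum(p == r for p, r in zip(pred, ref))
    return correct / total if total else 0.0


def word_accuracy(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Share of whitespace-separated words that match position by position."""
    correct = total = 0
    for pred, ref in zip(preds, refs):
        pred_words, ref_words = pred.split(), ref.split()
        total += max(len(pred_words), len(ref_words))
        correct += sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / total if total else 0.0


def exact_match(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Share of sentences reproduced without any error."""
    pairs = list(zip(preds, refs))
    return sum(p == r for p, r in pairs) / len(pairs) if pairs else 0.0
```

Under these definitions a single wrong word makes a whole sentence count as a miss, which would be consistent with exact match being well below word accuracy.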
Requirements
- Python 3.10–3.13 (3.14 not yet supported: gensim has no compatible build)
- gensim
- numpy
- tqdm
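A script that depends on yakit can check the interpreter version before importing it. The snippet is a small sketch using only the standard library; `is_supported` and the two constants are hypothetical names, and the bounds simply mirror the range stated above.

```python
import sys

SUPPORTED_MIN = (3, 10)           # lowest supported version, inclusive
SUPPORTED_MAX_EXCLUSIVE = (3, 14)  # 3.14 is not yet supported


def is_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) falls inside the supported range."""
    return SUPPORTED_MIN <= version < SUPPORTED_MAX_EXCLUSIVE


if __name__ == "__main__":
    if not is_supported():
        raise SystemExit("This script expects Python 3.10 to 3.13")
```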
License
MIT
File details
Details for the file yakit-0.1.1.tar.gz.
File metadata
- Download URL: yakit-0.1.1.tar.gz
- Upload date:
- Size: 15.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 899ef08b17433b3b75a5fa92ce0e9c9c25490fdddd3295a173eb6423a6bd69d5 |
| MD5 | 8ee91338ef1296b126cf3f8a48c45e00 |
| BLAKE2b-256 | 382bcb3b5486ef0814a36acb5b01544d998e2f89811a534ad8d284d6ae06c582 |
File details
Details for the file yakit-0.1.1-py3-none-any.whl.
File metadata
- Download URL: yakit-0.1.1-py3-none-any.whl
- Upload date:
- Size: 13.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f3244e8d7dd001914416731ad0fe9aecc0190ebea9854205697934e3eb60ea9a |
| MD5 | ee6ef0dc245ed122f32db8e995ee74a4 |
| BLAKE2b-256 | 852de21a516e40727dddc083818c29685471d8f56086a3d64484c8bd201dbfac |