Yakut language text normalizer using Word2Vec embeddings

Yakit - Yakut Language Text Normalizer

A Python library for normalizing Yakut (Sakha) language text using Word2Vec embeddings.

Installation

pip install yakit

For automatic model downloading from Hugging Face Hub:

pip install "yakit[download]"

Quick Start

from yakit.normalizers import Word2VecNormalizer

# Initialize normalizer (auto-downloads model on first use)
normalizer = Word2VecNormalizer()

# Normalize text
text = "Мин сахалыы билэбин"
normalized = normalizer.normalize(text)
print(normalized)

Custom Model Path

If you have your own Word2Vec model:

from yakit.normalizers import Word2VecNormalizer

normalizer = Word2VecNormalizer(
    word2vec_path="/path/to/your/model.bin",
    training_data_path="/path/to/train_pairs.txt"  # optional
)

Command Line Interface

# Normalize text directly
yakit normalize "Мин сахалыы билэбин"

# Normalize a file
yakit normalize -i input.txt -o output.txt

# Download models manually
yakit download

# Show cache info
yakit info
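
The file-mode command above can be approximated from Python. The following is a minimal sketch, not the CLI's actual implementation: `normalize_file` is a hypothetical helper, and reading the file line by line as UTF-8 is an assumption. It accepts any callable that maps a string to a string, so it can be tried without a downloaded model.

```python
from pathlib import Path
from typing import Callable


def normalize_file(src: str, dst: str, normalize: Callable[[str], str]) -> int:
    """Apply `normalize` to every line of `src` and write the result to `dst`.

    Hypothetical helper: reads and writes UTF-8, leaves blank lines untouched,
    and returns the number of lines written.
    """
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    out = [normalize(line) if line.strip() else line for line in lines]
    Path(dst).write_text("\n".join(out) + ("\n" if out else ""), encoding="utf-8")
    return len(out)
```

With the library installed, passing `Word2VecNormalizer().normalize` as the third argument should give a rough equivalent of `yakit normalize -i input.txt -o output.txt`.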

What is Normalization?

Normalization converts text written without the special Yakut characters into text with the proper Yakut characters:

Input   Output
h       һ
г       ҕ (in certain positions)
н       ҥ (in certain positions)
о       ө (in certain positions)
у       ү (in certain positions)
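
The table implies that restoring Yakut letters is ambiguous: г, н, о and у sometimes stay as they are, so one input word can correspond to several spellings. The sketch below enumerates those candidate spellings and also shows the reverse direction (stripping the special letters). It is illustrative only: `candidates` and `strip_yakut` are hypothetical names, treating `h` as always replaced is an assumption based on the table listing no condition for it, only lowercase letters are handled, and the way yakit scores candidates with Word2Vec is not shown.

```python
from itertools import product

# Plain letter -> special Yakut letter, taken from the table above.
ALWAYS = {"h": "һ"}  # assumption: no positional condition is listed for h
SOMETIMES = {"г": "ҕ", "н": "ҥ", "о": "ө", "у": "ү"}

# Reverse mapping, used to strip the special letters again.
STRIP = {v: k for k, v in {**ALWAYS, **SOMETIMES}.items()}


def candidates(word: str) -> list[str]:
    """Return every spelling the plain `word` could stand for (lowercase only)."""
    options = []
    for ch in word:
        if ch in ALWAYS:
            options.append([ALWAYS[ch]])
        elif ch in SOMETIMES:
            options.append([ch, SOMETIMES[ch]])  # keep the letter or replace it
        else:
            options.append([ch])
    return ["".join(combo) for combo in product(*options)]


def strip_yakut(text: str) -> str:
    """Replace the special Yakut letters with their plain counterparts."""
    return "".join(STRIP.get(ch, ch) for ch in text)


print(candidates("мин"))       # ['мин', 'миҥ']
print(len(candidates("уон")))  # 8: three ambiguous letters, two choices each
print(strip_yakut("миҥ"))      # мин
```

The number of candidates doubles with each ambiguous letter, so some form of context-aware scoring is needed to pick one of them.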

Performance

With optimized hyperparameters:

  • Character Accuracy: 97.15%
  • Word Accuracy: 92.09%
  • Exact Match: 61.77%
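
The three metrics are not defined here. One plausible reading, sketched below as an assumption, is: character accuracy is the share of character positions that match the reference, word accuracy is the share of whitespace-separated tokens that match, and exact match is the share of whole sentences that match. `evaluate` is a hypothetical helper, not part of yakit.

```python
def evaluate(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Compute character accuracy, word accuracy and exact match (assumed definitions)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")

    char_hits = char_total = 0
    word_hits = word_total = 0
    exact_hits = 0

    for pred, ref in zip(predictions, references):
        # Characters: compare position by position; a length mismatch counts as errors.
        char_hits += sum(p == r for p, r in zip(pred, ref))
        char_total += max(len(pred), len(ref))

        # Words: compare whitespace-separated tokens position by position.
        pred_words, ref_words = pred.split(), ref.split()
        word_hits += sum(p == r for p, r in zip(pred_words, ref_words))
        word_total += max(len(pred_words), len(ref_words))

        # Sentences: the whole string must match.
        exact_hits += pred == ref

    return {
        "char_accuracy": char_hits / char_total if char_total else 0.0,
        "word_accuracy": word_hits / word_total if word_total else 0.0,
        "exact_match": exact_hits / len(references) if references else 0.0,
    }


scores = evaluate(["мин сахалыы билэбин", "уон"], ["мин сахалыы билэбин", "уоҥ"])
print(scores)
```

Under these definitions exact match is the strictest of the three, since a single wrong character fails the whole sentence. That would be consistent with it being the lowest reported figure.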

Requirements

  • Python 3.10–3.13 (3.14 not yet supported: gensim has no compatible build)
  • gensim
  • numpy
  • tqdm
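
The supported interpreter range can be checked before installing or importing the package. A small sketch, where `is_supported_python` is a hypothetical helper and the bounds are copied from the list above:

```python
import sys

MIN_VERSION = (3, 10)
MAX_VERSION = (3, 13)


def is_supported_python(version=None) -> bool:
    """Return True if the major.minor version falls inside the supported range.

    `version` is any sequence starting with (major, minor); it defaults to the
    running interpreter.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if not is_supported_python():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is outside 3.10-3.13")
```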

License

MIT
