

turkic_transliterate

Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus helper utilities for tokenizer training and Russian-token filtering.

Quick install

  1. Install Miniconda or Anaconda (recommended).
  2. Clone the repo and create the environment: conda env create -f env.yml
  3. Activate the environment: conda activate turkic
  4. Run the verification tests: python -m pytest (all tests should pass)

Python compatibility

  • Works on CPython 3.10 and 3.11.
  • CPython 3.12+ is supported on every platform except Windows, where support is waiting on official PyICU wheels; see “Windows & PyICU” below.

Package names

  • Runtime import path: turkic_translit
  • Distributable name on PyPI: turkic_transliterate
  • Command-line entry point: turkic-translit
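Because the import path and the PyPI distribution name differ, it can help to confirm that both resolve in the active environment. The following is a minimal sketch using only the standard library; describe_install is a hypothetical helper for illustration and is not part of this package.

```python
from importlib import metadata, util


def describe_install(dist_name: str, import_name: str) -> dict:
    """Report whether a distribution and its import package are visible.

    Hypothetical helper for illustration; not shipped with the package.
    """
    try:
        # Looks up installed metadata by the distribution (PyPI) name.
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    # find_spec returns None when a top-level module cannot be located.
    importable = util.find_spec(import_name) is not None
    return {
        "distribution": dist_name,
        "version": version,
        "import_name": import_name,
        "importable": importable,
    }


if __name__ == "__main__":
    # Distribution name and import path as listed above.
    print(describe_install("turkic_transliterate", "turkic_translit"))
```

A version of None together with importable False usually means the package is not installed in the interpreter you are running.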

Installing with pip

From the root of the cloned repository:

pip install -e .[dev,ui]    # add ,winlid on Windows if you need fasttext-wheel

Optional extras

  • dev → black, ruff, pytest
  • ui → gradio web demo
  • winlid (Windows only) → fasttext-wheel for language ID

Windows & PyICU

Important: Due to PyPI rules, the correct PyICU wheel for Windows cannot be installed automatically during pip install. After installing this package with pip, Windows users must run the helper script to install the appropriate PyICU wheel:

python scripts/get_pyicu_wheel.py

This script will download and install the correct PyICU wheel from Christoph Gohlke’s repository based on your Python version. See the script for details.
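To check whether the helper script is still needed, you can test whether PyICU is importable. This is a minimal sketch, not part of the package: pyicu_hint is a hypothetical function, and the only facts it relies on are that PyICU is imported as icu and that the helper lives at scripts/get_pyicu_wheel.py.

```python
import importlib.util
import sys
from typing import Optional


def pyicu_hint(platform: Optional[str] = None,
               icu_available: Optional[bool] = None) -> str:
    """Return a short hint about the PyICU setup (hypothetical helper)."""
    if platform is None:
        platform = sys.platform
    if icu_available is None:
        # PyICU is imported under the module name "icu".
        icu_available = importlib.util.find_spec("icu") is not None
    if icu_available:
        return "PyICU is importable; nothing to do."
    if platform.startswith("win"):
        # On Windows the wheel has to be installed by the helper script.
        return "PyICU is missing; run: python scripts/get_pyicu_wheel.py"
    return "PyICU is missing; check your environment setup."


if __name__ == "__main__":
    print(pyicu_hint())
```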

Command-line usage

turkic-translit --lang kk --in text.txt --out_latin kk_lat.txt --ipa --out_ipa kk_ipa.txt --arabic --log-level debug

  • --lang: kk or ky
  • --ipa: emit IPA alongside Latin
  • --arabic: also transliterate embedded Arabic script
  • --benchmark: print throughput statistics
  • --log-level: debug | info | warning | error | critical (default: info)
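If you want to drive the CLI from a Python script, one option is to build the argument list and hand it to subprocess. The sketch below uses only the flags that appear in the example command above; build_translit_command and run_translit are hypothetical helpers, and the package may offer a more direct Python API that is not shown here.

```python
import shutil
import subprocess
from typing import List, Optional


def build_translit_command(lang: str, in_path: str, out_latin: str,
                           out_ipa: Optional[str] = None,
                           arabic: bool = False,
                           log_level: str = "info") -> List[str]:
    """Assemble a turkic-translit argument list (hypothetical helper)."""
    if lang not in ("kk", "ky"):
        raise ValueError("lang must be 'kk' or 'ky'")
    cmd = ["turkic-translit", "--lang", lang,
           "--in", in_path, "--out_latin", out_latin]
    if out_ipa is not None:
        # --ipa turns IPA output on; --out_ipa names the destination file.
        cmd += ["--ipa", "--out_ipa", out_ipa]
    if arabic:
        cmd.append("--arabic")
    cmd += ["--log-level", log_level]
    return cmd


def run_translit(**kwargs) -> subprocess.CompletedProcess:
    """Run the CLI, failing early if the entry point is not installed."""
    if shutil.which("turkic-translit") is None:
        raise RuntimeError("turkic-translit is not on PATH; install the package first")
    return subprocess.run(build_translit_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.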

Logging

The central logging setup uses Rich for colour when available and falls back to standard logging when Rich is absent. To change the level, set TURKIC_LOG_LEVEL or pass --log-level to the CLI.
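The Rich-with-fallback pattern described above can be sketched roughly as follows. This is an illustration, not the package's actual logging module: the function names, the logger name, and the precedence (CLI value first, then TURKIC_LOG_LEVEL, then info) are assumptions.

```python
import logging
import os
from typing import Mapping, Optional

_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


def resolve_level(cli_level: Optional[str] = None,
                  env: Optional[Mapping[str, str]] = None) -> int:
    """Pick a level: CLI value, else TURKIC_LOG_LEVEL, else info (assumed order)."""
    env = os.environ if env is None else env
    name = (cli_level or env.get("TURKIC_LOG_LEVEL") or "info").lower()
    return _LEVELS.get(name, logging.INFO)


def make_handler() -> logging.Handler:
    """Use Rich's handler when Rich is installed, else a plain stream handler."""
    try:
        from rich.logging import RichHandler
    except ImportError:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
        return handler
    return RichHandler()


def setup_logging(cli_level: Optional[str] = None) -> logging.Logger:
    """Configure and return a demo logger (the name is hypothetical)."""
    logger = logging.getLogger("turkic_translit_demo")
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(make_handler())
    logger.setLevel(resolve_level(cli_level))
    return logger
```

Importing Rich inside the function keeps it an optional dependency: the module loads either way, and only the handler choice changes.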

Web demo

python web_demo.py

Opens a local Gradio interface for real-time transliteration.

Tokenizer training example

python scripts/build_spm.py --input corpora/kk_lat.txt,corpora/ky_lat.txt --model_prefix spm/turkic12k --vocab_size 12000
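For reference, a roughly equivalent call through the SentencePiece Python API might look like the sketch below. It assumes that scripts/build_spm.py wraps SentencePiece, which the script name suggests but this README does not confirm; spm_train_kwargs and train_tokenizer are hypothetical helpers, and the script itself may set further options.

```python
from typing import Dict, Sequence, Union


def spm_train_kwargs(inputs: Sequence[str], model_prefix: str,
                     vocab_size: int) -> Dict[str, Union[str, int]]:
    """Map the CLI arguments above onto SentencePiece trainer keywords."""
    return {
        "input": ",".join(inputs),  # SentencePiece takes a comma-separated list
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
    }


def train_tokenizer(inputs: Sequence[str], model_prefix: str,
                    vocab_size: int) -> None:
    """Train a model; needs the third-party sentencepiece package."""
    import sentencepiece as spm  # imported lazily so the helper above works without it

    spm.SentencePieceTrainer.train(
        **spm_train_kwargs(inputs, model_prefix, vocab_size))


if __name__ == "__main__":
    # Same corpora, prefix, and vocabulary size as the command above.
    print(spm_train_kwargs(
        ["corpora/kk_lat.txt", "corpora/ky_lat.txt"], "spm/turkic12k", 12000))
```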

Filtering Russian tokens from Uzbek

cat uz_raw.txt | python scripts/filter_russian.py --mode drop > uz_clean.txt

Developer checklist

black .
ruff check .
pytest -q

All code is UTF-8-only; on Windows a BOM is written when piping to files to avoid encoding issues.
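One way to get this behaviour in Python is the built-in utf-8-sig codec, which prepends a BOM when encoding. The sketch below only illustrates the idea; output_encoding and encode_for_output are hypothetical names, and the package's own implementation may differ.

```python
import sys
from typing import Optional


def output_encoding(platform: Optional[str] = None) -> str:
    """Choose the codec: with BOM on Windows, plain UTF-8 elsewhere."""
    platform = sys.platform if platform is None else platform
    return "utf-8-sig" if platform.startswith("win") else "utf-8"


def encode_for_output(text: str, platform: Optional[str] = None) -> bytes:
    """Encode text for writing to a file or redirected stream."""
    return text.encode(output_encoding(platform))


if __name__ == "__main__":
    data = encode_for_output("qazaq", platform="win32")
    # The first three bytes are the UTF-8 BOM: EF BB BF.
    print(data[:3].hex())
```

The BOM lets Windows tools that would otherwise guess a legacy code page recognise the file as UTF-8.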

License

Apache-2.0
