turkic_transliterate

Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus helper utilities for tokenizer training and Russian-token filtering.

Quick install

  1. Install Miniconda or Anaconda (recommended).
  2. Clone the repo and create the environment: conda env create -f env.yml
  3. Activate the environment: conda activate turkic
  4. Run the verification tests: python -m pytest (all tests should pass)

Python compatibility

  • Works on CPython 3.10 and 3.11.
  • CPython 3.12+ is supported everywhere except on Windows until official PyICU wheels are available; see “Windows & PyICU” below.
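The compatibility notes above can be written as a small check. This is only a sketch: the function name is_supported is hypothetical and not part of the package, and it encodes nothing beyond the two rules stated above (3.10 or newer, with 3.12+ excluded on Windows for now).

```python
import sys


def is_supported(version=None, platform=None):
    """Return True if the interpreter matches the compatibility notes.

    Hypothetical helper, not part of turkic_translit. `version` is a
    (major, minor) tuple and `platform` a sys.platform-style string;
    both default to the running interpreter.
    """
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    platform = sys.platform if platform is None else platform

    # Versions older than 3.10 are not listed as working.
    if version < (3, 10):
        return False
    # 3.12+ is excluded on Windows until official PyICU wheels exist.
    if version >= (3, 12) and platform == "win32":
        return False
    return True
```

Calling is_supported() with no arguments checks the interpreter that is currently running.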

Package names

  • Runtime import path: turkic_translit
  • Distributable name on PyPI: turkic_transliterate
  • Command-line entry point: turkic-translit
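Because the distribution name and the import name differ, it is easy to mix them up. The following sketch uses only the standard library to look up each name in the right place; describe_install is a hypothetical helper name, and the two constants are simply the names listed above.

```python
from importlib import metadata, util

DIST_NAME = "turkic_transliterate"  # name used with pip and on PyPI
IMPORT_NAME = "turkic_translit"     # name used in `import` statements


def describe_install():
    """Return (installed version or None, whether the module is importable)."""
    try:
        # Distribution metadata is keyed by the PyPI name.
        version = metadata.version(DIST_NAME)
    except metadata.PackageNotFoundError:
        version = None
    # Importability is keyed by the runtime import path.
    importable = util.find_spec(IMPORT_NAME) is not None
    return version, importable
```

If the package is not installed, this returns (None, False) rather than raising.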

Installing with pip

pip install -e .[dev,ui]   # add ,winlid on Windows if you need fasttext-wheel

Optional extras

  • dev → black, ruff, pytest
  • ui → gradio web demo
  • winlid (Windows only) → fasttext-wheel for language ID

Windows & PyICU

Important: Due to PyPI rules, the correct PyICU wheel for Windows cannot be installed automatically during pip install. After installing this package with pip, Windows users must run the helper script to install the appropriate PyICU wheel:

python scripts/get_pyicu_wheel.py

This script will download and install the correct PyICU wheel from Christoph Gohlke’s repository based on your Python version. See the script for details.

Command-line usage

turkic-translit --lang kk --in text.txt --out_latin kk_lat.txt --ipa --out_ipa kk_ipa.txt --arabic --log-level debug

  • --lang kk or ky
  • --ipa emit IPA alongside Latin
  • --arabic also transliterate embedded Arabic script
  • --benchmark print throughput statistics
  • --log-level debug | info | warning | error | critical (default: info)
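When the CLI is driven from a script, it can help to build the argument list programmatically. The sketch below uses only the flags documented above; the helper names build_command and run_transliteration are hypothetical, and the assumption that --out_ipa is always paired with --ipa follows the example command rather than any documented rule.

```python
import subprocess


def build_command(lang, in_path, out_latin, out_ipa=None, arabic=False,
                  benchmark=False, log_level="info"):
    """Assemble a turkic-translit argument list from the documented flags."""
    if lang not in ("kk", "ky"):
        raise ValueError("lang must be 'kk' or 'ky'")
    cmd = ["turkic-translit", "--lang", lang,
           "--in", str(in_path), "--out_latin", str(out_latin)]
    if out_ipa is not None:
        # Assumption: IPA output is requested with --ipa plus --out_ipa.
        cmd += ["--ipa", "--out_ipa", str(out_ipa)]
    if arabic:
        cmd.append("--arabic")
    if benchmark:
        cmd.append("--benchmark")
    cmd += ["--log-level", log_level]
    return cmd


def run_transliteration(**kwargs):
    """Run the CLI; requires turkic-translit to be installed and on PATH."""
    return subprocess.run(build_command(**kwargs), check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with file names that contain spaces.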

Logging

The central logging setup uses Rich for colour when available and falls back to standard logging when Rich is absent. To change the level, set TURKIC_LOG_LEVEL or pass --log-level to the CLI.
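The Rich-with-fallback pattern described here might look roughly like the following. This is a sketch, not the package's actual setup code: the function name setup_logging, the logger name, and the rule that an explicit argument wins over TURKIC_LOG_LEVEL are all assumptions.

```python
import logging
import os


def setup_logging(level=None):
    """Configure a logger with Rich output if available, plain otherwise."""
    # Assumption: an explicit level beats the environment variable,
    # and "info" is the default, as for the CLI flag.
    level_name = (level or os.environ.get("TURKIC_LOG_LEVEL") or "info").upper()

    try:
        from rich.logging import RichHandler  # optional dependency
        handler = RichHandler()
    except ImportError:
        # Rich is absent: fall back to the standard library handler.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s"))

    logger = logging.getLogger("turkic_translit_sketch")  # hypothetical name
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.setLevel(level_name)
    logger.propagate = False
    return logger
```

Importing Rich inside the try block keeps it a soft dependency, so the code still runs in a minimal environment.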

Web demo

python web_demo.py

Opens a local Gradio interface for real-time transliteration.

Tokenizer training example

python scripts/build_spm.py --input corpora/kk_lat.txt,corpora/ky_lat.txt --model_prefix spm/turkic12k --vocab_size 12000
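The script name and flags suggest a thin wrapper around SentencePiece, although the README does not say so. Under that assumption, an equivalent call from Python could be sketched as follows; spm_train_kwargs and train are hypothetical helper names, and the sentencepiece package must be installed for train to work.

```python
def spm_train_kwargs(input_paths, model_prefix, vocab_size):
    """Map the three command-line flags onto SentencePiece trainer options."""
    return {
        # SentencePiece accepts several corpora as one comma-separated string.
        "input": ",".join(str(p) for p in input_paths),
        "model_prefix": str(model_prefix),
        "vocab_size": int(vocab_size),
    }


def train(input_paths, model_prefix, vocab_size):
    """Train a model; writes <model_prefix>.model and <model_prefix>.vocab."""
    import sentencepiece as spm  # third-party, imported lazily

    spm.SentencePieceTrainer.train(
        **spm_train_kwargs(input_paths, model_prefix, vocab_size))
```

For the example above, the call would be train(["corpora/kk_lat.txt", "corpora/ky_lat.txt"], "spm/turkic12k", 12000).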

Filtering Russian tokens from Uzbek

cat uz_raw.txt | python scripts/filter_russian.py --mode drop > uz_clean.txt
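The README does not describe how filter_russian.py decides that a token is Russian. To illustrate the stdin-to-stdout "drop" shape only, here is a stand-in sketch that assumes the Uzbek input is in Latin script and treats any token containing a Cyrillic character as Russian. That heuristic and the function names are assumptions, not the script's actual logic.

```python
import re
import sys

# Stand-in heuristic: any character in the basic Cyrillic block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def drop_cyrillic_tokens(line):
    """Remove whitespace-separated tokens that contain Cyrillic letters."""
    kept = [tok for tok in line.split() if not CYRILLIC.search(tok)]
    return " ".join(kept)


def main(stdin=None, stdout=None):
    """Filter line by line so large corpora are streamed, not loaded whole."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    for line in stdin:
        stdout.write(drop_cyrillic_tokens(line) + "\n")
```

This heuristic would not work for Cyrillic-script Uzbek, where a language-ID model would be needed to separate the two languages.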

Developer checklist

black .
ruff check .
pytest -q

All code is UTF-8-only; on Windows a BOM is written when piping to files to avoid encoding issues.
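One common way to get this behaviour in Python is the utf-8-sig codec, which prepends a BOM when encoding. The sketch below shows the idea; whether the package uses this codec is an assumption, and output_encoding and write_text are hypothetical names.

```python
import sys


def output_encoding(platform=None):
    """Pick utf-8-sig (UTF-8 with BOM) on Windows and plain utf-8 elsewhere."""
    platform = sys.platform if platform is None else platform
    return "utf-8-sig" if platform == "win32" else "utf-8"


def write_text(path, text, platform=None):
    """Write text to a file using the platform-appropriate encoding."""
    with open(path, "w", encoding=output_encoding(platform), newline="") as fh:
        fh.write(text)
```

Readers that open such files with encoding="utf-8-sig" handle both variants, because that codec strips a leading BOM when one is present.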

License

Apache-2.0
