Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus tokenizer/glue scripts.
Project description
turkic_transliterate
Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus helper utilities for tokenizer training and Russian-token filtering.
Quick install
- Install Miniconda or Anaconda (recommended).
- Clone the repo and create the environment: conda env create -f env.yml
- Activate the environment: conda activate turkic
- Run the verification tests: python -m pytest (all tests should pass)
Python compatibility
- Works on CPython 3.10 and 3.11.
- CPython 3.12+ is supported on every platform except Windows, where support waits on official PyICU wheels; see “Windows & PyICU” below.
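The compatibility rules above can be expressed as a small check. This is a minimal sketch, not part of the package: the function name `is_supported` is hypothetical, and treating versions below 3.10 as unsupported is an assumption, since the text above only states what works.

```python
import sys


def is_supported(version_info=sys.version_info, platform=sys.platform):
    """Apply the compatibility notes above to an interpreter/platform pair.

    Assumption: versions older than 3.10 are treated as unsupported,
    because the notes only list 3.10, 3.11 and 3.12+.
    """
    major_minor = tuple(version_info[:2])
    if major_minor < (3, 10):
        return False
    if major_minor >= (3, 12) and platform.startswith("win"):
        # CPython 3.12+ on Windows is the one documented exception.
        return False
    return True


if not is_supported():
    print("This interpreter/platform pair is outside the documented range.")
```

Passing the version and platform as arguments, rather than reading `sys` inside the function, keeps the check easy to exercise for combinations other than the current machine.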
Package names
- Runtime import path: turkic_translit
- Distributable name on PyPI: turkic_transliterate
- Command-line entry point: turkic-translit
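Because the import path and the distribution name differ, it is easy to look up the wrong one. The sketch below uses only the standard library's `importlib.metadata` to query the installed version by distribution name. The helper `installed_version` is a hypothetical name introduced for illustration.

```python
from importlib import metadata

DIST_NAME = "turkic_transliterate"  # name used with pip and on PyPI
IMPORT_NAME = "turkic_translit"     # name used in `import` statements


def installed_version(dist_name=DIST_NAME):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print(f"{DIST_NAME} is not installed in this environment")
else:
    print(f"{DIST_NAME} {version} is installed; import it as {IMPORT_NAME}")
```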
Installing with pip
pip install -e .[dev,ui]   # add ,winlid on Windows if you need fasttext-wheel
Optional extras
- dev → black, ruff, pytest
- ui → gradio web demo
- winlid (Windows only) → fasttext-wheel for language ID
Windows & PyICU
Important: Due to PyPI rules, the correct PyICU wheel for Windows cannot be installed automatically during pip install. After installing this package with pip, Windows users must run the helper script to install the appropriate PyICU wheel:
python scripts/get_pyicu_wheel.py
This script will download and install the correct PyICU wheel from Christoph Gohlke’s repository based on your Python version. See the script for details.
Command-line usage
turkic-translit --lang kk --in text.txt --out_latin kk_lat.txt --ipa --out_ipa kk_ipa.txt --arabic --log-level debug
- --lang: kk or ky
- --ipa: emit IPA alongside Latin
- --arabic: also transliterate embedded Arabic script
- --benchmark: print throughput statistics
- --log-level: debug | info | warning | error | critical (default: info)
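When the CLI is driven from a script, the argument list can be assembled programmatically. The following is a minimal sketch that uses only the flags documented above. `build_translit_command` is a hypothetical helper, not something the package provides, and the file names are placeholders.

```python
import subprocess  # used only in the commented usage line at the bottom


def build_translit_command(lang, in_path, out_latin, out_ipa=None,
                           arabic=False, log_level="info"):
    """Assemble an argv list for the turkic-translit entry point."""
    if lang not in ("kk", "ky"):
        raise ValueError("--lang accepts only 'kk' or 'ky'")
    cmd = ["turkic-translit", "--lang", lang,
           "--in", in_path, "--out_latin", out_latin]
    if out_ipa is not None:
        # IPA output is requested with --ipa plus a destination file.
        cmd += ["--ipa", "--out_ipa", out_ipa]
    if arabic:
        cmd.append("--arabic")
    cmd += ["--log-level", log_level]
    return cmd


cmd = build_translit_command("kk", "text.txt", "kk_lat.txt",
                             out_ipa="kk_ipa.txt", arabic=True,
                             log_level="debug")
print(" ".join(cmd))
# To actually run it (requires the package to be installed):
# subprocess.run(cmd, check=True)
```

Passing a list rather than a shell string to `subprocess.run` avoids quoting problems with file names that contain spaces.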
Logging
The central logging setup uses Rich for coloured output when it is available and falls back to the standard logging module when Rich is absent. Set the level with the TURKIC_LOG_LEVEL environment variable or pass --log-level to the CLI.
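A Rich-with-fallback setup of this kind might look like the sketch below. It is an illustration of the pattern, not the package's actual code: the function name `setup_logging`, the logger name, and the rule that an explicit level wins over TURKIC_LOG_LEVEL are all assumptions.

```python
import logging
import os


def setup_logging(level=None, name="turkic_translit_demo"):
    """Configure a logger that uses Rich when importable, plain logging otherwise.

    Assumption: an explicit `level` argument takes precedence over the
    TURKIC_LOG_LEVEL environment variable, which defaults to "info".
    """
    level_name = (level or os.environ.get("TURKIC_LOG_LEVEL", "info")).upper()
    try:
        from rich.logging import RichHandler
        handler = RichHandler()
    except ImportError:
        # Rich is absent: fall back to a standard stream handler.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.setLevel(level_name)
    logger.propagate = False
    return logger


log = setup_logging("debug")
log.debug("logging configured")
```

Importing Rich inside the `try` block keeps it an optional dependency, so the module still loads on systems where it is not installed.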
Web demo
python web_demo.py
Opens a local Gradio interface for real-time transliteration.
Tokenizer training example
python scripts/build_spm.py --input corpora/kk_lat.txt,corpora/ky_lat.txt --model_prefix spm/turkic12k --vocab_size 12000
Filtering Russian tokens from Uzbek
cat uz_raw.txt | python scripts/filter_russian.py --mode drop > uz_clean.txt
Developer checklist
black .
ruff check .
pytest -q
All code is UTF-8-only; on Windows a BOM is written when piping to files to avoid encoding issues.
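One way to get the BOM behaviour described above is Python's built-in `utf-8-sig` codec, which prepends a byte-order mark on the first write. The sketch below is an assumption about how this could be done, not the package's implementation. `output_encoding` and `wrap_stream` are hypothetical names, and using "not a terminal" as the signal for piped output is also an assumption.

```python
import io
import sys


def output_encoding(platform=sys.platform, is_tty=True):
    """Pick utf-8-sig (UTF-8 with BOM) for redirected output on Windows, else utf-8."""
    if platform.startswith("win") and not is_tty:
        return "utf-8-sig"
    return "utf-8"


def wrap_stream(raw, platform=sys.platform, is_tty=False):
    """Wrap a binary stream in a text layer using the encoding chosen above."""
    return io.TextIOWrapper(
        raw, encoding=output_encoding(platform, is_tty), newline="\n"
    )


# Demonstration on an in-memory buffer standing in for a redirected file.
buf = io.BytesIO()
out = wrap_stream(buf, platform="win32", is_tty=False)
out.write("сәлем\n")
out.flush()
print(buf.getvalue()[:3])  # the first three bytes are the UTF-8 BOM
```

In a real script the same wrapper could be applied to `sys.stdout.buffer`, with `sys.stdout.isatty()` supplying the `is_tty` flag.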
License
Apache-2.0
Download files
File details
Details for the file turkic_transliterate-0.1.1.tar.gz.
File metadata
- Download URL: turkic_transliterate-0.1.1.tar.gz
- Upload date:
- Size: 16.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a70c837cc8078b9dcf6b2a526bae80d45337095a7f84002b8e3f41ca284666fc |
| MD5 | cf4747f056d742ca46336b840578df2a |
| BLAKE2b-256 | 25729bd8448f6aa017e67e285256f8487cb6d816d3f7abb473eefb21520c2dc6 |
File details
Details for the file turkic_transliterate-0.1.1-py3-none-any.whl.
File metadata
- Download URL: turkic_transliterate-0.1.1-py3-none-any.whl
- Upload date:
- Size: 14.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bcc0065052f58aa6d3d171009cc4cbe2539365ace876a5ca50541bb54330f74b |
| MD5 | 8be390245fee73750515cccc80a95183 |
| BLAKE2b-256 | 0e136c10768613db35ca10a05f9930e1a5dda46c86ca053d0869bedec180d86f |