Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus tokenizer/glue scripts.
turkic_transliterate
Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus helper utilities for tokenizer training and Russian-token filtering.
Quick install
- Install Miniconda or Anaconda (recommended).
- Clone the repo and create the environment: conda env create -f env.yml
- Activate the environment: conda activate turkic
- Run the verification tests: python -m pytest (all tests should pass)
Python compatibility
- Works on CPython 3.10 and 3.11.
- CPython 3.12+ is supported on every platform except Windows, where support is pending official PyICU wheels; see “Windows & PyICU” below.
Package names
- Runtime import path: turkic_translit
- Distributable name on PyPI: turkic_transliterate
- Command-line entry point: turkic-translit
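Because the distribution name and the import name differ, it is easy to query the wrong one. The following is a minimal sketch, using only the standard library, of looking up the installed version by distribution name. The helper `installed_version` and the two constants are hypothetical and are not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

DIST_NAME = "turkic_transliterate"  # name used with pip and on PyPI
IMPORT_NAME = "turkic_translit"     # name used in `import` statements


def installed_version(dist_name: str = DIST_NAME) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print(f"{DIST_NAME} is not installed; try: pip install {DIST_NAME}")
    else:
        print(f"{DIST_NAME} {found} is installed; use: import {IMPORT_NAME}")
```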
Developer Setup
For the simplest developer setup experience, run the setup script:
python scripts/setup_dev.py
This script will:
- Install the package with all development dependencies
- Set up PyICU on Windows automatically
- Verify that development tools are working properly
Manual Installation
Alternatively, install with pip:
pip install -e ".[dev,ui]" # quoted so shells such as zsh do not expand the brackets; add ,winlid on Windows if you need fasttext-wheel
Development Tools
Linux/macOS/Windows with GNU Make
If you have GNU Make installed, you can use the Makefile for common tasks:
make lint # Run linting (ruff, black, mypy)
make format # Auto-format code
make test # Run tests
make web # Launch the web UI
make help # Show all available commands
Windows
Option 1: Install GNU Make using Chocolatey (Recommended)
Install GNU Make using Chocolatey (requires admin privileges):
# In an Admin PowerShell window
choco install make
After installation, you can use the same make commands as on Linux/macOS.
Option 2: Use the PowerShell Script Alternative
If you prefer not to install Chocolatey or GNU Make, use the PowerShell script:
./scripts/run.ps1 lint # Run linting
./scripts/run.ps1 format # Auto-format code
./scripts/run.ps1 test # Run tests
./scripts/run.ps1 web # Launch the web UI
./scripts/run.ps1 help # Show all available commands
Optional extras
- dev → black, ruff, pytest
- ui → gradio web demo
- winlid (Windows only) → fasttext-wheel for language ID
Windows & PyICU
Important: Due to PyPI rules, the correct PyICU wheel for Windows cannot be installed automatically during pip install. After installing this package with pip, Windows users must run the helper script to install the appropriate PyICU wheel:
turkic-pyicu-install
This script will download and install the correct PyICU wheel from Christoph Gohlke’s repository based on your Python version. See the script for details.
Command-line usage
turkic-translit --lang kk --in text.txt --out_latin kk_lat.txt --ipa --out_ipa kk_ipa.txt --arabic --log-level debug
- --lang kk or ky
- --ipa emit IPA alongside Latin
- --arabic also transliterate embedded Arabic script
- --benchmark print throughput statistics
- --log-level debug | info | warning | error | critical (default: info)
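The same invocation can be driven from a Python script through `subprocess`. This is only a sketch: `build_command` is a hypothetical helper, it uses only the flags listed above, and it calls the CLI only when `turkic-translit` is found on PATH.

```python
import shutil
import subprocess
from typing import List


def build_command(lang: str, src: str, out_latin: str, out_ipa: str) -> List[str]:
    """Assemble the argv list for the turkic-translit CLI (flags as listed above)."""
    if lang not in ("kk", "ky"):
        raise ValueError("lang must be 'kk' or 'ky'")
    return [
        "turkic-translit",
        "--lang", lang,
        "--in", src,
        "--out_latin", out_latin,
        "--ipa", "--out_ipa", out_ipa,
        "--log-level", "info",
    ]


if __name__ == "__main__":
    cmd = build_command("kk", "text.txt", "kk_lat.txt", "kk_ipa.txt")
    if shutil.which(cmd[0]) is None:
        print("turkic-translit not found on PATH; is the package installed?")
    else:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.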
Logging
The central logging setup uses Rich for colour when available and falls back to standard logging when Rich is absent. To change the level, set TURKIC_LOG_LEVEL or pass --log-level to the CLI.
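The Rich-or-fallback pattern described here can be sketched as follows. This is not the package's actual logging module: `setup_logging`, the logger name `turkic_demo`, and the rule that an explicit level overrides TURKIC_LOG_LEVEL are all assumptions made for illustration.

```python
import logging
import os
from typing import Optional

_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


def setup_logging(level: Optional[str] = None, name: str = "turkic_demo") -> logging.Logger:
    """Configure a logger; an explicit level wins over TURKIC_LOG_LEVEL (assumed)."""
    chosen = (level or os.environ.get("TURKIC_LOG_LEVEL", "info")).lower()
    if chosen not in _LEVELS:
        raise ValueError(f"unknown log level: {chosen!r}")

    try:
        from rich.logging import RichHandler  # optional dependency
        handler: logging.Handler = RichHandler()
    except ImportError:
        # Rich is absent: use a plain stderr handler instead.
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate output on repeated calls
    logger.addHandler(handler)
    logger.setLevel(_LEVELS[chosen])
    logger.propagate = False
    return logger
```

Importing Rich inside the `try` block keeps it a soft dependency, so the same code runs whether or not Rich is installed.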
Project Organization
The project is organized into the following directories:
- src/turkic_translit/ - Core source code for the package
- examples/ - Example scripts showing how to use the package
- examples/web/ - Web interface for demonstrating transliteration features
- data/ - Sample data files and language resources
- docs/ - Documentation and reference materials
- scripts/ - Utility scripts for development and release
- scripts/release/ - Scripts for building and publishing packages
- vendor/pyicu/ - Pre-built PyICU wheels for Windows
- tests/ - Test suite for the package
FastText Language Identification Model
This package uses the FastText language identification model (lid.176.bin) for Russian token filtering and language detection. The model file is not included in the repository or pip package due to its large size.
Automatic Download:
- When you use features that require language identification (such as Russian token filtering or the Gradio web demo), the package will automatically download lid.176.bin from the official Facebook AI public link if it is not already present.
- The file will be saved in the package directory on first use.
No manual action is needed. This ensures compatibility with pip installs, Hugging Face Spaces, and other cloud environments.
If you need to download the model manually, you can do so from: https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
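For readers who want to reproduce the download-on-first-use behaviour in their own scripts, a minimal sketch follows. It is not the package's implementation: `ensure_model`, its `fetch` parameter, and the `.part` temporary name are hypothetical, and the caller chooses where the file is stored.

```python
import urllib.request
from pathlib import Path
from typing import Callable

MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"


def ensure_model(
    target: Path,
    url: str = MODEL_URL,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> Path:
    """Return `target`, downloading it from `url` first if it is missing."""
    if target.exists():
        return target  # already cached, no network access needed
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    fetch(url, str(partial))  # download under a temporary name
    partial.replace(target)   # rename only after the download finished
    return target
```

Downloading to a temporary name and renaming afterwards means an interrupted download is never mistaken for a complete model file on the next run.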
Using the Examples
Use the main entry point script to run examples:
python turkic_tools.py [command]
Available commands:
- web - Launch the Gradio web interface for real-time transliteration
- demo - Run the simple CLI demo
- full-demo - Run the comprehensive demo with multiple languages
- help - Display available commands
Tokenizer training example
turkic-build-spm --input corpora/kk_lat.txt,corpora/ky_lat.txt --model_prefix spm/turkic12k --vocab_size 12000
Filtering Russian tokens from Uzbek
cat uz_raw.txt | turkic-filter-russian --mode drop > uz_clean.txt
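Conceptually, drop mode removes every token that a language-ID step flags as Russian. The sketch below illustrates that idea only; it is not the code behind turkic-filter-russian. The name `drop_tokens`, the whitespace tokenisation, and the toy word list standing in for a real language-ID call are all assumptions.

```python
from typing import Callable, Iterable, Iterator


def drop_tokens(
    lines: Iterable[str],
    is_russian: Callable[[str], bool],
) -> Iterator[str]:
    """Yield each line with the tokens flagged by `is_russian` removed."""
    for line in lines:
        # Assumption: tokens are separated by whitespace.
        kept = [tok for tok in line.split() if not is_russian(tok)]
        yield " ".join(kept)


if __name__ == "__main__":
    # Toy stand-in for a real language-ID call: a tiny hard-coded word list.
    russian_words = {"привет", "спасибо"}
    sample = ["salom привет dunyo", "rahmat спасибо"]
    for cleaned in drop_tokens(sample, lambda t: t.lower() in russian_words):
        print(cleaned)
```

Taking the predicate as a parameter keeps the filtering loop independent of whichever language-ID model is plugged in.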
Developer checklist
black .
ruff check .
pytest -q
All code is UTF-8-only; on Windows a BOM is written when piping to files to avoid encoding issues.
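In Python, a UTF-8 byte-order mark can be produced with the standard `utf-8-sig` codec. The sketch below shows one way to choose the encoding per platform; `output_encoding` and `write_text` are hypothetical names, and the package's own handling of piped output may differ.

```python
import codecs
import sys
from pathlib import Path


def output_encoding(platform: str = sys.platform) -> str:
    """Pick 'utf-8-sig' (UTF-8 with BOM) on Windows, plain 'utf-8' elsewhere."""
    return "utf-8-sig" if platform.startswith("win") else "utf-8"


def write_text(path: Path, text: str, platform: str = sys.platform) -> None:
    """Write text as UTF-8, prefixing a BOM only for Windows."""
    with open(path, "w", encoding=output_encoding(platform), newline="") as fh:
        fh.write(text)


def has_bom(path: Path) -> bool:
    """Report whether the file starts with the UTF-8 byte-order mark."""
    return path.read_bytes().startswith(codecs.BOM_UTF8)
```

Reading such a file back with `encoding="utf-8-sig"` strips the BOM again, and the same codec also reads files that have no BOM.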
License
Apache-2.0
Type-checking
pip install mypy
mypy --strict .
The included mypy.ini restricts analysis to the src/ tree and skips build/, dist/, virtual-env and egg directories so duplicate-module errors do not occur even if you build wheels locally.
File details
Details for the file turkic_transliterate-0.1.6.tar.gz.
File metadata
- Download URL: turkic_transliterate-0.1.6.tar.gz
- Upload date:
- Size: 48.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0ad75a872c5f5e2ff6bb6e0aefe0fde0d60fb852b306d4252e025454ff9ebdc5 |
| MD5 | e34cdded4483609a9fa02a37cf1e1a41 |
| BLAKE2b-256 | 0c78e527fcde57b34de466206c30c55274de8acbaa97dd704b11fba3ade35787 |
File details
Details for the file turkic_transliterate-0.1.6-py3-none-any.whl.
File metadata
- Download URL: turkic_transliterate-0.1.6-py3-none-any.whl
- Upload date:
- Size: 41.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e12c297b0e0d6209c3d529c53534fdebc3fbd4b80aa1f47c09464fe058c2fb8b |
| MD5 | 350c7699582798472094b61c76eab3b6 |
| BLAKE2b-256 | fdda30d0c5dcc76ad859aeb45bd91f9dcc361c0ed3e581b3e685a661f1459019 |