Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus tokenizer/glue scripts.

turkic_transliterate

Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus helper utilities for tokenizer training and Russian-token filtering.

Quick install

  1. Install Miniconda or Anaconda (recommended).
  2. Clone the repo and create the environment: conda env create -f env.yml
  3. Activate the environment: conda activate turkic
  4. Run the verification tests: python -m pytest (all tests should pass)

Python compatibility

  • Works on CPython 3.10 and 3.11.
  • CPython 3.12+ is supported on every platform except Windows, where support waits on official PyICU wheels; see “Windows & PyICU” below.
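The compatibility rules above can be written as a small pre-flight check. This is only a sketch and not part of the package: the function name is_supported is hypothetical, and treating versions below 3.10 as unsupported is an assumption, since the README does not mention them.

```python
import sys


def is_supported(version: tuple = sys.version_info[:2],
                 platform: str = sys.platform) -> bool:
    """Return True if the (major, minor) version and platform match the rules above."""
    if version < (3, 10):
        # Assumption: versions older than 3.10 are not listed, so treat them as unsupported.
        return False
    if version >= (3, 12) and platform.startswith("win"):
        # 3.12+ on Windows waits on official PyICU wheels.
        return False
    return True


if __name__ == "__main__":
    print("supported" if is_supported() else "not supported")
```

Passing the version and platform as arguments keeps the check easy to test without changing the running interpreter.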

Package names

  • Runtime import path: turkic_translit
  • Distributable name on PyPI: turkic_transliterate
  • Command-line entry point: turkic-translit
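Because the import path and the distribution name differ, it is easy to query the wrong one. The sketch below uses the standard-library importlib.metadata to look up the installed version by distribution name. The helper installed_version is hypothetical and is not shipped with the package.

```python
from __future__ import annotations

from importlib import metadata

DIST_NAME = "turkic_transliterate"  # name used with pip and on PyPI
IMPORT_NAME = "turkic_translit"     # name used in `import` statements


def installed_version(dist_name: str = DIST_NAME) -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(f"{DIST_NAME}: {installed_version() or 'not installed'}")
    print(f"import it as: import {IMPORT_NAME}")
```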

Developer Setup

For the simplest developer setup experience, run the setup script:

python scripts/setup_dev.py

This script will:

  1. Install the package with all development dependencies
  2. Set up PyICU on Windows automatically
  3. Verify that development tools are working properly

Manual Installation

Alternatively, install with pip:

pip install -e ".[dev,ui]"      # add ,winlid on Windows if you need fasttext-wheel; the quotes keep shells such as zsh from expanding the brackets

Development Tools

Linux/macOS/Windows with GNU Make

If you have GNU Make installed, you can use the Makefile for common tasks:

make lint       # Run linting (ruff, black, mypy)
make format     # Auto-format code
make test       # Run tests
make web        # Launch the web UI
make help       # Show all available commands

Windows

Option 1: Install GNU Make using Chocolatey (Recommended)

Install GNU Make using Chocolatey (requires admin privileges):

# In an Admin PowerShell window
choco install make

After installation, you can use the same make commands as on Linux/macOS.

Option 2: Use the PowerShell Script Alternative

If you prefer not to install Chocolatey or GNU Make, use the PowerShell script:

./scripts/run.ps1 lint       # Run linting
./scripts/run.ps1 format     # Auto-format code
./scripts/run.ps1 test       # Run tests
./scripts/run.ps1 web        # Launch the web UI
./scripts/run.ps1 help       # Show all available commands

Optional extras

  • dev → black, ruff, pytest
  • ui → gradio web demo
  • winlid (Windows only) → fasttext-wheel for language ID

Windows & PyICU

Important: Due to PyPI rules, the correct PyICU wheel for Windows cannot be installed automatically during pip install. After installing this package with pip, Windows users must run the helper script to install the appropriate PyICU wheel:

turkic-pyicu-install

This script will download and install the correct PyICU wheel from Christoph Gohlke’s repository based on your Python version. See the script for details.
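After running the helper, you may want to confirm that PyICU is importable. The sketch below relies only on the fact that PyICU exposes a module named icu. The function name pyicu_available is hypothetical and is not part of this package.

```python
import importlib.util


def pyicu_available() -> bool:
    """Return True if the `icu` module (provided by PyICU) can be found."""
    return importlib.util.find_spec("icu") is not None


if __name__ == "__main__":
    if pyicu_available():
        import icu
        print("PyICU found, ICU version", icu.ICU_VERSION)
    else:
        print("PyICU missing; on Windows, run turkic-pyicu-install")
```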

Command-line usage

turkic-translit --lang kk --in text.txt --out_latin kk_lat.txt --ipa --out_ipa kk_ipa.txt --arabic --log-level debug

  • --lang - kk or ky
  • --ipa - emit IPA alongside Latin
  • --arabic - also transliterate embedded Arabic script
  • --benchmark - print throughput statistics
  • --log-level - debug | info | warning | error | critical (default: info)
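To make the shape of the documented flags concrete, here is a hypothetical argparse parser that accepts the same command line. It is an illustration only: the real CLI may be built differently, and the dest name in_path, the required flag on --lang, and the store_true actions are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented above."""
    p = argparse.ArgumentParser(prog="turkic-translit")
    p.add_argument("--lang", choices=["kk", "ky"], required=True)
    # `in` is a Python keyword, so store the value under a custom dest.
    p.add_argument("--in", dest="in_path")
    p.add_argument("--out_latin")
    p.add_argument("--ipa", action="store_true")
    p.add_argument("--out_ipa")
    p.add_argument("--arabic", action="store_true")
    p.add_argument("--benchmark", action="store_true")
    p.add_argument("--log-level", default="info",
                   choices=["debug", "info", "warning", "error", "critical"])
    return p


args = build_parser().parse_args([
    "--lang", "kk", "--in", "text.txt", "--out_latin", "kk_lat.txt",
    "--ipa", "--out_ipa", "kk_ipa.txt", "--arabic", "--log-level", "debug",
])
print(args.lang, args.in_path, args.ipa, args.log_level)  # prints: kk text.txt True debug
```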

Logging

The central logging setup uses Rich for colour when available and falls back to standard logging when Rich is absent. To change the level, set TURKIC_LOG_LEVEL or pass --log-level to the CLI.
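The Rich-with-fallback pattern described here can be sketched as follows. This is not the package's actual logging module: the function name setup_logging, the logger name, and the rule that an explicit argument wins over TURKIC_LOG_LEVEL are all assumptions made for illustration.

```python
from __future__ import annotations

import logging
import os


def setup_logging(level: str | None = None) -> logging.Logger:
    """Configure a logger with Rich if installed, else a plain stream handler."""
    # Assumption: an explicit level overrides the environment variable.
    level_name = (level or os.environ.get("TURKIC_LOG_LEVEL", "info")).upper()
    try:
        from rich.logging import RichHandler
        handler: logging.Handler = RichHandler()
    except ImportError:
        # Rich is absent: fall back to the standard library handler.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s")
        )
    logger = logging.getLogger("turkic_translit_sketch")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.setLevel(level_name)
    return logger


if __name__ == "__main__":
    setup_logging("debug").debug("logging configured")
```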

Project Organization

The project is organized into the following directories:

  • src/turkic_translit/ - Core source code for the package
  • examples/ - Example scripts showing how to use the package
    • examples/web/ - Web interface for demonstrating transliteration features
  • data/ - Sample data files and language resources
  • docs/ - Documentation and reference materials
  • scripts/ - Utility scripts for development and release
    • scripts/release/ - Scripts for building and publishing packages
  • vendor/pyicu/ - Pre-built PyICU wheels for Windows
  • tests/ - Test suite for the package

Using the Examples

Use the main entry point script to run examples:

python turkic_tools.py [command]

Available commands:

  • web - Launch the Gradio web interface for real-time transliteration
  • demo - Run the simple CLI demo
  • full-demo - Run the comprehensive demo with multiple languages
  • help - Display available commands

Tokenizer training example

turkic-build-spm --input corpora/kk_lat.txt,corpora/ky_lat.txt --model_prefix spm/turkic12k --vocab_size 12000

Filtering Russian tokens from Uzbek

cat uz_raw.txt | turkic-filter-russian --mode drop > uz_clean.txt

Developer checklist

black .
ruff check .
pytest -q

All code is UTF-8-only; on Windows a BOM is written when piping to files to avoid encoding issues.
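One way to get this behaviour in Python is the utf-8-sig codec, which prepends a BOM when writing. The sketch below picks the codec by platform. It is an assumption about how such a policy could be implemented rather than the package's own code, and the names output_encoding and write_text are hypothetical.

```python
import sys


def output_encoding(platform: str = sys.platform) -> str:
    """Choose UTF-8 with a BOM on Windows and plain UTF-8 elsewhere."""
    return "utf-8-sig" if platform.startswith("win") else "utf-8"


def write_text(path: str, text: str, platform: str = sys.platform) -> None:
    """Write text to path using the platform-dependent encoding."""
    with open(path, "w", encoding=output_encoding(platform), newline="") as fh:
        fh.write(text)
```

Reading such files back with encoding="utf-8-sig" works on every platform, since that codec strips the BOM if present and otherwise behaves like plain UTF-8.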

License

Apache-2.0

Type-checking

pip install mypy
mypy --strict .

The included mypy.ini restricts analysis to the src/ tree and skips build/, dist/, virtual-env and egg directories so duplicate-module errors do not occur even if you build wheels locally.
