Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus tokenizer and glue scripts.

turkic_transliterate

Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus helper utilities for tokenizer training and Russian-token filtering.

Quick install

  1. Install Miniconda or Anaconda (recommended).
  2. Clone the repo and create the environment: conda env create -f env.yml
  3. Activate the environment: conda activate turkic
  4. Run the verification tests: python -m pytest (all tests should pass)

Python compatibility

  • Works on CPython 3.10 and 3.11.
  • CPython 3.12+ is supported everywhere except on Windows until official PyICU wheels are available; see “Windows & PyICU” below.

Package names

  • Runtime import path: turkic_translit
  • Distributable name on PyPI: turkic_transliterate
  • Command-line entry point: turkic-translit

Developer Setup

For the simplest developer setup experience, run the setup script:

python scripts/setup_dev.py

This script will:

  1. Install the package with all development dependencies
  2. Set up PyICU on Windows automatically
  3. Verify that development tools are working properly

Manual Installation

Alternatively, install with pip:

pip install -e ".[dev,ui]"      # quoted so shells such as zsh do not expand the brackets; add ,winlid on Windows if you need fasttext-wheel

Development Tools

Linux/macOS/Windows with GNU Make

If you have GNU Make installed, you can use the Makefile for common tasks:

make lint       # Run linting (ruff, black, mypy)
make format     # Auto-format code
make test       # Run tests
make web        # Launch the web UI
make help       # Show all available commands

Windows

Option 1: Install GNU Make using Chocolatey (Recommended)

Install GNU Make using Chocolatey (requires admin privileges):

# In an Admin PowerShell window
choco install make

After installation, you can use the same make commands as on Linux/macOS.

Option 2: Use the PowerShell Script Alternative

If you prefer not to install Chocolatey or GNU Make, use the PowerShell script:

./scripts/run.ps1 lint       # Run linting
./scripts/run.ps1 format     # Auto-format code
./scripts/run.ps1 test       # Run tests
./scripts/run.ps1 web        # Launch the web UI
./scripts/run.ps1 help       # Show all available commands

Optional extras

  • dev → black, ruff, pytest
  • ui → gradio web demo
  • winlid (Windows only) → fasttext-wheel for language ID

Windows & PyICU

Important: Due to PyPI rules, the correct PyICU wheel for Windows cannot be installed automatically during pip install. After installing this package with pip, Windows users must run the helper script to install the appropriate PyICU wheel:

turkic-pyicu-install

This script will download and install the correct PyICU wheel from Christoph Gohlke’s repository based on your Python version. See the script for details.

Command-line usage

turkic-translit --lang kk --in text.txt --out_latin kk_lat.txt --ipa --out_ipa kk_ipa.txt --arabic --log-level debug

  • --lang: kk or ky
  • --ipa: emit IPA alongside Latin
  • --arabic: also transliterate embedded Arabic script
  • --benchmark: print throughput statistics
  • --log-level: debug | info | warning | error | critical (default: info)
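When the CLI is driven from another Python script, the documented flags can be assembled into an argument list. The following is a minimal sketch that uses only the flags listed above; the helper names `translit_command` and `run_translit` are hypothetical and are not part of the package.

```python
import subprocess


def translit_command(lang, src, out_latin, out_ipa=None, arabic=False, log_level=None):
    """Hypothetical helper: build an argv list for the turkic-translit CLI."""
    if lang not in ("kk", "ky"):
        raise ValueError("lang must be 'kk' or 'ky'")
    cmd = ["turkic-translit", "--lang", lang, "--in", src, "--out_latin", out_latin]
    if out_ipa is not None:
        # IPA output is requested with --ipa plus a destination file.
        cmd += ["--ipa", "--out_ipa", out_ipa]
    if arabic:
        cmd.append("--arabic")
    if log_level is not None:
        cmd += ["--log-level", log_level]
    return cmd


def run_translit(**kwargs):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(translit_command(**kwargs), check=True)
```

Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.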

Logging

The central logging setup uses Rich for colour when available. Set TURKIC_LOG_LEVEL or pass --log-level to the CLI. It falls back to standard logging when Rich is absent.
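The behaviour described here (a level taken from the environment or the CLI, Rich when installed, plain logging otherwise) can be sketched as follows. This is an illustration only, not the package's actual setup code: the function name `configure_logging` and the rule that an explicit level overrides `TURKIC_LOG_LEVEL` are assumptions.

```python
import logging
import os


def configure_logging(level=None):
    """Hypothetical helper: configure root logging, preferring Rich if installed."""
    # Assumption: an explicit level (e.g. from --log-level) wins over the env var.
    name = (level or os.environ.get("TURKIC_LOG_LEVEL") or "info").upper()
    numeric = getattr(logging, name, None)
    if not isinstance(numeric, int):
        raise ValueError(f"unknown log level: {name!r}")

    try:
        from rich.logging import RichHandler  # optional dependency

        handler = RichHandler()
    except ImportError:
        handler = logging.StreamHandler()  # plain fallback when Rich is absent

    # force=True replaces any handlers installed by an earlier call.
    logging.basicConfig(level=numeric, handlers=[handler], force=True)
    return numeric
```

Importing Rich inside the `try` block keeps it a soft dependency: the module still loads on systems where Rich is not installed.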

Project Organization

The project is organized into the following directories:

  • src/turkic_translit/ - Core source code for the package
  • examples/ - Example scripts showing how to use the package
    • examples/web/ - Web interface for demonstrating transliteration features
  • data/ - Sample data files and language resources
  • docs/ - Documentation and reference materials
  • scripts/ - Utility scripts for development and release
    • scripts/release/ - Scripts for building and publishing packages
  • vendor/pyicu/ - Pre-built PyICU wheels for Windows
  • tests/ - Test suite for the package

FastText Language Identification Model

This package uses the FastText language identification model (lid.176.bin) for Russian token filtering and language detection. The model file is not included in the repository or pip package due to its large size.

Automatic Download:

  • When you use features that require language identification (such as Russian token filtering or the Gradio web demo), the package will automatically download lid.176.bin from the official Facebook AI public link if it is not already present.
  • The file will be saved in the package directory on first use.

No manual action is needed. This ensures compatibility with pip installs, Hugging Face Spaces, and other cloud environments.

If you need to download the model manually, you can do so from: https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
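For a manual or scripted download, a download-once pattern like the one below may be useful. It is a sketch under stated assumptions: `ensure_lid_model` and its `fetch` parameter are hypothetical names, the target directory is chosen by the caller, and the package's own download logic may differ.

```python
import urllib.request
from pathlib import Path

LID_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"


def ensure_lid_model(target_dir, fetch=urllib.request.urlretrieve):
    """Hypothetical helper: return the path to lid.176.bin, downloading it once."""
    path = Path(target_dir) / "lid.176.bin"
    if path.exists():
        return path  # already present, no network access needed
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".part")
    fetch(LID_URL, str(tmp))  # download under a temporary name first
    tmp.replace(path)  # rename so a partial download never looks complete
    return path
```

The `fetch` argument exists only so the function can be exercised without network access; by default it uses `urllib.request.urlretrieve` from the standard library.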

Using the Examples

Use the main entry point script to run examples:

python turkic_tools.py [command]

Available commands:

  • web - Launch the Gradio web interface for real-time transliteration
  • demo - Run the simple CLI demo
  • full-demo - Run the comprehensive demo with multiple languages
  • help - Display available commands

Tokenizer training example

turkic-build-spm --input corpora/kk_lat.txt,corpora/ky_lat.txt --model_prefix spm/turkic12k --vocab_size 12000
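The option names mirror those of SentencePiece's trainer. Assuming the script is a thin wrapper around it (not confirmed here), a roughly equivalent call through the `sentencepiece` Python package might look like this sketch. The helper names `spm_train_args` and `train_spm` are hypothetical.

```python
from pathlib import Path


def spm_train_args(inputs, model_prefix, vocab_size):
    """Hypothetical helper: build keyword arguments for SentencePieceTrainer.train."""
    return {
        "input": ",".join(inputs),  # SentencePiece accepts a comma-separated file list
        "model_prefix": model_prefix,
        "vocab_size": int(vocab_size),
    }


def train_spm(inputs, model_prefix, vocab_size):
    """Train a model; requires the third-party sentencepiece package."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependency

    # Create the output directory (e.g. spm/) if it does not exist yet.
    Path(model_prefix).parent.mkdir(parents=True, exist_ok=True)
    spm.SentencePieceTrainer.train(**spm_train_args(inputs, model_prefix, vocab_size))
```

For example, `train_spm(["corpora/kk_lat.txt", "corpora/ky_lat.txt"], "spm/turkic12k", 12000)` would write `spm/turkic12k.model` and `spm/turkic12k.vocab`.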

Filtering Russian tokens from Uzbek

cat uz_raw.txt | turkic-filter-russian --mode drop > uz_clean.txt
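Conceptually, a "drop" mode removes every token that the language-ID model labels as Russian. The sketch below illustrates that idea with a pluggable predictor; the function name `drop_russian`, the whitespace tokenisation, and the 0.5 confidence threshold are assumptions and may not match what the tool actually does.

```python
def drop_russian(line, predict, threshold=0.5):
    """Hypothetical 'drop' mode: remove tokens identified as Russian.

    `predict` takes a string and returns (labels, probabilities), the shape
    returned by a fastText model's predict() for a single input.
    """
    kept = []
    for token in line.split():
        labels, probs = predict(token)
        is_russian = labels[0] == "__label__ru" and probs[0] >= threshold
        if not is_russian:
            kept.append(token)
    return " ".join(kept)
```

With fastText installed and the model downloaded, `predict` could be `fasttext.load_model("lid.176.bin").predict`. Single-token predictions are noisy for short words, which is one reason to keep a confidence threshold.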

Developer checklist

black .
ruff check .
pytest -q

All code is UTF-8-only; on Windows a BOM is written when piping to files to avoid encoding issues.
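In Python, a UTF-8 byte-order mark is produced by the `utf-8-sig` codec. The sketch below shows one way to pick the encoding by platform when writing a file directly; the helper names are hypothetical, and the package's own handling of piped output may be implemented differently.

```python
import sys


def output_encoding(platform=sys.platform):
    """Hypothetical helper: choose the text encoding for written files."""
    # "utf-8-sig" makes Python emit the BOM bytes EF BB BF at the start of the file.
    return "utf-8-sig" if platform.startswith("win") else "utf-8"


def write_text(path, text, platform=sys.platform):
    """Write text as UTF-8, with a BOM only on Windows."""
    with open(path, "w", encoding=output_encoding(platform), newline="") as fh:
        fh.write(text)
```

Reading such files back with `encoding="utf-8-sig"` works whether or not the BOM is present.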

License

Apache-2.0

Type-checking

pip install mypy
mypy --strict .

The included mypy.ini restricts analysis to the src/ tree and skips build/, dist/, virtual-env and egg directories so duplicate-module errors do not occur even if you build wheels locally.
