Fast Unicode transliteration (including CJK), slugification, and text normalization — Rust-powered Python library

translit

Unicode text infrastructure for Python: transliteration, normalization, and safety analysis, powered by Rust.

Documentation | API Reference | PyPI

Features

All text processing is implemented in Rust with O(1) PHF lookups and exposed to Python via PyO3.

Installation

pip install translit-rs

The package installs as translit-rs on PyPI but imports as translit:

import translit  # not translit_rs

Requires Python 3.9+. Wheels are available for Linux, macOS, and Windows.
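
Because the distribution name (translit-rs) and the import name (translit) differ, it can help to check which one is actually installed. The sketch below uses only the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of translit.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of translit.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Look the package up by its PyPI distribution name, not its import name.
print(installed_version("translit-rs"))
```

The lookup takes the name used with pip; the import statement still uses translit.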

Quick start

from translit import transliterate, slugify, sanitize_filename

# Latin/Cyrillic/Greek
transliterate("café")          # → "cafe"
transliterate("Москва")        # → "Moskva"
transliterate("Ünïcödé")       # → "Unicode"

# Chinese (Hanzi → Pinyin)
transliterate("北京市")         # → "bei jing shi"
slugify("北京烤鸭")            # → "bei-jing-kao-ya"

# Korean (Hangul → Revised Romanization)
transliterate("서울")           # → "seo ul"
slugify("대한민국")            # → "dae-han-min-gug"

# Japanese (Hiragana/Katakana → Hepburn)
transliterate("ひらがな")       # → "hiragana"
transliterate("カタカナ")       # → "katakana"

# Language-specific transliteration
transliterate("Ärger", lang="de")  # → "Aerger"
transliterate("Київ", lang="uk")   # → "Kyiv"

# Auto-detect language from script
transliterate("Москва", lang="auto")  # → "Moskva" (detects Cyrillic → Russian)
transliterate("ภาษาไทย", lang="auto")  # → Thai transliteration (detects Thai)

# Reverse transliteration (Latin → native script)
transliterate("Moskva", target="ru")   # → "Москва"
transliterate("Athina", target="el")   # → "Αθηνα"

# Slugification
slugify("Hello World!")            # → "hello-world"
slugify("café au lait")           # → "cafe-au-lait"

# Filename sanitization
sanitize_filename("my file<>.txt")         # → "my_file.txt"
sanitize_filename("CON.txt")               # → "_CON.txt"
sanitize_filename("../../etc/passwd")      # → ".etc_passwd"
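
For readers who want to see the general shape of slugification, here is a minimal standard-library sketch. It is an assumption-laden illustration, not the library's Rust implementation: the name slugify_sketch is hypothetical, and unlike translit it simply drops any character that has no ASCII decomposition, so CJK or Cyrillic input would come out empty.

```python
import re
import unicodedata


def slugify_sketch(text: str, separator: str = "-") -> str:
    """Illustrative slugifier (hypothetical; not translit's algorithm)."""
    # Decompose accented letters, then drop everything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace each run of non-alphanumeric characters with one separator.
    slug = re.sub(r"[^a-z0-9]+", separator, ascii_text.lower())
    # Trim separators left over at either end.
    return slug.strip(separator)


print(slugify_sketch("Hello World!"))   # hello-world
print(slugify_sketch("café au lait"))   # cafe-au-lait
```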

CJK transliteration

Chinese characters are mapped to toneless pinyin from the Unicode Unihan kMandarin field, covering the full CJK Unified Ideographs block (U+4E00–U+9FFF, 20,924 characters). Korean Hangul syllables are algorithmically decomposed into jamo and romanized using the Revised Romanization standard (all 11,172 precomposed syllables). Japanese hiragana and katakana use Modified Hepburn; kanji fall back to Chinese pinyin readings.
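
The algorithmic Hangul step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's Rust code: the function names are hypothetical, the index arithmetic follows the Unicode Standard's Hangul syllable decomposition, and the final-consonant table uses one common reading of Revised Romanization that may not match the library's output for every syllable.

```python
# Hypothetical sketch of Hangul syllable decomposition and romanization.
S_BASE = 0xAC00                 # first precomposed syllable, U+AC00
V_COUNT, T_COUNT = 21, 28       # vowels, finals (including "no final")
N_COUNT = V_COUNT * T_COUNT     # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT          # 11,172 precomposed syllables

# Romanization tables in Unicode jamo order (assumed mappings).
LEADS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
         "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
TAILS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
         "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"]


def romanize_syllable(ch: str) -> str:
    """Romanize one precomposed syllable; return other characters unchanged."""
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        return ch
    lead, rest = divmod(index, N_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    return LEADS[lead] + VOWELS[vowel] + TAILS[tail]


def romanize_hangul(text: str) -> str:
    """Context-free, character-by-character romanization, space-separated."""
    return " ".join(romanize_syllable(ch) for ch in text)


print(romanize_hangul("서울"))  # seo ul
```

Because each syllable is handled in isolation, sound changes across syllable boundaries are not applied.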

This is context-free, character-by-character transliteration, the same approach as Unidecode. See docs/limitations.md for details on polyphony, phonological rules, and other trade-offs.

Precompiled pipelines

from translit import security_clean, ml_normalize, catalog_key, sanitize_user_input

# Security: NFKC → confusables → strip bidi → collapse whitespace
security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")  # → "Real text"

# ML/NLP: NFKC → emoji→text → transliterate → strip accents → fold case
ml_normalize("Café ☕ Ünïcödé")  # → "cafe hot beverage unicode"

# Library catalog: NFKC → transliterate → confusables → strip accents → fold case
catalog_key("Москва", lang="ru")  # → "moskva"
catalog_key("ΩMEGA  café")        # → "omega cafe"

# Web input: NFKC → strip zalgo → confusables → strip bidi → collapse whitespace
sanitize_user_input("p\u0430ypal")  # → "paypal" (homoglyph neutralized)
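
To make the ordering of the security pipeline concrete, here is a rough standard-library sketch of the same four stages. Everything in it is an assumption for illustration: security_clean_sketch is a hypothetical name, the confusables map holds only five Cyrillic letters (a real table is far larger), and the bidi control list is a hand-picked subset.

```python
import re
import unicodedata

# Tiny illustrative confusables map (Cyrillic look-alikes to Latin).
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
}

# Bidirectional control characters, mapped to None so translate() deletes them.
BIDI_CONTROLS = dict.fromkeys(
    map(ord, "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")
)


def security_clean_sketch(text: str) -> str:
    """NFKC, then confusables, then strip bidi, then collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(CONFUSABLES.get(ch, ch) for ch in text)
    text = text.translate(BIDI_CONTROLS)
    return re.sub(r"\s+", " ", text).strip()


print(security_clean_sketch("p\u0430ypal"))  # paypal
```

Running NFKC first means that compatibility forms such as double-struck letters are already folded to plain letters before the confusables lookup sees them.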

Text builder

from translit import Text

result = (
    Text("Ünïcödé Café ☕")
    .normalize("NFKC")
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# → "unicode cafe hot beverage"
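
The chaining style above can be approximated with a small immutable wrapper. This is a sketch of the pattern only, assuming nothing about how Text is implemented: TextSketch is a hypothetical class built on the standard library, and it omits transliterate() because that step needs the library's tables.

```python
import unicodedata


class TextSketch:
    """Hypothetical fluent wrapper; each step returns a new instance."""

    def __init__(self, value: str) -> None:
        self.value = value

    def normalize(self, form: str) -> "TextSketch":
        return TextSketch(unicodedata.normalize(form, self.value))

    def strip_accents(self) -> "TextSketch":
        # Decompose, then drop combining marks such as accents and diaereses.
        decomposed = unicodedata.normalize("NFKD", self.value)
        kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return TextSketch(kept)

    def fold_case(self) -> "TextSketch":
        return TextSketch(self.value.casefold())


result = (
    TextSketch("Ünïcödé Café")
    .normalize("NFKC")
    .strip_accents()
    .fold_case()
    .value
)
print(result)  # unicode cafe
```

Returning a new object from every method keeps intermediate values reusable and makes the order of steps explicit at the call site.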

Package structure

The API is organized into domain-specific namespaces. All functions are also available at the top level for convenience.

Namespace | Purpose | Key functions
translit | Core transforms | transliterate, slugify, Text, TextPipeline
translit.normalization | Unicode normalization | normalize, strip_accents, fold_case, collapse_whitespace
translit.security | Safety analysis | is_confusable, is_mixed_script, is_safe_hostname, security_clean
translit.files | Filename handling | sanitize_filename
translit.codec | Byte decoding | decode_to_utf8, detect_encoding

# Namespace imports
from translit.security import is_confusable, security_clean
from translit.codec import decode_to_utf8
from translit.normalization import fold_case

# Top-level imports also work
from translit import is_confusable, security_clean, decode_to_utf8, fold_case

Script policies

Transliteration applies different policies depending on the script. This table documents what each script does and which standard it follows.

Script | Policy | Standard / Source | Example
Latin (accented) | Accent stripping | Unicode NFKD decomposition | é → e
Cyrillic | Phonetic romanization | BGN/PCGN (default), ISO 9:1995 (strict_iso9=True), GOST R 7.0.34 (gost7034=True) | Москва → Moskva
Greek | Transliteration | BGN/PCGN romanization | Αθήνα → Athena
Chinese (Hanzi) | Romanization | Unihan kMandarin (toneless pinyin) | 北京 → bei jing
Korean (Hangul) | Romanization | Revised Romanization of Korean | 서울 → seo ul
Japanese (Kana) | Romanization | Modified Hepburn | ひらがな → hiragana
Japanese (Kanji) | Romanization | Falls back to Chinese pinyin readings | 東京 → dong jing
Arabic | Transliteration | Buckwalter-derived | مرحبا → mrhba
Hebrew | Transliteration | Common Israeli | שלום → shlvm
Devanagari | Transliteration | UNGEGN/IAST-derived | नमस्ते → namaste
Bengali | Transliteration | UNGEGN-derived | কলকাতা → kalakata
Tamil | Transliteration | UNGEGN-derived | தமிழ் → tamizh
Telugu | Transliteration | UNGEGN-derived | తెలుగు → telugu
Gujarati | Transliteration | UNGEGN-derived | ગુજરાતી → gujarati
Kannada | Transliteration | UNGEGN-derived | ಕನ್ನಡ → kannada
Malayalam | Transliteration | UNGEGN-derived | മലയാളം → malayalam
Odia | Transliteration | UNGEGN-derived | ଓଡିଆ → odia
Sinhala | Transliteration | UNGEGN-derived | සිංහල → simhala
Gurmukhi | Transliteration | UNGEGN-derived | ਪੰਜਾਬੀ → panjabi
Thai | Transliteration | RTGS-derived | สวัสดี → sawatdi
Lao | Transliteration | BGN/PCGN-derived | ລາວ → lao
Georgian | Transliteration | National romanization | თბილისი → tbilisi
Armenian | Transliteration | BGN/PCGN | Երևան → Eryevan

All transliteration is context-free and character-by-character, the same approach as AnyAscii/Unidecode. No linguistic analysis, polyphony handling, or phonological rules. See docs/limitations.md for trade-offs.

Language-specific profiles (e.g., lang="de") apply sparse overrides on top of the default table. For example, German maps ü → ue instead of the default ü → u.
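
The override mechanism can be pictured as a two-level dictionary lookup. The sketch below is an illustration only: the table contents and the name transliterate_sketch are assumptions, and the real tables are compiled into the Rust core rather than held in Python dicts.

```python
# Hypothetical default table: a handful of Latin letters for illustration.
DEFAULT_TABLE = {
    "Ä": "A", "ä": "a",
    "Ö": "O", "ö": "o",
    "Ü": "U", "ü": "u",
    "ß": "ss",
}

# Sparse per-language overrides: only the entries that differ from the default.
LANG_OVERRIDES = {
    "de": {"Ä": "Ae", "ä": "ae", "Ö": "Oe", "ö": "oe", "Ü": "Ue", "ü": "ue"},
}


def transliterate_sketch(text: str, lang: str = "") -> str:
    """Look up each character in the language overrides, then the default."""
    overrides = LANG_OVERRIDES.get(lang, {})
    out = []
    for ch in text:
        if ch in overrides:
            out.append(overrides[ch])
        else:
            # Characters missing from both tables pass through unchanged.
            out.append(DEFAULT_TABLE.get(ch, ch))
    return "".join(out)


print(transliterate_sketch("Ärger"))        # Arger
print(transliterate_sketch("Ärger", "de"))  # Aerger
```

Keeping the overrides sparse means a profile only lists what it changes; ß is absent from the German overrides here, so it falls through to the default entry.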

Language profiles

translit ships 65 built-in language profiles, including ISO 9:1995 scholarly Cyrillic support and 10 Indic scripts:

from translit import list_langs, transliterate

print(list_langs())
# ['am', 'ar', 'as', 'bg', 'bn', 'bo', 'ca', 'cs', 'cy', 'da', 'de', 'dv', 'el',
#  'es', 'et', 'fa', 'fi', 'fr', 'ga', 'gu', 'he', 'hi', 'hr', 'hu', 'hy',
#  'is', 'it', 'ja', 'jv', 'ka', 'km', 'kn', 'ko', 'lo', 'lt', 'lv', 'ml', 'mn',
#  'mr', 'mt', 'my', 'ne', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sa',
#  'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tr', 'uk', 'vi', 'zh']

# ISO 9:1995 scholarly transliteration
transliterate("Юрий", strict_iso9=True)  # → "Jurij"

Performance

translit is compiled Rust with O(1) compile-time perfect hash tables — no regex, no per-character Python iteration, no runtime data loading.

Operation | Throughput | vs. legacy
Transliterate (Latin) | 450M chars/sec | 38× faster than Unidecode
Transliterate (Cyrillic) | 130M chars/sec | 18× faster than Unidecode
Slugify | 849K slugs/sec | 10–24× faster than python-slugify
Batch transliterate (100 strings) | - | 2.8× faster than loop

See docs/performance.md for full benchmark methodology and results.
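
To reproduce a chars/sec figure locally, a small timeit harness is usually enough. The sketch below is not the project's benchmark methodology (that lives in docs/performance.md); chars_per_second is a hypothetical helper, and str.lower stands in for whichever function is being measured.

```python
import timeit
from typing import Callable


def chars_per_second(
    func: Callable[[str], str],
    sample: str,
    repeats: int = 5,
    number: int = 1000,
) -> float:
    """Estimate throughput of func on sample, using the best of several runs."""
    timings = timeit.repeat(lambda: func(sample), repeat=repeats, number=number)
    best = min(timings)  # the fastest run is the least disturbed by noise
    return len(sample) * number / best


# Stand-in workload; swap in the function under test, e.g. transliterate.
rate = chars_per_second(str.lower, "Hello World " * 100)
print(f"{rate:,.0f} chars/sec")
```

Taking the minimum over several repeats is the approach the timeit documentation recommends, since slower runs mostly reflect interference from other processes.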

Drop-in replacement

translit provides compatibility aliases for painless migration from existing libraries:

from translit import unidecode, casefold, remove_accents

unidecode("café")        # → "cafe"       (alias for transliterate)
casefold("Straße")       # → "strasse"    (alias for fold_case)
remove_accents("café")   # → "cafe"       (alias for strip_accents)

sanitize_filename() also accepts replacement_text and max_len kwargs for pathvalidate compatibility, and is_confusable() accepts greedy for confusable_homoglyphs compatibility. See migration guides for details.

Documentation

Guides by role:

Architecture

Rust core with compile-time PHF (perfect hash function) tables for O(1) per-character lookup. Exposed to Python via PyO3 with the stable ABI (abi3-py39). The Chinese pinyin table contains 20,924 entries from the Unicode Unihan database; Korean romanization is purely algorithmic (jamo decomposition, ~100 lines of Rust).

Links

Source code https://github.com/raeq/translit
Releases https://github.com/raeq/translit/releases
PyPI package https://pypi.org/project/translit-rs/
Documentation https://translit.readthedocs.io/
Issue tracker https://github.com/raeq/translit/issues
Changelog https://github.com/raeq/translit/blob/main/CHANGELOG.md

License

MIT
