Fast Unicode transliteration (including CJK), slugification, and text normalization — Rust-powered Python library

translit

Unicode text infrastructure for Python: transliteration, normalization, and safety analysis, powered by Rust.

Features

All text processing is implemented in Rust with O(1) PHF lookups and exposed to Python via PyO3.

Installation

pip install translit-rs

The package installs as translit-rs on PyPI but imports as translit:

import translit  # not translit_rs

Requires Python 3.9+. Wheels are available for Linux, macOS, and Windows.
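
Because the distribution name and the import name differ, a version lookup needs the PyPI name rather than the module name. The following is a minimal standard-library sketch; the helper name installed_version is hypothetical and not part of translit.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "translit-rs") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        # metadata.version() expects the distribution (PyPI) name,
        # i.e. "translit-rs", not the import name "translit".
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version())
```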

Quick start

from translit import transliterate, slugify, sanitize_filename

# Latin/Cyrillic/Greek
transliterate("café")          # → "cafe"
transliterate("Москва")        # → "Moskva"
transliterate("Ünïcödé")       # → "Unicode"

# Chinese (Hanzi → Pinyin)
transliterate("北京市")         # → "bei jing shi"
slugify("北京烤鸭")            # → "bei-jing-kao-ya"

# Korean (Hangul → Revised Romanization)
transliterate("서울")           # → "seo ul"
slugify("대한민국")            # → "dae-han-min-gug"

# Japanese (Hiragana/Katakana → Hepburn)
transliterate("ひらがな")       # → "hiragana"
transliterate("カタカナ")       # → "katakana"

# Language-specific transliteration
transliterate("Ärger", lang="de")  # → "Aerger"
transliterate("Київ", lang="uk")   # → "Kyiv"

# Auto-detect language from script
transliterate("Москва", lang="auto")  # → "Moskva" (detects Cyrillic → Russian)
transliterate("ภาษาไทย", lang="auto")  # → Thai transliteration (detects Thai)

# Reverse transliteration (Latin → native script)
transliterate("Moskva", target="ru")   # → "Москва"
transliterate("Athina", target="el")   # → "Αθηνα"

# Slugification
slugify("Hello World!")            # → "hello-world"
slugify("café au lait")           # → "cafe-au-lait"

# Filename sanitization
sanitize_filename("my file<>.txt")         # → "my_file.txt"
sanitize_filename("CON.txt")               # → "_CON.txt"
sanitize_filename("../../etc/passwd")      # → ".etc_passwd"

CJK transliteration

Chinese characters are mapped to toneless pinyin from the Unicode Unihan kMandarin field, covering the full CJK Unified Ideographs block (U+4E00–U+9FFF, 20,924 characters). Korean Hangul syllables are algorithmically decomposed into jamo and romanized using the Revised Romanization standard (all 11,172 precomposed syllables). Japanese hiragana and katakana use Modified Hepburn; kanji fall back to Chinese pinyin readings.

This is context-free, character-by-character transliteration, the same approach as Unidecode. See docs/limitations.md for details on polyphony, phonological rules, and other trade-offs.
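
To make the Hangul step concrete, here is a minimal pure-Python sketch of algorithmic jamo decomposition. The index arithmetic follows the Unicode Hangul syllable layout; the romanization tables, the letter-by-letter spelling of final consonants (chosen so that the output matches the "gug" shown in the quick start), and the name romanize_hangul are assumptions made for illustration, not translit's Rust implementation.

```python
# Precomposed Hangul syllables start at U+AC00 and are laid out as
# (lead * 21 + vowel) * 28 + tail.
S_BASE = 0xAC00
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11,172 syllables

LEADS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
         "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
# Index 0 means "no final consonant". Context-free spellings (assumption).
TAILS = ["", "g", "kk", "gs", "n", "nj", "nh", "d", "l", "lg", "lm", "lb",
         "ls", "lt", "lp", "lh", "m", "b", "bs", "s", "ss", "ng", "j", "ch",
         "k", "t", "p", "h"]


def romanize_hangul(text: str) -> str:
    """Romanize Hangul syllables one at a time; pass other characters through."""
    out = []
    prev_was_syllable = False
    for ch in text:
        index = ord(ch) - S_BASE
        if 0 <= index < S_COUNT:
            lead, rest = divmod(index, V_COUNT * T_COUNT)
            vowel, tail = divmod(rest, T_COUNT)
            if prev_was_syllable:
                out.append(" ")  # separate adjacent syllables with a space
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
            prev_was_syllable = True
        else:
            out.append(ch)
            prev_was_syllable = False
    return "".join(out)


print(romanize_hangul("서울"))      # seo ul
print(romanize_hangul("대한민국"))  # dae han min gug
```

Because each syllable is handled in isolation, sound changes that cross syllable boundaries are not applied, which is the trade-off described above.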

Precompiled pipelines

from translit import security_clean, ml_normalize, catalog_key, sanitize_user_input

# Security: NFKC → confusables → strip bidi → collapse whitespace
security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")  # → "Real text"

# ML/NLP: NFKC → emoji→text → transliterate → strip accents → fold case
ml_normalize("Café ☕ Ünïcödé")  # → "cafe hot beverage unicode"

# Library catalog: NFKC → transliterate → confusables → strip accents → fold case
catalog_key("Москва", lang="ru")  # → "moskva"
catalog_key("ΩMEGA  café")        # → "omega cafe"

# Web input: NFKC → strip zalgo → confusables → strip bidi → collapse whitespace
sanitize_user_input("p\u0430ypal")  # → "paypal" (homoglyph neutralized)
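
As a rough illustration of how such a pipeline composes, the sketch below chains three of the listed steps (NFKC, bidi stripping, whitespace collapsing) using only the standard library. It deliberately omits the confusables step, which needs the Unicode confusables data that Python does not ship; the function name clean_sketch and the exact set of bidi controls are assumptions, not translit's implementation.

```python
import unicodedata

# Explicit bidirectional controls: Arabic letter mark, LRM/RLM,
# embeddings and overrides, and isolates.
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def clean_sketch(text: str) -> str:
    """NFKC-normalize, drop bidi controls, then collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch not in BIDI_CONTROLS)
    # str.split() with no arguments splits on any run of Unicode whitespace.
    return " ".join(text.split())


print(clean_sketch("ℝ𝕖𝕒𝕝   𝕥𝕖𝕩𝕥"))            # Real text
print(clean_sketch("abc\u202edef"))             # abcdef
print(clean_sketch("p\u0430ypal") == "paypal")  # False
```

The last call shows why a confusables step is needed in addition to NFKC: Cyrillic а (U+0430) has no compatibility mapping to Latin a, so normalization alone leaves the homoglyph in place.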

Text builder

from translit import Text

result = (
    Text("Ünïcödé Café ☕")
    .normalize("NFKC")
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# → "unicode cafe hot beverage"
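
The chaining style can be sketched in a few lines of plain Python: each method returns a new object, so steps compose left to right and the final string is read from .value. MiniText below is a hypothetical stand-in built on unicodedata; it has no transliterate step and is not translit's Text class.

```python
import unicodedata


class MiniText:
    """Hypothetical chainable text wrapper; every step returns a new instance."""

    def __init__(self, value: str) -> None:
        self.value = value

    def normalize(self, form: str = "NFKC") -> "MiniText":
        return MiniText(unicodedata.normalize(form, self.value))

    def strip_accents(self) -> "MiniText":
        # Decompose, drop combining marks, then recompose what is left.
        decomposed = unicodedata.normalize("NFD", self.value)
        kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return MiniText(unicodedata.normalize("NFC", kept))

    def fold_case(self) -> "MiniText":
        return MiniText(self.value.casefold())


result = MiniText("Ünïcödé Café").normalize("NFKC").strip_accents().fold_case().value
print(result)  # unicode cafe
```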

Package structure

The API is organized into domain-specific namespaces. All functions are also available at the top level for convenience.

Namespace | Purpose | Key functions
translit | Core transforms | transliterate, slugify, Text, TextPipeline
translit.normalization | Unicode normalization | normalize, strip_accents, fold_case, collapse_whitespace
translit.security | Safety analysis | is_confusable, is_mixed_script, is_safe_hostname, security_clean
translit.files | Filename handling | sanitize_filename
translit.codec | Byte decoding | decode_to_utf8, detect_encoding

# Namespace imports
from translit.security import is_confusable, security_clean
from translit.codec import decode_to_utf8
from translit.normalization import fold_case

# Top-level imports also work
from translit import is_confusable, security_clean, decode_to_utf8, fold_case

Script policies

Transliteration applies different policies depending on the script. This table documents what each script does and which standard it follows.

Script | Policy | Standard / Source | Example
Latin (accented) | Accent stripping | Unicode NFKD decomposition | é → e
Cyrillic | Phonetic romanization | BGN/PCGN (default), ISO 9:1995 (strict_iso9=True), GOST R 7.0.34 (gost7034=True) | Москва → Moskva
Greek | Transliteration | BGN/PCGN romanization | Αθήνα → Athena
Chinese (Hanzi) | Romanization | Unihan kMandarin (toneless pinyin) | 北京 → bei jing
Korean (Hangul) | Romanization | Revised Romanization of Korean | 서울 → seo ul
Japanese (Kana) | Romanization | Modified Hepburn | ひらがな → hiragana
Japanese (Kanji) | Romanization | Falls back to Chinese pinyin readings | 東京 → dong jing
Arabic | Transliteration | Buckwalter-derived | مرحبا → mrhba
Hebrew | Transliteration | Common Israeli | שלום → shlvm
Devanagari | Transliteration | UNGEGN/IAST-derived | नमस्ते → namaste
Bengali | Transliteration | UNGEGN-derived | কলকাতা → kalakata
Tamil | Transliteration | UNGEGN-derived | தமிழ் → tamizh
Telugu | Transliteration | UNGEGN-derived | తెలుగు → telugu
Gujarati | Transliteration | UNGEGN-derived | ગુજરાતી → gujarati
Kannada | Transliteration | UNGEGN-derived | ಕನ್ನಡ → kannada
Malayalam | Transliteration | UNGEGN-derived | മലയാളം → malayalam
Odia | Transliteration | UNGEGN-derived | ଓଡିଆ → odia
Sinhala | Transliteration | UNGEGN-derived | සිංහල → simhala
Gurmukhi | Transliteration | UNGEGN-derived | ਪੰਜਾਬੀ → panjabi
Thai | Transliteration | RTGS-derived | สวัสดี → sawatdi
Lao | Transliteration | BGN/PCGN-derived | ລາວ → lao
Georgian | Transliteration | National romanization | თბილისი → tbilisi
Armenian | Transliteration | BGN/PCGN | Երևան → Eryevan

All transliteration is context-free and character-by-character, the same approach as AnyAscii/Unidecode. It performs no linguistic analysis, polyphony handling, or phonological rule application. See docs/limitations.md for trade-offs.

Language-specific profiles (e.g., lang="de") apply sparse overrides on top of the default table. For example, German maps ü → ue instead of the default u.
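
The override idea can be sketched with a small dictionary layered over a default mapping. In the sketch below the default is plain NFKD accent stripping (the policy listed for accented Latin), and the German table, the OVERRIDES name, and transliterate_sketch are illustrative assumptions rather than translit's actual tables or API.

```python
import unicodedata
from typing import Optional

# Sparse per-language overrides: only characters that differ from the default.
OVERRIDES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"},
}


def default_map(ch: str) -> str:
    """Default policy for accented Latin: NFKD-decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate_sketch(text: str, lang: Optional[str] = None) -> str:
    overrides = OVERRIDES.get(lang, {})
    # Compose first so "A" + combining diaeresis matches the precomposed key.
    text = unicodedata.normalize("NFC", text)
    # The override wins when present; every other character uses the default.
    return "".join(overrides.get(ch, default_map(ch)) for ch in text)


print(transliterate_sketch("Ärger"))             # Arger
print(transliterate_sketch("Ärger", lang="de"))  # Aerger
```

Keeping the profile sparse means a language only has to list its exceptions; characters it does not mention, such as é in a German text, still go through the default path.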

Language profiles

65 built-in language profiles with ISO 9:1995 scholarly Cyrillic support and 10 Indic scripts:

from translit import list_langs, transliterate

print(list_langs())
# ['am', 'ar', 'as', 'bg', 'bn', 'bo', 'ca', 'cs', 'cy', 'da', 'de', 'dv', 'el',
#  'es', 'et', 'fa', 'fi', 'fr', 'ga', 'gu', 'he', 'hi', 'hr', 'hu', 'hy',
#  'is', 'it', 'ja', 'jv', 'ka', 'km', 'kn', 'ko', 'lo', 'lt', 'lv', 'ml', 'mn',
#  'mr', 'mt', 'my', 'ne', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sa',
#  'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tr', 'uk', 'vi', 'zh']

# ISO 9:1995 scholarly transliteration
transliterate("Юрий", strict_iso9=True)  # → "Jurij"

Performance

translit is compiled Rust with O(1) compile-time perfect hash tables — no regex, no per-character Python iteration, no runtime data loading.

Operation | Throughput | vs. legacy
Transliterate (Latin) | 450M chars/sec | 38× faster than Unidecode
Transliterate (Cyrillic) | 130M chars/sec | 18× faster than Unidecode
Slugify | 849K slugs/sec | 10–24× faster than python-slugify
Batch transliterate (100 strings) | | 2.8× faster than loop

See docs/performance.md for full benchmark methodology and results.
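
Throughput depends heavily on hardware and input, so it is worth measuring on your own data. The helper below is a rough sketch for estimating characters per second of any str-to-str function; it is not the methodology from docs/performance.md, and the name chars_per_second is hypothetical.

```python
import time
from typing import Callable


def chars_per_second(
    func: Callable[[str], str], sample: str, repeats: int = 1000
) -> float:
    """Best-of-three throughput estimate for func(sample), in chars/sec."""
    best = float("inf")
    for _ in range(3):
        start = time.perf_counter()
        for _ in range(repeats):
            func(sample)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse clocks.
    return len(sample) * repeats / max(best, 1e-12)


# Example with a built-in; pass any str -> str callable to compare libraries.
rate = chars_per_second(str.upper, "café au lait " * 100, repeats=200)
print(f"{rate:,.0f} chars/sec")
```

Passing two different callables over the same sample gives a like-for-like ratio, which is usually more informative than either absolute number.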

Drop-in replacement

translit provides compatibility aliases for painless migration from existing libraries:

from translit import unidecode, casefold, remove_accents

unidecode("café")        # → "cafe"       (alias for transliterate)
casefold("Straße")       # → "strasse"    (alias for fold_case)
remove_accents("café")   # → "cafe"       (alias for strip_accents)

sanitize_filename() also accepts replacement_text and max_len kwargs for pathvalidate compatibility, and is_confusable() accepts greedy for confusable_homoglyphs compatibility. See migration guides for details.

Architecture

Rust core with compile-time PHF (perfect hash function) tables for O(1) per-character lookup. Exposed to Python via PyO3 with the stable ABI (abi3-py39). The Chinese pinyin table contains 20,924 entries from the Unicode Unihan database; Korean romanization is purely algorithmic (jamo decomposition, ~100 lines of Rust).

Links

Source code https://github.com/raeq/translit
Releases https://github.com/raeq/translit/releases
PyPI package https://pypi.org/project/translit-rs/
Documentation https://translit.readthedocs.io/
Issue tracker https://github.com/raeq/translit/issues
Changelog https://github.com/raeq/translit/blob/main/CHANGELOG.md

License

MIT
