Fast Unicode transliteration (including CJK), slugification, and text normalization — Rust-powered Python library

translit

Unicode text infrastructure for Python: transliteration, normalization, and safety analysis, powered by Rust.

Documentation | API Reference | PyPI

Demo

Try translit in your browser

Features

All text processing is implemented in Rust with O(1) PHF lookups and exposed to Python via PyO3.

Installation

pip install translit-rs

The package installs as translit-rs on PyPI but imports as translit:

import translit  # not translit_rs

Requires Python 3.9+. Wheels are available for Linux, macOS, and Windows.
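
Because the distribution name (translit-rs) and the import name (translit) differ, a quick post-install check can be useful. The following is a minimal sketch using only the standard library; installed_version is a hypothetical helper name and is not part of translit.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Package metadata is looked up by the PyPI distribution name, not the import name.
print(installed_version("translit-rs"))
```

The call prints None when the distribution is not installed, so it is safe to run before or after pip install.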

Quick start

from translit import transliterate, slugify, sanitize_filename

# Latin/Cyrillic/Greek
transliterate("café")          # → "cafe"
transliterate("Москва")        # → "Moskva"
transliterate("Ünïcödé")       # → "Unicode"

# Chinese (Hanzi → Pinyin)
transliterate("北京市")         # → "bei jing shi"
slugify("北京烤鸭")            # → "bei-jing-kao-ya"

# Korean (Hangul → Revised Romanization)
transliterate("서울")           # → "seo ul"
slugify("대한민국")            # → "dae-han-min-gug"

# Japanese (Hiragana/Katakana → Hepburn)
transliterate("ひらがな")       # → "hiragana"
transliterate("カタカナ")       # → "katakana"

# Language-specific transliteration
transliterate("Ärger", lang="de")  # → "Aerger"
transliterate("Київ", lang="uk")   # → "Kyiv"

# Auto-detect language from script
transliterate("Москва", lang="auto")  # → "Moskva" (detects Cyrillic → Russian)
transliterate("ภาษาไทย", lang="auto")  # → Thai transliteration (detects Thai)

# Reverse transliteration (Latin → native script)
transliterate("Moskva", target="ru")   # → "Москва"
transliterate("Athina", target="el")   # → "Αθηνα"

# Slugification
slugify("Hello World!")            # → "hello-world"
slugify("café au lait")           # → "cafe-au-lait"

# Filename sanitization
sanitize_filename("my file<>.txt")         # → "my_file.txt"
sanitize_filename("CON.txt")               # → "_CON.txt"
sanitize_filename("../../etc/passwd")      # → ".etc_passwd"

CJK transliteration

Chinese characters are mapped to toneless pinyin from the Unicode Unihan kMandarin field, covering the CJK Unified Ideographs block (U+4E00–U+9FFF; 20,992 code points, of which the pinyin table holds 20,924 entries). Korean Hangul syllables are algorithmically decomposed into jamo and romanized using the Revised Romanization standard (all 11,172 precomposed syllables). Japanese hiragana and katakana use Modified Hepburn; kanji fall back to Chinese pinyin readings.

This is context-free, character-by-character transliteration, the same approach as Unidecode. See docs/limitations.md for details on polyphony, phonological rules, and other trade-offs.
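
The algorithmic Hangul decomposition mentioned above relies on plain arithmetic over the syllable's code point. The sketch below illustrates that arithmetic in pure Python, using the constants from the Unicode Standard; it is not translit's Rust implementation, and decompose_hangul is a hypothetical name. It stops at the jamo level and omits the romanization tables.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 19 * 21 * 28 = 11,172 syllables


def decompose_hangul(syllable: str) -> str:
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if tail:  # tail index 0 means "no final consonant"
        jamo += chr(T_BASE + tail)
    return jamo


# Cross-check against the standard library's canonical decomposition.
for ch in "서울":
    assert decompose_hangul(ch) == unicodedata.normalize("NFD", ch)
    print(ch, [f"U+{ord(j):04X}" for j in decompose_hangul(ch)])
```

A romanizer would then map the lead, vowel, and tail indices to Latin strings through three small lookup tables, which is why no per-syllable table is needed.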

Precompiled pipelines

from translit import security_clean, ml_normalize, catalog_key, sanitize_user_input, strip_obfuscation

# Security: NFKC → confusables → strip bidi → collapse whitespace
security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")  # → "Real text"

# ML/NLP: NFKC → emoji→text → transliterate → strip accents → fold case
ml_normalize("Café ☕ Ünïcödé")  # → "cafe hot beverage unicode"

# Library catalog: NFKC → transliterate → confusables → strip accents → fold case
catalog_key("Москва", lang="ru")  # → "moskva"
catalog_key("ΩMEGA  café")        # → "omega cafe"

# Web input: NFKC → strip zalgo → confusables → strip bidi → collapse whitespace
sanitize_user_input("p\u0430ypal")  # → "paypal" (homoglyph neutralized)

# Maximum deobfuscation: homoglyphs, zalgo, invisible chars → clean text
strip_obfuscation("\u0440rodu\u0441t")       # → "product" (Cyrillic р→p, с→c via TR39)
strip_obfuscation("p\u0430yp\u0430l 🔥🔥")  # → "paypal fire fire"
# Note: does NOT transliterate — chain with transliterate() if needed
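
To make the stage order of a pipeline such as security_clean more concrete, here is a rough standard-library approximation of three of its stages (NFKC, bidi stripping, whitespace collapsing). It is a sketch under assumptions: the confusables stage is left out because it needs the TR39 data tables, the exact set of bidi controls translit strips is assumed, and security_clean_sketch is a hypothetical name.

```python
import re
import unicodedata

# Bidirectional formatting controls: ALM, LRM/RLM, embeddings/overrides, isolates.
# Assumption: this set is illustrative and may differ from what translit strips.
_BIDI_CONTROLS = re.compile("[\u061c\u200e\u200f\u202a-\u202e\u2066-\u2069]")
_WHITESPACE_RUN = re.compile(r"\s+")


def security_clean_sketch(text: str) -> str:
    """NFKC -> strip bidi controls -> collapse whitespace (no confusables step)."""
    text = unicodedata.normalize("NFKC", text)
    text = _BIDI_CONTROLS.sub("", text)
    return _WHITESPACE_RUN.sub(" ", text).strip()


print(security_clean_sketch("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥"))  # Real text
print(security_clean_sketch("a\u202eb \t c"))   # ab c
```

Running NFKC first matters: it folds compatibility characters such as the double-struck letters before any later stage inspects the text.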

Text builder

from translit import Text

result = (
    Text("Ünïcödé Café ☕")
    .normalize("NFKC")
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# → "unicode cafe hot beverage"
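
The chaining style above can be pictured as an immutable wrapper whose methods each return a new wrapper. The sketch below is a hypothetical stand-in built on the standard library only; MiniText is not translit's Text class, and it omits the transliterate step.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniText:
    """Hypothetical chainable text wrapper; not translit's Text."""

    value: str

    def normalize(self, form: str) -> "MiniText":
        return MiniText(unicodedata.normalize(form, self.value))

    def strip_accents(self) -> "MiniText":
        # Decompose, then drop combining marks (general category Mn).
        decomposed = unicodedata.normalize("NFD", self.value)
        kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
        return MiniText(unicodedata.normalize("NFC", kept))

    def fold_case(self) -> "MiniText":
        return MiniText(self.value.casefold())


result = MiniText("Ünïcödé Café").normalize("NFKC").strip_accents().fold_case().value
print(result)  # unicode cafe
```

Because every step returns a fresh object, intermediate results can be kept and reused without one chain affecting another.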

Package structure

The API is organized into domain-specific namespaces. All functions are also available at the top level for convenience.

| Namespace | Purpose | Key functions |
| --- | --- | --- |
| translit | Core transforms | transliterate, slugify, Text, TextPipeline |
| translit.normalization | Unicode normalization | normalize, strip_accents, fold_case, collapse_whitespace |
| translit.security | Safety analysis | is_confusable, is_mixed_script, is_safe_hostname, security_clean |
| translit.files | Filename handling | sanitize_filename |
| translit.codec | Byte decoding | decode_to_utf8, detect_encoding |

# Namespace imports
from translit.security import is_confusable, security_clean
from translit.codec import decode_to_utf8
from translit.normalization import fold_case

# Top-level imports also work
from translit import is_confusable, security_clean, decode_to_utf8, fold_case

Script policies

Transliteration applies different policies depending on the script. This table documents what each script does and which standard it follows.

| Script | Policy | Standard / Source | Example |
| --- | --- | --- | --- |
| Latin (accented) | Accent stripping | Unicode NFKD decomposition | é → e |
| Cyrillic | Phonetic romanization | BGN/PCGN (default), ISO 9:1995 (strict_iso9=True), GOST R 7.0.34 (gost7034=True) | Москва → Moskva |
| Greek | Transliteration | BGN/PCGN romanization | Αθήνα → Athena |
| Chinese (Hanzi) | Romanization | Unihan kMandarin (toneless pinyin) | 北京 → bei jing |
| Korean (Hangul) | Romanization | Revised Romanization of Korean | 서울 → seo ul |
| Japanese (Kana) | Romanization | Modified Hepburn | ひらがな → hiragana |
| Japanese (Kanji) | Romanization | Falls back to Chinese pinyin readings | 東京 → dong jing |
| Arabic | Transliteration | Buckwalter-derived | مرحبا → mrhba |
| Hebrew | Transliteration | Common Israeli | שלום → shlvm |
| Devanagari | Transliteration | UNGEGN/IAST-derived | नमस्ते → namaste |
| Bengali | Transliteration | UNGEGN-derived | কলকাতা → kalakata |
| Tamil | Transliteration | UNGEGN-derived | தமிழ் → tamizh |
| Telugu | Transliteration | UNGEGN-derived | తెలుగు → telugu |
| Gujarati | Transliteration | UNGEGN-derived | ગુજરાતી → gujarati |
| Kannada | Transliteration | UNGEGN-derived | ಕನ್ನಡ → kannada |
| Malayalam | Transliteration | UNGEGN-derived | മലയാളം → malayalam |
| Odia | Transliteration | UNGEGN-derived | ଓଡିଆ → odia |
| Sinhala | Transliteration | UNGEGN-derived | සිංහල → simhala |
| Gurmukhi | Transliteration | UNGEGN-derived | ਪੰਜਾਬੀ → panjabi |
| Thai | Transliteration | RTGS-derived | สวัสดี → sawatdi |
| Lao | Transliteration | BGN/PCGN-derived | ລາວ → lao |
| Georgian | Transliteration | National romanization | თბილისი → tbilisi |
| Armenian | Transliteration | BGN/PCGN | Երևան → Eryevan |

All transliteration is context-free and character-by-character, the same approach as AnyAscii/Unidecode. No linguistic analysis, polyphony handling, or phonological rules. See docs/limitations.md for trade-offs.

Language-specific profiles (e.g., lang="de") apply sparse overrides on top of the default table. For example, German maps ü → ue instead of the default u.
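
The idea of sparse overrides layered on a default table can be sketched with two dictionaries. This is an illustration only: the tables below are tiny hand-written samples, and DEFAULT_TABLE, GERMAN_OVERRIDES, and transliterate_sketch are hypothetical names rather than translit internals.

```python
from typing import Mapping, Optional

# Tiny illustrative tables; the real tables are far larger and live in Rust.
DEFAULT_TABLE = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U", "ß": "ss"}
GERMAN_OVERRIDES = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def transliterate_sketch(text: str, overrides: Optional[Mapping[str, str]] = None) -> str:
    """Context-free lookup: check the sparse profile first, then the default table."""
    overrides = overrides or {}
    out = []
    for ch in text:
        if ch in overrides:
            out.append(overrides[ch])
        else:
            # Characters missing from both tables pass through unchanged.
            out.append(DEFAULT_TABLE.get(ch, ch))
    return "".join(out)


print(transliterate_sketch("Ärger"))                    # Arger
print(transliterate_sketch("Ärger", GERMAN_OVERRIDES))  # Aerger
```

Keeping the profile sparse means a language only lists the characters it treats differently; everything else falls through to the shared default.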

Language profiles

83 built-in language profiles with ISO 9:1995 scholarly Cyrillic support and 10 Indic scripts:

from translit import list_langs, transliterate

print(list_langs())
# ['am', 'ar', 'as', 'bg', 'bn', 'bo', 'ca', 'cs', 'cy', 'da', 'de', 'dv', 'el',
#  'es', 'et', 'fa', 'fi', 'fr', 'ga', 'gu', 'he', 'hi', 'hr', 'hu', 'hy',
#  'is', 'it', 'ja', 'jv', 'ka', 'km', 'kn', 'ko', 'lo', 'lt', 'lv', 'ml', 'mn',
#  'mr', 'mt', 'my', 'ne', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sa',
#  'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tr', 'uk', 'vi', 'zh']

# ISO 9:1995 scholarly transliteration
transliterate("Юрий", strict_iso9=True)  # → "Jurij"

Performance

translit is compiled Rust with O(1) compile-time perfect hash tables — no regex, no per-character Python iteration, no runtime data loading.

| Operation | Throughput | vs. legacy |
| --- | --- | --- |
| Transliterate (Latin) | 450M chars/sec | 38× faster than Unidecode |
| Transliterate (Cyrillic) | 130M chars/sec | 18× faster than Unidecode |
| Slugify | 849K slugs/sec | 10–24× faster than python-slugify |
| Batch transliterate (100 strings) | | 2.8× faster than loop |

See docs/performance.md for full benchmark methodology and results.
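
A chars/sec figure is simply characters processed divided by elapsed wall-clock time. The sketch below shows one rough way to take such a measurement on your own machine; it is not the project's benchmark methodology. chars_per_second and baseline are hypothetical names, and baseline is a pure-stdlib stand-in for whichever function you want to time.

```python
import time
import unicodedata
from typing import Callable


def chars_per_second(func: Callable[[str], str], sample: str, repeats: int = 200) -> float:
    """Time `repeats` calls of func(sample) and return characters processed per second."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(sample)
    elapsed = time.perf_counter() - start
    return len(sample) * repeats / elapsed


def baseline(text: str) -> str:
    # Pure-stdlib stand-in; swap in the function you actually want to measure.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


sample = "café Ünïcödé " * 1_000
print(f"{chars_per_second(baseline, sample):,.0f} chars/sec")
```

Results from a loop like this vary with hardware, input mix, and string length, so treat them as relative comparisons rather than absolute numbers.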

Drop-in replacement

translit provides compatibility aliases for painless migration from existing libraries:

from translit import unidecode, casefold, remove_accents

unidecode("café")        # → "cafe"       (alias for transliterate)
casefold("Straße")       # → "strasse"    (alias for fold_case)
remove_accents("café")   # → "cafe"       (alias for strip_accents)

sanitize_filename() also accepts replacement_text and max_len kwargs for pathvalidate compatibility, and is_confusable() accepts greedy for confusable_homoglyphs compatibility. See migration guides for details.

Exhaustive testing

translit is exhaustively tested with three layers of machine-verifiable assurance beyond conventional unit and property-based tests:

  • Compile-time assertions: build.rs asserts all transliteration table values are ASCII and entry counts match expectations — if any check fails, cargo build fails
  • Exhaustive domain coverage: Every Hangul syllable (11,172), every non-surrogate BMP code point (63,488), every code point in the CJK Unified Ideographs block (20,992), and every Indic script block are tested individually, with zero sampling gaps
  • Stated invariants: Seven stated properties (ASCII passthrough, idempotence, determinism, output bounds, etc.) verified by exhaustive enumeration and Hypothesis

See docs/formal-verification.md for details.
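
To illustrate what verifying invariants by exhaustive enumeration can look like, the sketch below checks ASCII-only output, idempotence, determinism, and ASCII passthrough over every non-surrogate BMP code point. It runs against ascii_fold, a hypothetical standard-library stand-in, not against translit itself, and it is not the project's test suite.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Stand-in transliterator: NFKD-decompose, then keep only ASCII characters."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


# Every BMP code point except the 2,048 surrogates (U+D800–U+DFFF).
bmp_scalars = [cp for cp in range(0x10000) if not 0xD800 <= cp <= 0xDFFF]
assert len(bmp_scalars) == 63_488

for cp in bmp_scalars:
    ch = chr(cp)
    out = ascii_fold(ch)
    assert out.isascii()           # output is pure ASCII
    assert ascii_fold(out) == out  # idempotence
    assert ascii_fold(ch) == out   # determinism
    if cp < 0x80:
        assert out == ch           # ASCII passthrough

print(f"checked {len(bmp_scalars):,} code points")
```

The domain is small enough to enumerate in full, which is what lets a property be checked for every input instead of a random sample.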

Architecture

Rust core with compile-time PHF (perfect hash function) tables for O(1) per-character lookup. Exposed to Python via PyO3 with the stable ABI (abi3-py39). The Chinese pinyin table contains 20,924 entries from the Unicode Unihan database; Korean romanization is purely algorithmic (jamo decomposition, ~100 lines of Rust).

Links

Source code https://github.com/raeq/translit
Releases https://github.com/raeq/translit/releases
PyPI package https://pypi.org/project/translit-rs/
Documentation https://translit.readthedocs.io/
Issue tracker https://github.com/raeq/translit/issues
Changelog https://github.com/raeq/translit/blob/main/CHANGELOG.md

License

MIT
