Skip to main content

Fast Unicode transliteration (including CJK), slugification, and text normalization — Rust-powered Python library

Project description

translit

Documentation License: MIT

Unicode text infrastructure for Python: transliteration, normalization, and safety analysis, powered by Rust.

Documentation | GitHub | PyPI

Features

  • Transliteration: Unicode → ASCII for Latin, Cyrillic, Greek, CJK (Chinese pinyin, Korean romanization, Japanese kana), and 37 language-specific profiles
  • Slugification: URL-safe slugs with python-slugify parameter compatibility
  • Filename sanitization: Cross-platform safe filenames with NFC normalization, path traversal protection, and Windows reserved name handling
  • Text normalization: NFC/NFD/NFKC/NFKD, confusable homoglyph detection (TR39), full Unicode case folding (1,557 CaseFolding.txt mappings via PHF), whitespace collapse
  • Precompiled pipelines: security_clean, ml_normalize, catalog_key, display_clean for common workflows
  • Grapheme clusters: Correct user-perceived character counting, splitting, and truncation
  • Hostname safety: Mixed-script and homoglyph attack detection
  • Encoding detection: Auto-detect and decode byte sequences to UTF-8 (chardetng)

All text processing is implemented in Rust with O(1) PHF lookups and exposed to Python via PyO3.

Installation

pip install translit-rs

The package installs as translit-rs on PyPI but imports as translit:

import translit  # not translit_rs

Requires Python 3.9+. Wheels are available for Linux, macOS, and Windows.

Quick start

from translit import transliterate, slugify, sanitize_filename

# Latin/Cyrillic/Greek
transliterate("café")          # → "cafe"
transliterate("Москва")        # → "Moskva"
transliterate("Ünïcödé")       # → "Unicode"

# Chinese (Hanzi → Pinyin)
transliterate("北京市")         # → "bei jing shi"
slugify("北京烤鸭")            # → "bei-jing-kao-ya"

# Korean (Hangul → Revised Romanization)
transliterate("서울")           # → "seo ul"
slugify("대한민국")            # → "dae-han-min-gug"

# Japanese (Hiragana/Katakana → Hepburn)
transliterate("ひらがな")       # → "hiragana"
transliterate("カタカナ")       # → "katakana"

# Language-specific transliteration
transliterate("Ärger", lang="de")  # → "Aerger"
transliterate("Київ", lang="uk")   # → "Kyiv"

# Slugification
slugify("Hello World!")            # → "hello-world"
slugify("café au lait")           # → "cafe-au-lait"

# Filename sanitization
sanitize_filename("my file<>.txt")         # → "my_file.txt"
sanitize_filename("CON.txt")               # → "_CON.txt"
sanitize_filename("../../etc/passwd")      # → ".etc_passwd"

CJK transliteration

Chinese characters are mapped to toneless pinyin from the Unicode Unihan kMandarin field, covering the full CJK Unified Ideographs block (U+4E00–U+9FFF, 20,924 characters). Korean Hangul syllables are algorithmically decomposed into jamo and romanized using the Revised Romanization standard (all 11,172 precomposed syllables). Japanese hiragana and katakana use Modified Hepburn; kanji fall back to Chinese pinyin readings.

This is context-free, character-by-character transliteration, the same approach as Unidecode. See docs/limitations.md for details on polyphony, phonological rules, and other trade-offs.

Precompiled pipelines

from translit import security_clean, ml_normalize, catalog_key

# Security: NFKC → confusables → strip bidi → collapse whitespace
security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")  # → "Real text"

# ML/NLP: NFKC → emoji→text → transliterate → strip accents → fold case
ml_normalize("Café ☕ Ünïcödé")  # → "cafe hot beverage unicode"

# Library catalog: NFKC → confusables → transliterate → strip accents → fold case
catalog_key("Москва", lang="ru")  # → "moskva"

Text builder

from translit import Text

result = (
    Text("Ünïcödé Café ☕")
    .normalize("NFKC")
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# → "unicode cafe hot beverage"

Package structure

The API is organized into domain-specific namespaces. All functions are also available at the top level for convenience.

Namespace Purpose Key functions
translit Core transforms transliterate, slugify, Text, TextPipeline
translit.normalization Unicode normalization normalize, strip_accents, fold_case, collapse_whitespace
translit.security Safety analysis is_confusable, is_mixed_script, is_safe_hostname, security_clean
translit.files Filename handling sanitize_filename
translit.codec Byte decoding decode_to_utf8, detect_encoding
# Namespace imports
from translit.security import is_confusable, security_clean
from translit.codec import decode_to_utf8
from translit.normalization import fold_case

# Top-level imports also work
from translit import is_confusable, security_clean, decode_to_utf8, fold_case

Script policies

Transliteration applies different policies depending on the script. This table documents what each script does and which standard it follows.

Script Policy Standard / Source Example
Latin (accented) Accent stripping Unicode NFKD decomposition ée
Cyrillic Phonetic romanization ISO 9:1995 (scholarly, via strict_iso9=True) or GOST-based (default) МоскваMoskva
Greek Transliteration BGN/PCGN romanization ΑθήναAthena
Chinese (Hanzi) Romanization Unihan kMandarin (toneless pinyin) 北京bei jing
Korean (Hangul) Romanization Revised Romanization of Korean 서울seo ul
Japanese (Kana) Romanization Modified Hepburn ひらがなhiragana
Japanese (Kanji) Romanization Falls back to Chinese pinyin readings 東京dong jing
Arabic Transliteration Buckwalter-derived مرحباmrhba
Devanagari Transliteration IAST-derived नमस्तेnamaste
Georgian Transliteration National romanization თბილისიtbilisi
Armenian Transliteration BGN/PCGN ԵրևանErevan

All transliteration is context-free and character-by-character, the same approach as AnyAscii/Unidecode. No linguistic analysis, polyphony handling, or phonological rules. See docs/limitations.md for trade-offs.

Language-specific profiles (e.g., lang="de") apply sparse overrides on top of the default table. For example, German maps üue instead of the default u.

Language profiles

37 built-in language profiles with ISO 9:1995 scholarly Cyrillic support:

from translit import list_langs, transliterate

print(list_langs())
# ['ar', 'bg', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'es', 'et',
#  'fi', 'fr', 'ga', 'hr', 'hu', 'is', 'it', 'ja', 'ko', 'lt',
#  'lv', 'mt', 'nl', 'no', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl',
#  'sq', 'sr', 'sv', 'tr', 'uk', 'vi', 'zh']

# ISO 9:1995 scholarly transliteration
transliterate("Юрий", strict_iso9=True)  # → "Jurij"

Performance

translit is compiled Rust with O(1) compile-time perfect hash tables — no regex, no per-character Python iteration, no runtime data loading.

Operation Throughput vs. legacy
Transliterate (Latin) 693M chars/sec 58× faster than Unidecode
Transliterate (Cyrillic) 196M chars/sec 27× faster than Unidecode
Slugify 1.12M slugs/sec 10–24× faster than python-slugify
Batch transliterate (100 strings) 2.7× faster than loop

See docs/performance.md for full benchmark methodology and results.

Drop-in replacement

translit provides compatibility aliases for painless migration from existing libraries:

from translit import unidecode, casefold, remove_accents

unidecode("café")        # → "cafe"       (alias for transliterate)
casefold("Straße")       # → "strasse"    (alias for fold_case)
remove_accents("café")   # → "cafe"       (alias for strip_accents)

sanitize_filename() also accepts replacement_text and max_len kwargs for pathvalidate compatibility, and is_confusable() accepts greedy for confusable_homoglyphs compatibility. See migration guides for details.

Documentation

Guides by role:

Architecture

Rust core with compile-time PHF (perfect hash function) tables for O(1) per-character lookup. Exposed to Python via PyO3 with the stable ABI (abi3-py39). The Chinese pinyin table contains 20,924 entries from the Unicode Unihan database; Korean romanization is purely algorithmic (jamo decomposition, ~100 lines of Rust).

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

translit_rs-0.1.0.tar.gz (513.2 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

translit_rs-0.1.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.5 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ ARM64

translit_rs-0.1.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.5 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ ARM64

translit_rs-0.1.0-cp39-abi3-win_amd64.whl (1.4 MB view details)

Uploaded CPython 3.9+Windows x86-64

translit_rs-0.1.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.5 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ ARM64

translit_rs-0.1.0-cp39-abi3-macosx_11_0_arm64.whl (1.3 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

translit_rs-0.1.0-cp39-abi3-macosx_10_12_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.9+macOS 10.12+ x86-64

translit_rs-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

File details

Details for the file translit_rs-0.1.0.tar.gz.

File metadata

  • Download URL: translit_rs-0.1.0.tar.gz
  • Upload date:
  • Size: 513.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.12.6

File hashes

Hashes for translit_rs-0.1.0.tar.gz
Algorithm Hash digest
SHA256 f09d0a8366e12c05f00d3e7bcd9472b5fe6bb5f890bec0ce3dd675d695c2f5a8
MD5 1130d1f9087454ab2b71b1a3d6f9d134
BLAKE2b-256 64881522a26159e88681b3a11214fd7e976f517f5cac99186bc11cded706330d

See more details on using hashes here.

File details

Details for the file translit_rs-0.1.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for translit_rs-0.1.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 49ac841126f45c99f18a66ee26f8fb1a40d60eeb82f78a3c365c3830e3ad5230
MD5 74f33497f228874af1ac134e0d4cee44
BLAKE2b-256 a350de21359cb404bf4edda8bb530e0c72cf9f29c380b95378336912522daa35

See more details on using hashes here.

File details

Details for the file translit_rs-0.1.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for translit_rs-0.1.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 64aed1f96932fc28fe572d47ae85cdba581f8d211f9da76989696870d9fa32bd
MD5 82b82b034be6ece9c14aefe4dd76fcb3
BLAKE2b-256 94e36d0ae5225cdbace57390c34711a11adf0dc4dcfb9762c0de2897e930e2c8

See more details on using hashes here.

File details

Details for the file translit_rs-0.1.0-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for translit_rs-0.1.0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 43e0b8ee07d8ed4f0d221552d741b3ba5c1b93d7e07f3f734f9ae98acae25cb3
MD5 02237dbc50c49c1a7eb9984c387d8534
BLAKE2b-256 4f6ff91a84563a73d1d4cde0496ce0f71fa235d59c83e19be1e99c828929dfb4

See more details on using hashes here.

File details

Details for the file translit_rs-0.1.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for translit_rs-0.1.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 d62820cf8f88c084a3008c6e062f2f7dff2362979a585c8863169ee8df5ef54a
MD5 78b57f7e9c63d8b9495540cf9902ad88
BLAKE2b-256 6a8bbb5945538cf2014e746845ee613ecdd896b6cd20b6be9e55dc0e5052865c

See more details on using hashes here.

File details

Details for the file translit_rs-0.1.0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for translit_rs-0.1.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 78db23803899e0c1541a75878dde3d62db697dd1a7d14f3ad8d89af1aa9db79f
MD5 4845836182ce3d326e3c1cf8c9f54ae4
BLAKE2b-256 f67a0fab2cafee15f67f30d22e19b8e33e336c9f14f41e1cb5e9e53cee638586

See more details on using hashes here.

File details

Details for the file translit_rs-0.1.0-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for translit_rs-0.1.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 f49bf624ad839f0d9b4c45394fef370a64b3f181a3af232f0fa1923211565b9f
MD5 1ee83f9d15a9bd0e995f05757c5d4af2
BLAKE2b-256 3847ffb6f619c899cfca6db096195f66b0e7542640d844f0135e83a7904a5622

See more details on using hashes here.

File details

Details for the file translit_rs-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for translit_rs-0.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 38444063f4564e46548178a48bdca57558dcb01021e23752b92d78f19d1f9885
MD5 caf257bb90e1ae42eebb99a6bcac73e8
BLAKE2b-256 cb2fe3fca8a3b7a1ad4c128b6a4246054382614313b86a617344bd62f4163cb9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page