Fast Unicode transliteration (including CJK), slugification, and text normalization — Rust-powered Python library

translit

Unicode text infrastructure for Python: transliteration, normalization, and safety analysis, powered by Rust.

Features

  • Transliteration: Unicode → ASCII for Latin, Cyrillic, Greek, CJK (Chinese pinyin, Korean romanization, Japanese kana), and 37 language-specific profiles
  • Slugification: URL-safe slugs with python-slugify parameter compatibility
  • Filename sanitization: Cross-platform safe filenames with NFC normalization, path traversal protection, and Windows reserved name handling
  • Text normalization: NFC/NFD/NFKC/NFKD, confusable homoglyph detection (TR39), full Unicode case folding (1,557 CaseFolding.txt mappings via PHF), whitespace collapse
  • Precompiled pipelines: security_clean, ml_normalize, catalog_key, display_clean for common workflows
  • Grapheme clusters: Correct user-perceived character counting, splitting, and truncation
  • Hostname safety: Mixed-script and homoglyph attack detection
  • Encoding detection: Auto-detect and decode byte sequences to UTF-8 (chardetng)

All text processing is implemented in Rust with O(1) PHF lookups and exposed to Python via PyO3.
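
To make the mixed-script idea behind the hostname safety feature concrete, here is a minimal standard-library sketch. It is not the library's implementation: Python's stdlib does not expose the Unicode Script property, so the hypothetical helper `script_guess` uses the first word of each character's Unicode name as a rough stand-in, and all function names below are made up for illustration.

```python
import unicodedata


def script_guess(ch: str) -> str:
    """Crude script guess: the first word of the character's Unicode name."""
    # Pass a default so unnamed code points do not raise ValueError.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_used(text: str) -> set:
    """Collect the guessed scripts of every alphabetic character in text."""
    return {script_guess(ch) for ch in text if ch.isalpha()}


def looks_mixed_script(text: str) -> bool:
    """True when letters from more than one guessed script appear together."""
    return len(scripts_used(text)) > 1


spoofed = "p\u0430ypal"  # U+0430 is CYRILLIC SMALL LETTER A, not Latin "a"
print(sorted(scripts_used(spoofed)))   # ['CYRILLIC', 'LATIN']
print(looks_mixed_script(spoofed))     # True
print(looks_mixed_script("paypal"))    # False
```

A real check would use proper script data and the TR39 confusables table; this sketch only shows why a single Cyrillic letter inside a Latin hostname is detectable at all.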

Installation

pip install translit-rs

The package installs as translit-rs on PyPI but imports as translit:

import translit  # not translit_rs

Requires Python 3.9+. Wheels are available for Linux, macOS, and Windows.

Quick start

from translit import transliterate, slugify, sanitize_filename

# Latin/Cyrillic/Greek
transliterate("café")          # → "cafe"
transliterate("Москва")        # → "Moskva"
transliterate("Ünïcödé")       # → "Unicode"

# Chinese (Hanzi → Pinyin)
transliterate("北京市")         # → "bei jing shi"
slugify("北京烤鸭")            # → "bei-jing-kao-ya"

# Korean (Hangul → Revised Romanization)
transliterate("서울")           # → "seo ul"
slugify("대한민국")            # → "dae-han-min-gug"

# Japanese (Hiragana/Katakana → Hepburn)
transliterate("ひらがな")       # → "hiragana"
transliterate("カタカナ")       # → "katakana"

# Language-specific transliteration
transliterate("Ärger", lang="de")  # → "Aerger"
transliterate("Київ", lang="uk")   # → "Kyiv"

# Slugification
slugify("Hello World!")            # → "hello-world"
slugify("café au lait")           # → "cafe-au-lait"

# Filename sanitization
sanitize_filename("my file<>.txt")         # → "my_file.txt"
sanitize_filename("CON.txt")               # → "_CON.txt"
sanitize_filename("../../etc/passwd")      # → ".etc_passwd"
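
The filename rules mentioned above (NFC normalization, forbidden characters, Windows reserved names) can be sketched with the standard library alone. This is an illustration under assumptions, not the library's algorithm: `sanitize_sketch` is a hypothetical name, and its output will differ from `sanitize_filename` in places (for example, it does not collapse runs of replacements the way the "my file<>.txt" example does).

```python
import re
import unicodedata

# Device names that Windows reserves regardless of extension.
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{prefix}{n}" for prefix in ("COM", "LPT") for n in range(1, 10)
}
# Characters Windows forbids in file names, plus ASCII control characters.
FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_sketch(name: str, replacement: str = "_") -> str:
    """Hypothetical sanitizer: NFC, replace forbidden chars, guard reserved names."""
    name = unicodedata.normalize("NFC", name)
    # Replacing both slash styles removes every path separator.
    name = FORBIDDEN.sub(replacement, name)
    # Windows silently drops trailing dots and spaces, so strip them up front.
    name = name.rstrip(". ")
    stem = name.split(".", 1)[0]
    if stem.upper() in WINDOWS_RESERVED:
        name = replacement + name
    return name or replacement


print(sanitize_sketch("CON.txt"))           # _CON.txt
print(sanitize_sketch("a<b>.txt"))          # a_b_.txt
print(sanitize_sketch("../../etc/passwd"))  # .._.._etc_passwd
```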

CJK transliteration

Chinese characters are mapped to toneless pinyin from the Unicode Unihan kMandarin field; the table holds 20,924 entries from the CJK Unified Ideographs block (U+4E00–U+9FFF). Korean Hangul syllables are algorithmically decomposed into jamo and romanized using the Revised Romanization standard (all 11,172 precomposed syllables). Japanese hiragana and katakana use Modified Hepburn; kanji fall back to Chinese pinyin readings.

This is context-free, character-by-character transliteration, the same approach as Unidecode. See docs/limitations.md for details on polyphony, phonological rules, and other trade-offs.
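
The Hangul decomposition described above is plain arithmetic on the code point, which a short sketch can show. The index formula comes from the Unicode Standard; the three romanization tables are my own transcription of Revised Romanization values and are an assumption, not the library's data. The library's final-consonant table evidently differs in places: the Quick start shows "gug" for 국, where this table yields "guk".

```python
HANGUL_BASE = 0xAC00   # first precomposed syllable
HANGUL_COUNT = 11172   # 19 initials * 21 vowels * 28 finals

# Assumed Revised Romanization values, in Unicode jamo order.
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
            "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
FINALS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
          "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"]


def romanize_syllable(ch: str) -> str:
    """Romanize one precomposed Hangul syllable; pass anything else through."""
    index = ord(ch) - HANGUL_BASE
    if not 0 <= index < HANGUL_COUNT:
        return ch
    # Syllables are laid out as initial-major, then vowel, then final.
    initial, rest = divmod(index, 21 * 28)
    vowel, final = divmod(rest, 28)
    return INITIALS[initial] + VOWELS[vowel] + FINALS[final]


def romanize_hangul(text: str) -> str:
    """Context-free romanization; a space separates every character for simplicity."""
    return " ".join(romanize_syllable(ch) for ch in text)


print(romanize_hangul("서울"))  # seo ul
```

Because each syllable is handled in isolation, sound changes across syllable boundaries are ignored, which is the trade-off the paragraph above describes.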

Precompiled pipelines

from translit import security_clean, ml_normalize, catalog_key

# Security: NFKC → confusables → strip bidi → collapse whitespace
security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")  # → "Real text"

# ML/NLP: NFKC → emoji→text → transliterate → strip accents → fold case
ml_normalize("Café ☕ Ünïcödé")  # → "cafe hot beverage unicode"

# Library catalog: NFKC → confusables → transliterate → strip accents → fold case
catalog_key("Москва", lang="ru")  # → "moskva"
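
For readers who want to see what the security pipeline stages amount to, here is a rough standard-library approximation of three of them (NFKC, bidi-control stripping, whitespace collapse). The confusables stage is omitted because the stdlib ships no TR39 data, and `security_clean_sketch` is a hypothetical name rather than part of translit.

```python
import unicodedata

# Bidirectional control characters: ALM, LRM, RLM, the embedding/override
# controls U+202A..U+202E, and the isolate controls U+2066..U+2069.
BIDI_CONTROLS = dict.fromkeys(
    [0x061C, 0x200E, 0x200F, *range(0x202A, 0x202F), *range(0x2066, 0x206A)]
)


def security_clean_sketch(text: str) -> str:
    """Approximate NFKC -> strip bidi -> collapse whitespace (no confusables step)."""
    text = unicodedata.normalize("NFKC", text)
    # A translate table whose values are None deletes those code points.
    text = text.translate(BIDI_CONTROLS)
    # split() with no arguments splits on any run of Unicode whitespace.
    return " ".join(text.split())


print(security_clean_sketch("ℝ𝕖𝕒𝕝  𝕥𝕖𝕩𝕥\u202e"))  # Real text
```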

Text builder

from translit import Text

result = (
    Text("Ünïcödé Café ☕")
    .normalize("NFKC")
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# → "unicode cafe hot beverage"
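
The chaining style above works because each step returns a new wrapper object and `.value` unwraps the result. A minimal sketch of that pattern, assuming nothing about how `Text` is actually implemented (`TextSketch` is a hypothetical class, it has no transliteration step, and it relies only on `unicodedata`):

```python
import unicodedata


class TextSketch:
    """Hypothetical immutable builder: every step returns a new instance."""

    def __init__(self, value: str) -> None:
        self.value = value

    def normalize(self, form: str) -> "TextSketch":
        return TextSketch(unicodedata.normalize(form, self.value))

    def strip_accents(self) -> "TextSketch":
        # Decompose, drop combining marks, then recompose what is left.
        decomposed = unicodedata.normalize("NFD", self.value)
        kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return TextSketch(unicodedata.normalize("NFC", kept))

    def fold_case(self) -> "TextSketch":
        return TextSketch(self.value.casefold())


result = (
    TextSketch("Ünïcödé Café")
    .normalize("NFKC")
    .strip_accents()
    .fold_case()
    .value
)
print(result)  # unicode cafe
```

Returning a fresh object at each step keeps intermediate values reusable, which is one common reason to prefer this design over mutating in place.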

Package structure

The API is organized into domain-specific namespaces. All functions are also available at the top level for convenience.

| Namespace | Purpose | Key functions |
|---|---|---|
| translit | Core transforms | transliterate, slugify, Text, TextPipeline |
| translit.normalization | Unicode normalization | normalize, strip_accents, fold_case, collapse_whitespace |
| translit.security | Safety analysis | is_confusable, is_mixed_script, is_safe_hostname, security_clean |
| translit.files | Filename handling | sanitize_filename |
| translit.codec | Byte decoding | decode_to_utf8, detect_encoding |

# Namespace imports
from translit.security import is_confusable, security_clean
from translit.codec import decode_to_utf8
from translit.normalization import fold_case

# Top-level imports also work
from translit import is_confusable, security_clean, decode_to_utf8, fold_case

Script policies

Transliteration applies different policies depending on the script. This table documents what each script does and which standard it follows.

| Script | Policy | Standard / Source | Example |
|---|---|---|---|
| Latin (accented) | Accent stripping | Unicode NFKD decomposition | é → e |
| Cyrillic | Phonetic romanization | ISO 9:1995 (scholarly, via strict_iso9=True) or GOST-based (default) | Москва → Moskva |
| Greek | Transliteration | BGN/PCGN romanization | Αθήνα → Athena |
| Chinese (Hanzi) | Romanization | Unihan kMandarin (toneless pinyin) | 北京 → bei jing |
| Korean (Hangul) | Romanization | Revised Romanization of Korean | 서울 → seo ul |
| Japanese (Kana) | Romanization | Modified Hepburn | ひらがな → hiragana |
| Japanese (Kanji) | Romanization | Falls back to Chinese pinyin readings | 東京 → dong jing |
| Arabic | Transliteration | Buckwalter-derived | مرحبا → mrhba |
| Devanagari | Transliteration | IAST-derived | नमस्ते → namaste |
| Georgian | Transliteration | National romanization | თბილისი → tbilisi |
| Armenian | Transliteration | BGN/PCGN | Երևան → Erevan |

All transliteration is context-free and character-by-character, the same approach as AnyAscii and Unidecode: it performs no linguistic analysis, polyphony handling, or phonological rule application. See docs/limitations.md for trade-offs.

Language-specific profiles (e.g., lang="de") apply sparse overrides on top of the default table. For example, German maps ü → ue instead of the default u.
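
The "sparse override" idea can be sketched in a few lines: a small per-language dictionary is consulted first, and anything it does not mention falls through to the default mapping. Everything here is hypothetical (the `OVERRIDES` table, `default_map`, and `transliterate_sketch` are illustrative names), and the default is approximated with NFKD accent stripping rather than the library's real tables.

```python
import unicodedata

# Hypothetical sparse override table: only the characters that differ
# from the default mapping are listed for each language.
OVERRIDES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"},
}


def default_map(ch: str) -> str:
    """Stand-in default: NFKD-decompose and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate_sketch(text: str, lang=None) -> str:
    """Look each character up in the language overrides, then the default."""
    overrides = OVERRIDES.get(lang, {})
    # Compose first so "ü" typed as u + combining diaeresis still matches.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        out.append(overrides[ch] if ch in overrides else default_map(ch))
    return "".join(out)


print(transliterate_sketch("Ärger", lang="de"))  # Aerger
print(transliterate_sketch("Ärger"))             # Arger
```

Keeping overrides sparse means each profile lists only its handful of exceptions, so adding a language does not require duplicating the whole default table.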

Language profiles

translit ships 37 built-in language profiles, plus ISO 9:1995 scholarly Cyrillic transliteration via strict_iso9=True:

from translit import list_langs, transliterate

print(list_langs())
# ['ar', 'bg', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'es', 'et',
#  'fi', 'fr', 'ga', 'hr', 'hu', 'is', 'it', 'ja', 'ko', 'lt',
#  'lv', 'mt', 'nl', 'no', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl',
#  'sq', 'sr', 'sv', 'tr', 'uk', 'vi', 'zh']

# ISO 9:1995 scholarly transliteration
transliterate("Юрий", strict_iso9=True)  # → "Jurij"

Performance

translit is compiled Rust backed by perfect hash tables generated at compile time, so each character lookup is O(1): no regex, no per-character Python iteration, no runtime data loading.

| Operation | Throughput | vs. legacy |
|---|---|---|
| Transliterate (Latin) | 693M chars/sec | 58× faster than Unidecode |
| Transliterate (Cyrillic) | 196M chars/sec | 27× faster than Unidecode |
| Slugify | 1.12M slugs/sec | 10–24× faster than python-slugify |
| Batch transliterate (100 strings) | | 2.7× faster than loop |

See docs/performance.md for full benchmark methodology and results.
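
To sanity-check throughput figures like these on your own machine, a small timing harness is enough. The sketch below is not the methodology from docs/performance.md; `chars_per_second` is a hypothetical helper, and it times `str.casefold` only so that it runs without translit installed. Pass `transliterate` instead to measure the library.

```python
import time


def chars_per_second(func, sample: str, repeat: int = 5, loops: int = 1000) -> float:
    """Best-of-`repeat` throughput of func(sample), in characters per second."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(loops):
            func(sample)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return len(sample) * loops / max(best, 1e-12)


sample = "Ünïcödé café " * 100
rate = chars_per_second(str.casefold, sample)
print(f"{rate / 1e6:.1f}M chars/sec")
```

Taking the best of several runs reduces noise from other processes; absolute numbers still depend heavily on hardware and input text.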

Drop-in replacement

translit provides compatibility aliases for painless migration from existing libraries:

from translit import unidecode, casefold, remove_accents

unidecode("café")        # → "cafe"       (alias for transliterate)
casefold("Straße")       # → "strasse"    (alias for fold_case)
remove_accents("café")   # → "cafe"       (alias for strip_accents)

sanitize_filename() also accepts the replacement_text and max_len keyword arguments for pathvalidate compatibility, and is_confusable() accepts greedy for confusable_homoglyphs compatibility. See the migration guides for details.

Architecture

Rust core with compile-time PHF (perfect hash function) tables for O(1) per-character lookup. Exposed to Python via PyO3 with the stable ABI (abi3-py39). The Chinese pinyin table contains 20,924 entries from the Unicode Unihan database; Korean romanization is purely algorithmic (jamo decomposition, ~100 lines of Rust).

License

MIT
