Unicode canonicalization and TR39 confusable analysis for Python: building blocks for text-security pipelines (homoglyph/bidi/zalgo/invisible-character handling) plus standards-based transliteration — powered by Rust

Project description

translit

Unicode canonicalization and TR39 confusable analysis for Python — building blocks for text-security pipelines (homoglyph/bidi/zalgo/invisible-character handling) plus standards-based transliteration. Rust-powered.

Documentation | API Reference | PyPI

Demo

Try translit in your browser

Why translit

The text-cleaning libraries already in most pipelines (ftfy, unidecode, anyascii) were built for encoding repair and ASCII conversion. The transliterators among them map Cyrillic letters phonetically (Cyrillic р → Latin r), which does not reverse a homoglyph substitution.

translit implements visual confusable mapping per Unicode TR39 (Cyrillic р → Latin p). In a controlled benchmark (six attack types, three downstream tasks, two architectures; 435,864 observations), visual TR39 mapping reached XMR = 1.000 on the tested TR39 homoglyph pairs (17 Latin–Cyrillic, 19 Greek), where phonetic transliterators plateaued near half:

| Tool class | Mapping | Homoglyph XMR (tested TR39 pairs) |
| --- | --- | --- |
| unidecode, anyascii, cyrtranslit, uroman | phonetic | ~0.49 |
| translit (strip_obfuscation / normalize_confusables) | visual (TR39) | 1.000 |

ftfy was statistically equivalent to no preprocessing; unidecode degraded accuracy on invisible-character attacks. Details: Adversarial-Text Defense (paper "Fire Extinguishers Full of Gasoline"; XMR metric: Zenodo 10.5281/zenodo.19323513).
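The difference between phonetic and visual mapping can be illustrated with a small standard-library sketch. The two dictionaries below are hand-written toy tables covering five Cyrillic letters; they are assumptions for illustration, not translit's bundled TR39 data, and `fold` is a hypothetical helper rather than part of the library.

```python
# Toy tables (assumption): a handful of Cyrillic letters only.
VISUAL = {
    "\u0440": "p",  # Cyrillic er LOOKS like Latin p
    "\u0441": "c",  # Cyrillic es LOOKS like Latin c
    "\u0430": "a",
    "\u043e": "o",
    "\u0435": "e",
}
PHONETIC = {
    "\u0440": "r",  # Cyrillic er SOUNDS like r
    "\u0441": "s",  # Cyrillic es SOUNDS like s
    "\u0430": "a",
    "\u043e": "o",
    "\u0435": "e",
}


def fold(text, table):
    """Replace each character found in `table`; leave everything else untouched."""
    return "".join(table.get(ch, ch) for ch in text)


# Renders like "product", but the first and sixth letters are Cyrillic.
spoofed = "\u0440rodu\u0441t"

print(fold(spoofed, VISUAL))    # product
print(fold(spoofed, PHONETIC))  # rrodust
```

The phonetic table produces valid ASCII, but not the word the attacker was imitating, so a downstream keyword match or classifier still sees the wrong token.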

Scope. translit is a defense-in-depth layer, not a complete control. It canonicalizes the confusables it bundles (TR39) and strips the format characters it enumerates; it does not promise to stop any attack class, and the confusable space is far larger than any table. See the Threat Model for what is and isn't in scope.
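To make "format characters" concrete, here is a minimal sketch that finds and removes every character in Unicode general category Cf using only the standard library. The helper names are hypothetical, and the Cf category is an assumption chosen for illustration: the set translit actually enumerates may be narrower or wider, and some Cf characters (for example the zero-width joiner inside emoji sequences) are legitimate in ordinary text.

```python
import unicodedata


def find_format_chars(text):
    """Return (index, name) for every character in general category Cf."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_format_chars(text):
    """Drop every Cf character; this is blunt and ignores legitimate uses."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# A zero-width space and a right-to-left override hidden in a hostname.
sample = "pay\u200bpal\u202e.com"

print(find_format_chars(sample))
# [(3, 'ZERO WIDTH SPACE'), (7, 'RIGHT-TO-LEFT OVERRIDE')]
print(strip_format_chars(sample))  # paypal.com
```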

from translit import strip_obfuscation, normalize_confusables, is_safe_hostname

# Fold Cyrillic look-alikes to their Latin prototypes (TR39 visual mapping)
strip_obfuscation("рroduсt")        # → "product"  (р→p, с→c)
strip_obfuscation("pаypаl 🔥🔥")     # → "paypal fire fire"  (also strips zalgo/bidi/invisible/emoji)

normalize_confusables("раypal")      # → "paypal"   (mixed Cyrillic skeleton → Latin)

# IDN / hostname spoofing check
safe, details = is_safe_hostname("аpple.com")   # leading Cyrillic а
# safe is False; details.has_confusables and details.mixed_script flag why
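For intuition about what a mixed-script flag means, the sketch below approximates script detection with the standard library alone. It is not how `is_safe_hostname` is implemented: `scripts_in` and `looks_mixed_script` are hypothetical helpers, and taking the first word of a character's Unicode name as its script is a rough heuristic, not the Unicode Script property.

```python
import unicodedata


def scripts_in(label):
    """Approximate the set of scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(hostname):
    """True if any dot-separated label mixes letters from more than one script."""
    return any(len(scripts_in(label)) > 1 for label in hostname.split("."))


print(scripts_in("\u0430pple") == {"CYRILLIC", "LATIN"})  # True
print(looks_mixed_script("\u0430pple.com"))  # True  (leading Cyrillic a)
print(looks_mixed_script("apple.com"))       # False
```

Checking per label rather than per hostname keeps a fully Cyrillic label under a Latin top-level domain from being flagged, which is one reason a dedicated check is preferable to a whole-string test.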

Installation

pip install translit-rs

The package installs as translit-rs on PyPI but imports as translit:

import translit  # not translit_rs

Requires Python 3.9+. Wheels are available for Linux, macOS, and Windows.
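Because the distribution name and the import name differ, version lookups go through the distribution name. A minimal sketch using the standard library's `importlib.metadata` follows; `installed_version` is a hypothetical helper, and it returns `None` rather than raising when the distribution is not installed.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Look up by the PyPI distribution name, not the import name.
print(installed_version("translit-rs"))
```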

Features

All text processing is implemented in Rust with O(1) PHF lookups and exposed to Python via PyO3.

Quick start

Defense & canonicalization

from translit import (
    is_confusable, normalize_confusables, strip_obfuscation,
    security_clean, sanitize_user_input,
)

is_confusable("аpple")             # → True  (contains Cyrillic а)
normalize_confusables("раypal")  # → "paypal"

# Maximum deobfuscation: homoglyphs, zalgo, invisible chars, bidi, emoji → clean text
strip_obfuscation("рroduсt")  # → "product"   (does NOT transliterate; chain transliterate() if needed)

# Pipelines
security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")            # → "Real text"   (NFKC → confusables → strip bidi → collapse ws)
sanitize_user_input("pаypal")      # → "paypal"      (NFKC → strip bidi → strip zero-width → strip zalgo → confusables → collapse ws)

Transliteration (standards-based core)

from translit import transliterate, slugify

transliterate("café")                      # → "cafe"
transliterate("Москва")                    # → "Moskva"     (Cyrillic, BGN/PCGN)
transliterate("Αθήνα")                     # → "Athina"     (Greek, BGN/PCGN)

# Named standards (Latin / Cyrillic / Greek)
transliterate("Юрий", strict_iso9=True)    # → "Jurij"      (ISO 9-style ASCII)
transliterate("Москва", gost7034=True)     # → "Moskva"     (GOST R 7.0.34)

# Language profiles (sparse overrides on top of the default table)
transliterate("Ärger", lang="de")          # → "Aerger"
transliterate("Київ", lang="uk")           # → "Kyiv"

# Auto-detect language from script
transliterate("Москва", lang="auto")       # → "Moskva"     (detects Cyrillic → Russian)

# Reverse transliteration (Latin → native script): Russian, Ukrainian, Greek
transliterate("Moskva", target="ru")       # → "Москва"
transliterate("Athina", target="el")       # → "Αθηνα"

# Slugs & filenames
slugify("café au lait")                    # → "cafe-au-lait"
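For comparison, a common standard-library approach to slugs is sketched below. `simple_slug` is a hypothetical stand-in, not translit's `slugify`: it decomposes accents and then discards anything that is not ASCII, so it handles accented Latin text but silently loses other scripts, which is the gap a transliteration step fills.

```python
import re
import unicodedata


def simple_slug(text):
    """ASCII-only slug: decompose accents, drop non-ASCII, hyphenate the rest."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


print(simple_slug("café au lait"))       # cafe-au-lait
print(simple_slug("  Hello, World!  "))  # hello-world
print(repr(simple_slug("Москва")))       # '' (Cyrillic is dropped entirely)
```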

Compatibility coverage (CJK and other scripts)

# Context-free, character-by-character — best-effort, unidecode-parity (see caveats below)
transliterate("北京市")                     # → "bei jing shi"   (Chinese, toneless pinyin)
transliterate("서울")                       # → "seo ul"         (Korean, Revised Romanization)
transliterate("ひらがな")                   # → "hiragana"       (Japanese, Hepburn)

Coverage tiers

translit transliterates a very wide range of scripts, but the quality guarantee differs by tier. Lead with the core; treat the rest as compatibility coverage.

| Tier | Scripts | Policy | Standard |
| --- | --- | --- | --- |
| Core (best-in-class) | Latin, Cyrillic, Greek | Standards-based romanization + reverse | BGN/PCGN (default), ISO 9-style ASCII (strict_iso9), GOST R 7.0.34 (gost7034) |
| Compatibility (best-effort) | CJK (Chinese / Japanese / Korean), Arabic, Hebrew, Devanagari & 9 other Indic scripts, Thai, Lao | Context-free, character-by-character; same approach as Unidecode/AnyAscii | Unihan kMandarin, Revised Romanization, Hepburn, UNGEGN/IAST-derived, RTGS-derived |
| Best-effort | Georgian, Armenian, and a long tail of additional scripts | Context-free coverage so input is never silently dropped | see Language support |

Compatibility-tier transliteration is context-free and character-by-character — no linguistic analysis, polyphony handling, or phonological rules. For CJK/Arabic/Indic this is fundamentally lossy and no better than Unidecode; it exists so translit is a complete drop-in, not because it is best-in-class there. See docs/limitations.md for trade-offs and the full per-script policy table.

Context-aware abjad (Arabic, Persian, Hebrew): an optional dictionary-backed mode (transliterate(text, context=True)) restores vowels for more readable output. It is a best-effort readability aid, not a romanization standard. See Abjad scripts.

Precompiled pipelines

from translit import security_clean, ml_normalize, catalog_key, sanitize_user_input, strip_obfuscation

# Security: NFKC → confusables → strip bidi → collapse whitespace
security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")  # → "Real text"

# ML/NLP: NFKC → emoji→text → transliterate → strip accents → fold case
ml_normalize("Café ☕ Ünïcödé")  # → "cafe hot beverage unicode"

# Library catalog: NFKC → transliterate → confusables → strip accents → fold case
catalog_key("Москва", lang="ru")  # → "moskva"
catalog_key("ΩMEGA  café")        # → "omega cafe"

# Web input: NFKC → strip bidi → strip zero-width → strip zalgo → confusables → collapse
sanitize_user_input("pаypal")  # → "paypal" (Cyrillic а folded to Latin)

# Maximum deobfuscation: homoglyphs, zalgo, invisible chars → clean text
strip_obfuscation("рroduсt")       # → "product" (Cyrillic р→p, с→c via TR39)
strip_obfuscation("pаypаl 🔥🔥")  # → "paypal fire fire"
# Note: does NOT transliterate — chain with transliterate() if needed
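The arrow notation in the comments above describes an ordered chain of steps. As a rough illustration of that idea, the sketch below composes three standard-library steps into one callable. It is a toy, not the library's implementation: `make_pipeline` and the step functions are hypothetical names, the bidi control list is an assumption, and the confusable-folding step is omitted because it needs the TR39 data.

```python
import unicodedata
from functools import reduce

# Bidirectional control characters (assumed list for this sketch).
BIDI_CONTROLS = set(
    "\u061c\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def nfkc(text):
    return unicodedata.normalize("NFKC", text)


def strip_bidi_controls(text):
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


def collapse_ws(text):
    return " ".join(text.split())


def make_pipeline(*steps):
    """Return a function that applies `steps` left to right."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


toy_clean = make_pipeline(nfkc, strip_bidi_controls, collapse_ws)

# Double-struck letters fold to plain ASCII under NFKC.
print(toy_clean("\u211d\U0001d556\U0001d552\U0001d55d   \U0001d565\U0001d556\U0001d569\U0001d565"))  # Real text
print(toy_clean("a\u202eb \t c"))  # ab c
```

Step order matters: running NFKC first means later steps see compatibility characters in their folded form.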

Text builder

from translit import Text

result = (
    Text("Ünïcödé Café ☕")
    .normalize(form="NFKC")
    .demojize()
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# → "unicode cafe hot beverage"
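The chained style above is a fluent builder: each method returns an object carrying the transformed string, so calls can be strung together and the result read from `.value`. A minimal sketch of that pattern, using only the standard library, is shown below. `MiniText` is a hypothetical class for illustration; it covers three of the steps and says nothing about how translit's `Text` is implemented.

```python
import unicodedata


class MiniText:
    """Toy fluent wrapper: every step returns a new MiniText."""

    def __init__(self, value):
        self.value = value

    def normalize(self, form="NFKC"):
        return MiniText(unicodedata.normalize(form, self.value))

    def strip_accents(self):
        # Decompose, drop nonspacing marks, then recompose what is left.
        decomposed = unicodedata.normalize("NFD", self.value)
        kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
        return MiniText(unicodedata.normalize("NFC", kept))

    def fold_case(self):
        return MiniText(self.value.casefold())


original = MiniText("Ünïcödé Café")
result = original.normalize(form="NFKC").strip_accents().fold_case().value

print(result)          # unicode cafe
print(original.value)  # Ünïcödé Café (the starting object is not mutated)
```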

Package structure

The API is organized into domain-specific namespaces. All functions are also available at the top level for convenience.

| Namespace | Purpose | Key functions |
| --- | --- | --- |
| translit.security | Defense & safety analysis | normalize_confusables, is_confusable, is_mixed_script, is_safe_hostname, strip_bidi, security_clean |
| translit | Core transforms | transliterate, slugify, strip_obfuscation, Text, TextPipeline |
| translit.normalization | Unicode normalization | normalize, strip_accents, fold_case, collapse_whitespace |
| translit.files | Filename handling | sanitize_filename |
| translit.codec | Byte decoding | decode_to_utf8, detect_encoding |

# Namespace imports
from translit.security import is_confusable, security_clean
from translit.codec import decode_to_utf8
from translit.normalization import fold_case

# Top-level imports also work
from translit import is_confusable, security_clean, decode_to_utf8, fold_case

Language profiles

Built-in language profiles span the core and compatibility tiers, with scholarly ASCII Cyrillic support (strict_iso9; ISO 9-style digraphs, not the diacritic standard). Profiles apply sparse overrides on top of the default table (e.g. German maps ü → ue instead of the default u).

from translit import list_langs, transliterate

print(len(list_langs()))   # 83
print(list_langs())
#  ['am', 'ar', 'as', 'ban', 'bax', 'bg', 'bn', 'bo', 'bug', 'ca', 'chr',
#   'cjm', 'cop', 'cs', 'cy', 'da', 'de', 'dv', 'el', 'es', 'et', 'fa',
#   'fi', 'fr', 'ga', 'gu', 'he', 'hi', 'hr', 'hu', 'hy', 'is', 'it',
#   'ja', 'ja-kunrei', 'jv', 'ka', 'khb', 'km', 'kn', 'ko', 'lis', 'lo',
#   'lt', 'lv', 'ml', 'mn', 'mni', 'mr', 'mt', 'my', 'ne', 'nl', 'no',
#   'nod', 'nqo', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sa', 'sat', 'si',
#   'sk', 'sl', 'sq', 'sr', 'su', 'sv', 'syr', 'ta', 'tdd', 'te', 'th',
#   'tl', 'tr', 'tzm', 'uk', 'vai', 'vi', 'zh']

See Language support for the full registry, per-script policies, and tier classification.
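The "sparse overrides on top of the default table" idea can be sketched with `collections.ChainMap`, which consults the profile first and falls back to the default. Everything here is an assumption for illustration: the tables hold a few German letters only, and `toy_transliterate` is a hypothetical function, not the library's lookup code.

```python
from collections import ChainMap

# Toy default table (assumption): umlauts lose their diacritic.
DEFAULT = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U", "ß": "ss"}

# A profile lists only the entries that differ from the default.
PROFILES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"},
}


def toy_transliterate(text, lang=None):
    """Look each character up in the profile first, then in the default table."""
    table = ChainMap(PROFILES.get(lang, {}), DEFAULT)
    return "".join(table.get(ch, ch) for ch in text)


print(toy_transliterate("Ärger"))         # Arger
print(toy_transliterate("Ärger", "de"))   # Aerger
print(toy_transliterate("Straße", "de"))  # Strasse (ß comes from the default)
```

Keeping profiles sparse means a new language only needs to state its exceptions, and an unknown language code degrades to the default behavior.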

Performance

translit is compiled Rust that looks characters up in perfect hash tables built at compile time, so each lookup is O(1): no regex, no per-character Python iteration, no runtime data loading. Speed is a supporting benefit, not the headline; correctness and defense come first.

| Operation | Throughput | vs. legacy |
| --- | --- | --- |
| Transliterate (Latin) | ~450M chars/sec | ~38× faster than Unidecode |
| Transliterate (Cyrillic) | ~106M chars/sec | ~15× faster than Unidecode |
| Slugify | ~712K slugs/sec | ~10–24× faster than python-slugify |
| Batch transliterate (100 strings) | | ~2.8× faster than loop |

Figures are throughput on a commodity 4‑vCPU x86‑64 Linux runner (min‑of‑N perf_counter); they are hardware‑dependent and directional, not guarantees. Latin and batch numbers are conservative (both exceed the figure above on that hardware). See docs/performance.md for full benchmark methodology and results.
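A min-of-N measurement with `time.perf_counter` can be sketched as follows. This is an assumed shape for such a harness, not the project's benchmark script: `min_of_n` is a hypothetical helper, and `str.upper` stands in for the function under test so the example runs without translit installed.

```python
import time


def min_of_n(fn, n=5):
    """Run fn() n times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(n):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


text = "café " * 10_000

# Stand-in workload (assumption); swap in the call you want to measure.
elapsed = min_of_n(lambda: text.upper())
chars_per_sec = len(text) / elapsed if elapsed > 0 else float("inf")

print(f"{chars_per_sec:,.0f} chars/sec")
```

Taking the minimum rather than the mean filters out runs disturbed by scheduler noise, which is why the result should be read as a best-case figure for that machine.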

Drop-in replacement

translit provides compatibility aliases for painless migration from existing libraries:

from translit import unidecode, casefold, remove_accents

unidecode("café")        # → "cafe"       (alias for transliterate)
casefold("Straße")       # → "strasse"    (alias for fold_case)
remove_accents("café")   # → "cafe"       (alias for strip_accents)

sanitize_filename() also accepts replacement_text and max_len kwargs for pathvalidate compatibility, and is_confusable() accepts greedy for confusable_homoglyphs compatibility. See migration guides for details.

Security note: the unidecode alias is for coverage compatibility only. For security/defense use it is the wrong tool (phonetic mapping does not reverse homoglyph attacks and can degrade downstream accuracy). Use strip_obfuscation / normalize_confusables instead — see Migration from Unidecode.

Exhaustive testing

translit is exhaustively tested with three layers of machine-verifiable assurance beyond conventional unit and property-based tests:

  • Compile-time assertions: build.rs asserts all transliteration table values are ASCII and entry counts match expectations — if any check fails, cargo build fails
  • Exhaustive domain coverage: every Hangul syllable (11,172), every non-surrogate BMP code point (63,488), every code point in the CJK Unified Ideographs block (20,992), and every Indic script block are tested individually, with no sampling gaps
  • Stated invariants: Seven stated properties (ASCII passthrough, idempotence, determinism, output bounds, etc.) verified by exhaustive enumeration and Hypothesis

See docs/formal-verification.md for details.
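Exhaustive enumeration of an invariant can be sketched in a few lines. The example below walks every non-surrogate BMP code point and checks idempotence and ASCII passthrough. It uses NFKC normalization from the standard library as a stand-in transform, which is an assumption made so the sketch runs without translit; the project's own test suite is described in the document referenced above.

```python
import unicodedata


def stand_in(text):
    # Stand-in transform (assumption): NFKC from the standard library.
    return unicodedata.normalize("NFKC", text)


def bmp_scalar_values():
    """Yield every BMP code point except the surrogate range U+D800..U+DFFF."""
    for cp in range(0x10000):
        if not 0xD800 <= cp <= 0xDFFF:
            yield chr(cp)


checked = 0
violations = []
for ch in bmp_scalar_values():
    once = stand_in(ch)
    if stand_in(once) != once:         # idempotence: f(f(x)) == f(x)
        violations.append(("idempotence", ch))
    if ord(ch) < 0x80 and once != ch:  # ASCII passthrough
        violations.append(("ascii", ch))
    checked += 1

print(checked, len(violations))  # 63488 0
```

The count follows from 65,536 BMP code points minus 2,048 surrogates.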

Architecture

Rust core with compile-time PHF (perfect hash function) tables for O(1) per-character lookup. Exposed to Python via PyO3 with the stable ABI (abi3-py39). The Chinese pinyin table contains 20,924 entries from the Unicode Unihan database; Korean romanization is purely algorithmic (jamo decomposition, ~100 lines of Rust).
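The algorithmic approach to Korean relies on the arithmetic layout of the Hangul Syllables block: each syllable's code point encodes a lead consonant, a vowel, and an optional tail consonant. The Python sketch below illustrates that decomposition. It is not the library's Rust code, the function names are hypothetical, and the romanization tables are a simplified, context-free approximation of Revised Romanization that ignores sound-change rules between syllables.

```python
S_BASE = 0xAC00  # first Hangul syllable
V_COUNT, T_COUNT = 21, 28

# Simplified romanization tables (assumption for this sketch).
LEADS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
         "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
TAILS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
         "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"]


def decompose(ch):
    """Split one Hangul syllable into (lead, vowel, tail) indices."""
    index = ord(ch) - S_BASE
    if not 0 <= index < len(LEADS) * V_COUNT * T_COUNT:
        raise ValueError(f"not a Hangul syllable: {ch!r}")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    return lead, vowel, tail


def romanize(text):
    """Romanize syllable by syllable; other characters pass through unchanged."""
    out = []
    for ch in text:
        try:
            lead, vowel, tail = decompose(ch)
        except ValueError:
            out.append(ch)
        else:
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
    return " ".join(out)


print(len(LEADS) * len(VOWELS) * len(TAILS))  # 11172
print(decompose("서"))                          # (9, 4, 0)
print(romanize("서울"))                         # seo ul
```

The product 19 × 21 × 28 = 11,172 is the full set of precomposed Hangul syllables, which is why no lookup table is needed.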

Links

Source code https://github.com/raeq/translit
Releases https://github.com/raeq/translit/releases
PyPI package https://pypi.org/project/translit-rs/
Documentation https://translit.readthedocs.io/
Issue tracker https://github.com/raeq/translit/issues
Changelog https://github.com/raeq/translit/blob/main/CHANGELOG.md

License

MIT


Download files

Source Distribution

translit_rs-0.6.2.tar.gz (1.1 MB): Source

Built Distributions

translit_rs-0.6.2-cp39-abi3-win_amd64.whl (1.5 MB): CPython 3.9+, Windows x86-64

translit_rs-0.6.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB): CPython 3.9+, manylinux (glibc 2.17+), x86-64

translit_rs-0.6.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.7 MB): CPython 3.9+, manylinux (glibc 2.17+), ARM64

translit_rs-0.6.2-cp39-abi3-macosx_11_0_arm64.whl (1.5 MB): CPython 3.9+, macOS 11.0+, ARM64

translit_rs-0.6.2-cp39-abi3-macosx_10_12_x86_64.whl (1.6 MB): CPython 3.9+, macOS 10.12+, x86-64

File details

Details for the file translit_rs-0.6.2.tar.gz.

File metadata

  • Download URL: translit_rs-0.6.2.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for translit_rs-0.6.2.tar.gz
Algorithm Hash digest
SHA256 6197d6728871d78f04d943a613b7cf2aac2a6c2dcad3530590fb751d0f2f2e21
MD5 a9dc223d49659e25535c448581806427
BLAKE2b-256 86e30ba0adf3ad3a30af84b0844d64e28ef26f0923168aa55edda4c107933c5b
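A downloaded file can be checked against a published SHA256 digest with the standard library's `hashlib`. The sketch below is one way to do it; `sha256_of` and `matches` are hypothetical helper names, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("translit_rs-0.6.2.tar.gz", "6197d6728871d78f04d943a613b7cf2aac2a6c2dcad3530590fb751d0f2f2e21")` should return `True` for an intact copy of the source distribution listed above.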

Provenance

The following attestation bundles were made for translit_rs-0.6.2.tar.gz:

Publisher: publish.yml on raeq/translit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file translit_rs-0.6.2-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: translit_rs-0.6.2-cp39-abi3-win_amd64.whl
  • Upload date:
  • Size: 1.5 MB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for translit_rs-0.6.2-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 2c4f32eb1c89e9bac82d0c432739eed108f919bf0738190bfdf8ce6c17a688c4
MD5 b21656172168f87d1828b9b55a3d3dfd
BLAKE2b-256 ee6689df8091ec311bb02721d8b08722f4fe0d992ae155bfece08f4de754f18f

Provenance

The following attestation bundles were made for translit_rs-0.6.2-cp39-abi3-win_amd64.whl:

Publisher: publish.yml on raeq/translit

File details

Details for the file translit_rs-0.6.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for translit_rs-0.6.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 3deae0d1ffeae6709957eb4479b0c7e8f92db93d395a1a3b9c21cccf6abe8e07
MD5 40de609c492e111bb9aa8decffd83789
BLAKE2b-256 baa4aa33f62175cae549b0ab99a07da54bee10ba641c9f8839343dcf4f00fe7a

Provenance

The following attestation bundles were made for translit_rs-0.6.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish.yml on raeq/translit

File details

Details for the file translit_rs-0.6.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for translit_rs-0.6.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 7c13ba851bfc878ea1a8e8b9b95533b437dfb1296b9b30b0521035082f90af22
MD5 0d206c52fd95e7bc815c1877ef562398
BLAKE2b-256 359747ab9bf1ad2474b908402afcfd67b088622fda8a00150552e558c7008dc9

Provenance

The following attestation bundles were made for translit_rs-0.6.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: publish.yml on raeq/translit

File details

Details for the file translit_rs-0.6.2-cp39-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for translit_rs-0.6.2-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7fbbbd2583956d4afb1f8c5974179899d18da4d2036b9123df1bfef06d4cec69
MD5 380dffdb8d46df8c555228023d949a40
BLAKE2b-256 365dddd61413677fa19d07baecf5b9fad6cfa82c761fb0bad8968dc68857f196

Provenance

The following attestation bundles were made for translit_rs-0.6.2-cp39-abi3-macosx_11_0_arm64.whl:

Publisher: publish.yml on raeq/translit

File details

Details for the file translit_rs-0.6.2-cp39-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for translit_rs-0.6.2-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 f9c26cf6da87d36e79ab75122fe857692f6df8bc62cc013c1b46f78861a2fadc
MD5 f03c4a07b290d0f69e2d68f638a4ba86
BLAKE2b-256 8a4748d196c3a020f7260fb8aa152f2a6e02726ba9f17665e8327af9485ea7a2

Provenance

The following attestation bundles were made for translit_rs-0.6.2-cp39-abi3-macosx_10_12_x86_64.whl:

Publisher: publish.yml on raeq/translit
