
zhconv as in MediaWiki, 🦀oxidized for more efficiency


zhconv-rs: Simplified/Traditional and regional-phrase conversion for Chinese (中文简繁及地區詞轉換)

zhconv-rs converts Chinese between Traditional, Simplified and regional variants, using rulesets sourced from MediaWiki/Wikipedia and OpenCC, which are merged, flattened and prebuilt into Aho‑Corasick automata for single-pass, linear-time conversions.

🔗 Web app (wasm): https://zhconv.pages.dev

⚙️ CLI: cargo install zhconv or download from releases

🦀 Rust crate: cargo add zhconv (see docs for details)

use zhconv::{zhconv, Variant};
assert_eq!(zhconv("雾失楼台,月迷津渡", Variant::ZhTW), "霧失樓臺,月迷津渡");
assert_eq!(zhconv("驛寄梅花,魚傳尺素", "zh-Hans".parse().unwrap()), "驿寄梅花,鱼传尺素");

🐍 Python package: pip install zhconv-rs or pip install zhconv-rs-opencc (for additional OpenCC dictionaries)

from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"

# More usage:
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"

# Customize conversion tables:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"

import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画")) # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"

🧩 API demo: https://zhconv.bamboo.workers.dev (deployable to Cloudflare Workers)

Node.js package: npm install zhconv or yarn add zhconv

JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest

HTML snippet
<script type="module">
    // Use ES module import syntax to import functionality from the module
    // that we have compiled.
    //
    // Note that the `default` import is an initialization function which
    // will "boot" the module and make it ready to use. Currently browsers
    // don't support natively imported WebAssembly as an ES module, but
    // eventually the manual initialization won't be required!
    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod

    async function run() {
        await init();

        alert(zhconv(prompt("Text to convert to zh-hans:"), "zh-hans"));
    }

    run();
</script>

Variants and conversion tables

Unlike OpenCC, whose configurations each name both a source and a target (e.g., s2t, tw2s), zhconv-rs follows MediaWiki’s approach and provides one conversion table per target variant:

zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY

| Target | Tag | Script | Description |
| --- | --- | --- | --- |
| Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases. |
| Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases. |
| Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted. |
| Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted. |
| Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now. |
| Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted. |
| Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now. |
| Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now. |

Note: zh-TW and zh-HK are derived from zh-Hant. zh-CN is derived from zh-Hans. Currently, zh-MO shares the same dictionary as zh-HK, and zh-MY/zh-SG share the same dictionary as zh-CN, unless additional rules are provided.
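
One way to picture the derivation described in the note is as layering: a regional table starts from its parent and then overrides or adds entries. The Python sketch below only illustrates that idea. The `PARENT` map, the `lineage` and `build_table` helpers, and the tiny `RULES` tables are hypothetical and are not the crate's actual data structures or bundled rules.

```python
# Hypothetical parent links, following the note above.
PARENT = {
    "zh-TW": "zh-Hant",
    "zh-HK": "zh-Hant",
    "zh-MO": "zh-HK",
    "zh-CN": "zh-Hans",
    "zh-SG": "zh-CN",
    "zh-MY": "zh-CN",
}

def lineage(variant):
    """Return the chain from the script-level base down to `variant`."""
    chain = [variant]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain[::-1]

def build_table(variant, rules):
    """Merge rules base-first so that more specific variants win."""
    table = {}
    for v in lineage(variant):
        table.update(rules.get(v, {}))
    return table

# Illustrative entries only, not taken from the bundled rulesets.
RULES = {
    "zh-Hant": {"软": "軟", "体": "體"},
    "zh-TW": {"软件": "軟體"},
}

print(lineage("zh-MO"))             # ['zh-Hant', 'zh-HK', 'zh-MO']
print(build_table("zh-TW", RULES))  # base entries plus the zh-TW phrase
# With no extra rules, zh-MO resolves to the same table as zh-HK.
print(build_table("zh-MO", RULES) == build_table("zh-HK", RULES))  # True
```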

Chained dictionary groups from OpenCC are flattened and merged with the MediaWiki conversion table for each target variant, then compiled into an Aho-Corasick automaton at compile time. After internal compression, the bundled conversion tables and automata occupy ~0.6 MiB (with only MediaWiki enabled) or ~2.7 MiB (with both MediaWiki and OpenCC enabled).
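
To make "flattening" concrete, here is a minimal Python sketch that collapses a two-stage chain into a single source-to-final-target table. It is an illustration under simplifying assumptions, not the crate's build step: the `convert` and `flatten` helpers and the two tiny tables are hypothetical, and the sketch only handles later-stage phrases that fall inside a single earlier-stage target.

```python
def convert(text, table):
    """Leftmost-longest replacement using a plain dict (brute force)."""
    longest = max(map(len, table), default=0)
    out, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in table:
                out.append(table[piece])
                i += size
                break
        else:
            out.append(text[i])  # no entry starts at this position
            i += 1
    return "".join(out)

def flatten(stages):
    """Collapse a chain of tables into one source -> final-target table."""
    flat = {}
    for k, stage in enumerate(stages):
        for source, target in stage.items():
            # Push the target through every later stage of the chain.
            for later in stages[k + 1:]:
                target = convert(target, later)
            # Earlier stages take precedence for a duplicate source.
            flat.setdefault(source, target)
    return flat

# Illustrative chain: script conversion, then a regional phrase stage.
s2t = {"软件": "軟件", "鼠标": "鼠標"}
t2tw = {"軟件": "軟體", "鼠標": "滑鼠"}

flat = flatten([s2t, t2tw])
text = "软件和鼠标"
print(convert(text, flat))                # 軟體和滑鼠
print(convert(convert(text, s2t), t2tw))  # same result in two passes
```

The flattened table gives the two-pass result in one pass, which is what allows a single automaton per target variant.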

Performance

Even with all rulesets enabled, zhconv-rs remains faster than most alternatives. Check with cargo bench compare --features bench,mediawiki,opencc:

[Figures: comparison with other crates, targeting zh-Hans and zh-TW]

By default, conversion runs in a single pass in O(n+m) time, where n is the length of the input text and m is the maximum length of a source word in the conversion tables, regardless of which rulesets are enabled. When converting wikitext that contains MediaWiki conversion rules, which requires explicitly choosing the corresponding function or flag, the time complexity may degrade to O(n*m) in the worst case.

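On a typical modern PC, prebuilt converters load in a few milliseconds with default features (roughly 2–5 ms). Enabling the optional opencc feature raises load time mainly for Traditional targets (about 22–24 ms each for zh-Hant, zh-TW, zh-HK and zh-MO in the benchmark below), while Simplified targets stay under 4 ms. Throughput generally ranges from 100 to 200 MB/s.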

Results of cargo bench base --features bench on AMD EPYC 7B13 (GitPod) with v0.3:

Using conversion tables sourced from MediaWiki by default
load/zh2Hant            time:   [4.6368 ms 4.6862 ms 4.7595 ms]
load/zh2Hans            time:   [2.2670 ms 2.2891 ms 2.3138 ms]
load/zh2TW              time:   [4.7115 ms 4.7543 ms 4.8001 ms]
load/zh2HK              time:   [5.4438 ms 5.5474 ms 5.6573 ms]
load/zh2MO              time:   [4.9503 ms 4.9673 ms 4.9850 ms]
load/zh2CN              time:   [3.0809 ms 3.1046 ms 3.1323 ms]
load/zh2SG              time:   [3.0543 ms 3.0637 ms 3.0737 ms]
load/zh2MY              time:   [3.0514 ms 3.0640 ms 3.0787 ms]
zh2CN wikitext basic    time:   [385.95 µs 388.53 µs 391.39 µs]
zh2TW wikitext basic    time:   [393.70 µs 395.16 µs 396.89 µs]
zh2TW wikitext extended time:   [1.5105 ms 1.5186 ms 1.5271 ms]
zh2CN 天乾物燥          time:   [46.970 ns 47.312 ns 47.721 ns]
zh2TW data54k           time:   [200.72 µs 201.54 µs 202.41 µs]
zh2CN data54k           time:   [231.55 µs 232.86 µs 234.30 µs]
zh2Hant data689k        time:   [2.0330 ms 2.0513 ms 2.0745 ms]
zh2TW data689k          time:   [1.9710 ms 1.9790 ms 1.9881 ms]
zh2Hant data3185k       time:   [15.199 ms 15.260 ms 15.332 ms]
zh2TW data3185k         time:   [15.346 ms 15.464 ms 15.629 ms]
zh2TW data55m           time:   [329.54 ms 330.53 ms 331.58 ms]
is_hans data55k         time:   [404.73 µs 407.11 µs 409.59 µs]
infer_variant data55k   time:   [1.0468 ms 1.0515 ms 1.0570 ms]
is_hans data3185k       time:   [22.442 ms 22.589 ms 22.757 ms]
infer_variant data3185k time:   [60.205 ms 60.412 ms 60.627 ms]
Using conversion tables derived from OpenCC additionally (`--features opencc`)
load/zh2Hant            time:   [22.074 ms 22.338 ms 22.624 ms]
load/zh2Hans            time:   [2.7913 ms 2.8126 ms 2.8355 ms]
load/zh2TW              time:   [23.068 ms 23.286 ms 23.520 ms]
load/zh2HK              time:   [23.358 ms 23.630 ms 23.929 ms]
load/zh2MO              time:   [23.363 ms 23.627 ms 23.913 ms]
load/zh2CN              time:   [3.6778 ms 3.7222 ms 3.7722 ms]
load/zh2SG              time:   [3.6522 ms 3.6848 ms 3.7202 ms]
load/zh2MY              time:   [3.6642 ms 3.7079 ms 3.7545 ms]
zh2CN wikitext basic    time:   [396.17 µs 402.51 µs 409.36 µs]
zh2TW wikitext basic    time:   [442.16 µs 447.53 µs 453.27 µs]
zh2TW wikitext extended time:   [1.5795 ms 1.6007 ms 1.6233 ms]
zh2CN 天乾物燥          time:   [47.884 ns 48.878 ns 49.953 ns]
zh2TW data54k           time:   [255.25 µs 259.01 µs 262.92 µs]
zh2CN data54k           time:   [233.74 µs 236.99 µs 240.67 µs]
zh2Hant data689k        time:   [3.9696 ms 4.0005 ms 4.0327 ms]
zh2TW data689k          time:   [3.4593 ms 3.4896 ms 3.5203 ms]
zh2Hant data3185k       time:   [27.710 ms 27.955 ms 28.206 ms]
zh2TW data3185k         time:   [30.298 ms 30.858 ms 31.428 ms]
zh2TW data55m           time:   [500.95 ms 515.80 ms 531.34 ms]
is_hans data55k         time:   [461.22 µs 470.99 µs 481.20 µs]
infer_variant data55k   time:   [1.1669 ms 1.1759 ms 1.1852 ms]
is_hans data3185k       time:   [26.609 ms 26.964 ms 27.385 ms]
infer_variant data3185k time:   [74.878 ms 76.262 ms 77.818 ms]
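
The throughput range quoted earlier can be roughly cross-checked against the listing above. The short Python calculation below uses the median timings of four zh2TW runs. It rests on an assumption: that the benchmark names data3185k and data55m denote inputs of about 3185 kB and 55 MB, which the listing itself does not state.

```python
# Median timings copied from the listing above, in seconds.
# ASSUMPTION: "data3185k" and "data55m" mean roughly 3185 kB and 55 MB
# of input text; the exact sizes are not given in the listing.
SIZES = {"data3185k": 3185e3, "data55m": 55e6}
MEDIANS = {
    ("mediawiki", "zh2TW data3185k"): 15.464e-3,
    ("mediawiki", "zh2TW data55m"): 330.53e-3,
    ("mediawiki+opencc", "zh2TW data3185k"): 30.858e-3,
    ("mediawiki+opencc", "zh2TW data55m"): 515.80e-3,
}

def throughput_mb_s(n_bytes, seconds):
    """Bytes processed per second, expressed in MB/s (10**6 bytes)."""
    return n_bytes / seconds / 1e6

for (features, bench), seconds in MEDIANS.items():
    size = SIZES[bench.split()[1]]
    rate = throughput_mb_s(size, seconds)
    print(f"{features:17} {bench:16} {rate:6.1f} MB/s")
```

Under that assumption, the MediaWiki-only runs land near the upper end of the quoted range and the runs with OpenCC enabled near the lower end.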

Limitations

Accuracy

Rule-based converters cannot capture every linguistic nuance. Like most others, this implementation uses a leftmost-longest matching strategy (a.k.a. forward maximum matching), which prioritizes the earliest match in the text and, among matches starting at the same position, the longest one. For example, if a ruleset contains the rules 干 → 幹, 天干 → 天干, and 天干物燥 → 天乾物燥, the converter applies the longest rule to the input 天干物燥 and outputs 天乾物燥, because that rule starts at the same position as the shorter candidates and spans more characters. This generally works well but may cause occasional mis-conversions.
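
The strategy can be sketched in a few lines of Python. This is a brute-force illustration of leftmost-longest matching using the three rules from the example, not the Aho-Corasick automaton the library uses. The `convert` helper is hypothetical, and it runs in O(n*m) rather than linear time.

```python
def convert(text, rules):
    """Replace leftmost-longest matches; unmatched characters pass through."""
    longest = max(map(len, rules), default=0)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in rules:
                out.append(rules[piece])
                i += size
                break
        else:
            out.append(text[i])  # no rule starts here
            i += 1
    return "".join(out)

RULES = {"干": "幹", "天干": "天干", "天干物燥": "天乾物燥"}

print(convert("天干物燥", RULES))    # 天乾物燥 (longest rule wins)
print(convert("天干地支", RULES))    # 天干地支 (天干 kept as is)
print(convert("干活", RULES))        # 幹活 (single-character fallback)
print(convert("今天干什么", RULES))  # 今天干什么 (天干 matches across a word boundary)
```

The last call shows the kind of mis-conversion mentioned above: 天干 is matched across the boundary between 今天 and 干, so the verb 干 is never turned into 幹.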

Wikitext support

The implementation supports most MediaWiki conversion syntax, but it is not fully compliant with the original MediaWiki implementation.

Since rebuilding automata dynamically is impractical, rules embedded in the text (e.g., -{H|zh-hans:鹿;zh-hant:马}- in MediaWiki syntax) are extracted in a first pass, a temporary automaton is constructed from them, and the text is converted in a second pass. The time complexity may degrade to O(n*m) in the worst case, where n is the input text length and m is the maximum length of source words in the dictionaries, which is equivalent to a brute-force approach.
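
The two-pass idea can be sketched in Python as follows. This is a simplified illustration that handles only global `-{H|...}-` rules and uses a brute-force matcher in place of a temporary automaton. The `convert` and `convert_wikitext` helpers and the `base_rules` parameter are hypothetical names, not part of the library's API.

```python
import re

# Matches a global rule such as -{H|zh-cn:...;zh-tw:...;}-
RULE = re.compile(r"-\{H\|(.*?)\}-")

def convert(text, rules):
    """Brute-force leftmost-longest replacement with a plain dict."""
    longest = max(map(len, rules), default=0)
    out, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in rules:
                out.append(rules[piece])
                i += size
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

def convert_wikitext(text, target, base_rules=None):
    rules = dict(base_rules or {})
    # Pass 1: collect the H-flagged rules that apply to `target`.
    for body in RULE.findall(text):
        forms = {}
        for item in body.split(";"):
            if ":" in item:
                variant, form = item.split(":", 1)
                forms[variant.strip()] = form.strip()
        if target in forms:
            for form in forms.values():
                rules[form] = forms[target]
    # Pass 2: drop the rule markup, then convert with the combined rules.
    return convert(RULE.sub("", text), rules)

src = ("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;"
       "zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。")
print(convert_wikitext(src, "zh-tw"))
# 《孤雛淚》是查尔斯·狄更斯的作品。
```

No base table is supplied in this sketch, so only the title covered by the extracted rule changes. The rest of the sentence stays in Simplified Chinese.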

License

The library itself is licensed under MIT OR Apache-2.0, at the licensee’s option. However, it may bundle:

  • Conversion tables from MediaWiki (the default, gated by the feature mediawiki) which are licensed under GPL-2.0-or-later.
  • Dictionaries from OpenCC (gated by the feature opencc) licensed under Apache-2.0.

To make the library MIT-compatible, disable the default mediawiki feature and enable the opencc feature for prebuilt converters & conversion tables.

Credits

Rulesets: MediaWiki and OpenCC.

Fast double-array Aho-Corasick automata implementation in Rust: daachorse

References & related implementations:
