zhconv as in MediaWiki, 🦀oxidized for more efficiency (with OpenCC dicts)

zhconv-rs: 中文简繁及地區詞轉換 (Chinese Simplified/Traditional and regional phrase conversion)

zhconv-rs converts Chinese between Traditional, Simplified and regional variants, using rulesets sourced from MediaWiki/Wikipedia and OpenCC, which are merged, flattened and prebuilt into Aho‑Corasick automata for single-pass, linear-time conversions.

🔗 Web app (wasm): https://zhconv.pages.dev

⚙️ CLI: cargo install zhconv or download from releases

🦀 Rust crate: cargo add zhconv (see docs for details)

use zhconv::{zhconv, Variant};
assert_eq!(zhconv("雾失楼台,月迷津渡", Variant::ZhTW), "霧失樓臺,月迷津渡");
assert_eq!(zhconv("驛寄梅花,魚傳尺素", "zh-Hans".parse().unwrap()), "驿寄梅花,鱼传尺素");

🐍 Python package: pip install zhconv-rs or pip install zhconv-rs-opencc (for additional OpenCC dictionaries)

from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
More usage
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"

# Customize conversion tables:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"

import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画")) # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"
Deploy to Cloudflare Workers

🧩 API demo: https://zhconv.bamboo.workers.dev

Node.js package: npm install zhconv or yarn add zhconv

JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest

HTML snippet
<script type="module">
    // Use ES module import syntax to import functionality from the module
    // that we have compiled.
    //
    // Note that the `default` import is an initialization function which
    // will "boot" the module and make it ready to use. Currently browsers
    // don't support natively imported WebAssembly as an ES module, but
    // eventually the manual initialization won't be required!
    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod

    async function run() {
        await init();

        alert(zhconv(prompt("Text to convert to zh-hans:"), "zh-hans"));
    }

    run();
</script>

Variants and conversion tables

Unlike OpenCC, whose conversions are defined per source-target pair (e.g., s2t, tw2s), zhconv-rs follows MediaWiki’s approach and provides one conversion table per target variant, independent of the source variant:

zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY
Target | Tag | Script | Description
Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases.
Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases.
Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted.
Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted.
Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now.
Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted.
Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now.
Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now.

Note: zh-TW and zh-HK are derived from zh-Hant. zh-CN is derived from zh-Hans. Currently, zh-MO shares the same dictionary as zh-HK, and zh-MY/zh-SG share the same dictionary as zh-CN, unless additional rules are provided.

Chained dictionary groups from OpenCC are flattened and merged with the MediaWiki conversion table for each target variant, then compiled into an Aho-Corasick automaton at compile time. After internal compression, the bundled conversion tables and automata occupy ~0.6 MiB (with only MediaWiki enabled) or ~2.7 MiB (with both MediaWiki and OpenCC enabled).
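To make the flattening step concrete, here is a minimal pure-Python sketch. It is not the crate's build code: the helper names (`convert`, `flatten_chain`), the tiny sample dictionaries, and the choice to let MediaWiki entries override OpenCC entries on merge are all assumptions made for illustration.

```python
def convert(text, table):
    """Leftmost-longest replacement using a single table (brute force)."""
    max_len = max(map(len, table), default=0)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in table:
                out.append(table[text[i:i + n]])
                i += n
                break
        else:
            out.append(text[i])  # no rule starts here
            i += 1
    return "".join(out)


def flatten_chain(stages):
    """Compose a chain of dictionaries into one source -> final-target table."""
    flat = {}
    for index, stage in enumerate(stages):
        for source in stage:
            word = source
            # Push the source word through this stage and every later one.
            for later in stages[index:]:
                word = convert(word, later)
            flat.setdefault(source, word)
    return flat


# Hypothetical two-stage chain: Simplified -> Traditional -> Taiwan phrases.
s2t = {"鼠标": "鼠標", "标": "標"}
t2tw = {"鼠標": "滑鼠"}
opencc_flat = flatten_chain([s2t, t2tw])

# Hypothetical MediaWiki-style table; merge precedence here is an assumption.
mediawiki = {"软件": "軟體"}
merged = {**opencc_flat, **mediawiki}

print(opencc_flat)                    # {'鼠标': '滑鼠', '标': '標', '鼠標': '滑鼠'}
print(convert("鼠标和软件", merged))   # 滑鼠和軟體
```

Once flattened, a single table lookup replaces the multi-stage chain, which is what allows one automaton per target variant.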

Performance

Even with all rulesets enabled, zhconv-rs remains faster than most alternatives. Check with cargo bench compare --features bench,mediawiki,opencc:

[Figures: comparison with other crates, targeting zh-Hans; comparison with other crates, targeting zh-TW]

Conversion runs in a single pass in O(n+m) linear time by default, where n is the length of the input text and m is the maximum length of a source word in the conversion tables, regardless of which rulesets are enabled. When converting wikitext that contains MediaWiki conversion rules, the time complexity may degrade to O(n*m) in the worst case, but only if the corresponding function or flag is explicitly chosen.

On a typical modern PC, prebuilt converters load in a few milliseconds with default features (~2–5 ms). Enabling the optional opencc feature increases load time (typically 20–25 ms per target). Throughput generally ranges from 100–200 MB/s.

cargo bench base --features bench on AMD EPYC 7B13 (GitPod) with v0.3:

Using conversion tables sourced from MediaWiki by default
load/zh2Hant            time:   [4.6368 ms 4.6862 ms 4.7595 ms]
load/zh2Hans            time:   [2.2670 ms 2.2891 ms 2.3138 ms]
load/zh2TW              time:   [4.7115 ms 4.7543 ms 4.8001 ms]
load/zh2HK              time:   [5.4438 ms 5.5474 ms 5.6573 ms]
load/zh2MO              time:   [4.9503 ms 4.9673 ms 4.9850 ms]
load/zh2CN              time:   [3.0809 ms 3.1046 ms 3.1323 ms]
load/zh2SG              time:   [3.0543 ms 3.0637 ms 3.0737 ms]
load/zh2MY              time:   [3.0514 ms 3.0640 ms 3.0787 ms]
zh2CN wikitext basic    time:   [385.95 µs 388.53 µs 391.39 µs]
zh2TW wikitext basic    time:   [393.70 µs 395.16 µs 396.89 µs]
zh2TW wikitext extended time:   [1.5105 ms 1.5186 ms 1.5271 ms]
zh2CN 天乾物燥          time:   [46.970 ns 47.312 ns 47.721 ns]
zh2TW data54k           time:   [200.72 µs 201.54 µs 202.41 µs]
zh2CN data54k           time:   [231.55 µs 232.86 µs 234.30 µs]
zh2Hant data689k        time:   [2.0330 ms 2.0513 ms 2.0745 ms]
zh2TW data689k          time:   [1.9710 ms 1.9790 ms 1.9881 ms]
zh2Hant data3185k       time:   [15.199 ms 15.260 ms 15.332 ms]
zh2TW data3185k         time:   [15.346 ms 15.464 ms 15.629 ms]
zh2TW data55m           time:   [329.54 ms 330.53 ms 331.58 ms]
is_hans data55k         time:   [404.73 µs 407.11 µs 409.59 µs]
infer_variant data55k   time:   [1.0468 ms 1.0515 ms 1.0570 ms]
is_hans data3185k       time:   [22.442 ms 22.589 ms 22.757 ms]
infer_variant data3185k time:   [60.205 ms 60.412 ms 60.627 ms]
Using conversion tables derived from OpenCC additionally (`--features opencc`)
load/zh2Hant            time:   [22.074 ms 22.338 ms 22.624 ms]
load/zh2Hans            time:   [2.7913 ms 2.8126 ms 2.8355 ms]
load/zh2TW              time:   [23.068 ms 23.286 ms 23.520 ms]
load/zh2HK              time:   [23.358 ms 23.630 ms 23.929 ms]
load/zh2MO              time:   [23.363 ms 23.627 ms 23.913 ms]
load/zh2CN              time:   [3.6778 ms 3.7222 ms 3.7722 ms]
load/zh2SG              time:   [3.6522 ms 3.6848 ms 3.7202 ms]
load/zh2MY              time:   [3.6642 ms 3.7079 ms 3.7545 ms]
zh2CN wikitext basic    time:   [396.17 µs 402.51 µs 409.36 µs]
zh2TW wikitext basic    time:   [442.16 µs 447.53 µs 453.27 µs]
zh2TW wikitext extended time:   [1.5795 ms 1.6007 ms 1.6233 ms]
zh2CN 天乾物燥          time:   [47.884 ns 48.878 ns 49.953 ns]
zh2TW data54k           time:   [255.25 µs 259.01 µs 262.92 µs]
zh2CN data54k           time:   [233.74 µs 236.99 µs 240.67 µs]
zh2Hant data689k        time:   [3.9696 ms 4.0005 ms 4.0327 ms]
zh2TW data689k          time:   [3.4593 ms 3.4896 ms 3.5203 ms]
zh2Hant data3185k       time:   [27.710 ms 27.955 ms 28.206 ms]
zh2TW data3185k         time:   [30.298 ms 30.858 ms 31.428 ms]
zh2TW data55m           time:   [500.95 ms 515.80 ms 531.34 ms]
is_hans data55k         time:   [461.22 µs 470.99 µs 481.20 µs]
infer_variant data55k   time:   [1.1669 ms 1.1759 ms 1.1852 ms]
is_hans data3185k       time:   [26.609 ms 26.964 ms 27.385 ms]
infer_variant data3185k time:   [74.878 ms 76.262 ms 77.818 ms]
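
As a rough sanity check of the throughput range quoted earlier, the largest benchmark can be turned into MB/s. This sketch assumes the `data55m` input is about 55 MB, which is inferred from its name only and is not stated in the results; the timings are the median values from the two `zh2TW data55m` rows.

```python
def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Throughput as input size divided by wall-clock time."""
    return size_mb / seconds


SIZE_MB = 55  # assumption: inferred from the benchmark name "data55m"

mediawiki_only = throughput_mb_per_s(SIZE_MB, 0.33053)  # 330.53 ms
with_opencc = throughput_mb_per_s(SIZE_MB, 0.51580)     # 515.80 ms

print(f"{mediawiki_only:.0f} MB/s, {with_opencc:.0f} MB/s")  # 166 MB/s, 107 MB/s
```

Under that size assumption, both configurations fall inside the 100–200 MB/s range.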

Limitations

Accuracy

Rule-based converters cannot capture every possible linguistic nuance. Like most others, the implementation employs a leftmost-longest matching strategy (a.k.a. forward maximum matching), prioritizing the earliest and longest matches in the text. For example, if a ruleset contains 干 → 幹, 天干 → 天干, and 天干物燥 → 天乾物燥, the converter will apply the longest rule, 天干物燥 → 天乾物燥, to the text 天干物燥, since that match starts earliest and spans the most characters. This generally works well but may cause occasional mis-conversions.
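
The matching strategy can be illustrated with a small pure-Python sketch. It is a brute-force version of leftmost-longest matching, not the Aho-Corasick automaton the library actually uses, and the function name `convert_leftmost_longest` is hypothetical; the rules are the three from the example above.

```python
def convert_leftmost_longest(text: str, rules: dict[str, str]) -> str:
    """Forward maximum matching: at each position, apply the longest matching rule."""
    max_len = max((len(src) for src in rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a rule matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in rules:
                out.append(rules[piece])
                i += length
                break
        else:
            # No rule starts at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


rules = {"干": "幹", "天干": "天干", "天干物燥": "天乾物燥"}
print(convert_leftmost_longest("天干物燥", rules))  # 天乾物燥
print(convert_leftmost_longest("天干地支", rules))  # 天干地支
print(convert_leftmost_longest("干部", rules))      # 幹部
```

The same sketch also shows how mis-conversions arise: a table holding only these three rules would turn 干净 into 幹净 rather than 乾淨, because the single-character rule is the only one that matches.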

Wikitext support

The implementation supports most MediaWiki conversion syntax, but it is not fully compliant with the original MediaWiki implementation.

Since rebuilding automata dynamically is impractical, conversion rules embedded in the text (e.g., -{H|zh-hans:鹿;zh-hant:马}- in MediaWiki syntax) are extracted in a first pass, a temporary automaton is constructed, and the text is converted in a second pass. In the worst case, the time complexity may degrade to O(n*m), where n is the input text length and m is the maximum length of source words in the dictionaries, which is equivalent to a brute-force approach.
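
The two-pass idea can be sketched in pure Python as follows. This is a simplification under stated assumptions: it handles only hidden global rules of the form -{H|...}-, uses a brute-force matcher in place of a temporary automaton, and the names `convert_wikitext` and `longest_match_convert` as well as the tiny base table are hypothetical.

```python
import re

# Matches hidden global rules such as -{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;}-
RULE = re.compile(r"-\{H\|(.*?)\}-")


def longest_match_convert(text, table):
    """Brute-force leftmost-longest replacement (stand-in for an automaton)."""
    max_len = max(map(len, table), default=0)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in table:
                out.append(table[text[i:i + n]])
                i += n
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def convert_wikitext(text, target, base_table):
    table = dict(base_table)
    # Pass 1: collect inline rules; map every listed form to the target's form.
    for match in RULE.finditer(text):
        items = (item.split(":", 1) for item in match.group(1).split(";") if ":" in item)
        forms = {tag.strip().lower(): word.strip() for tag, word in items}
        if target in forms:
            for word in forms.values():
                table[word] = forms[target]
    # Pass 2: drop the rule markup, then convert with the temporary table.
    return longest_match_convert(RULE.sub("", text), table)


base = {"查尔斯": "查爾斯"}  # hypothetical stand-in for the bundled zh-TW table
text = "-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。"
print(convert_wikitext(text, "zh-tw", base))  # 《孤雛淚》是查爾斯·狄更斯的作品。
```

Because the temporary table depends on the input text, it cannot be prebuilt, which is why this path is slower than plain conversion.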

License

The library itself is licensed under MIT OR Apache-2.0, at the licensee’s option. However, it may bundle:

  • Conversion tables from MediaWiki (the default, gated by the feature mediawiki) which are licensed under GPL-2.0-or-later.
  • Dictionaries from OpenCC (gated by the feature opencc) licensed under Apache-2.0.

To make the library MIT-compatible, disable the default mediawiki feature and enable the opencc feature for prebuilt converters & conversion tables.

Credits

Rulesets: MediaWiki and OpenCC.

Fast double-array Aho-Corasick automata implementation in Rust: daachorse
