zhconv as in MediaWiki, 🦀oxidized for more efficiency (with OpenCC dicts)

zhconv-rs: Simplified/Traditional Chinese and regional phrase conversion (中文简繁及地區詞轉換)

zhconv-rs converts Chinese between Traditional, Simplified and regional variants, using rulesets sourced from MediaWiki/Wikipedia and OpenCC, which are merged, flattened and prebuilt into Aho‑Corasick automata for single-pass, linear-time conversions.

🔗 Web app (wasm): https://zhconv.pages.dev (w/ OpenCC dictionaries)

⚙️ CLI: cargo install zhconv or download from releases

🦀 Rust crate: cargo add zhconv (see docs for details)

use zhconv::{zhconv, Variant};
assert_eq!(zhconv("雾失楼台,月迷津渡", Variant::ZhTW), "霧失樓臺,月迷津渡");
assert_eq!(zhconv("驛寄梅花,魚傳尺素", "zh-Hans".parse().unwrap()), "驿寄梅花,鱼传尺素");

🐍 Python package w/ wheels: pip install zhconv-rs or pip install zhconv-rs-opencc (for OpenCC dictionaries)

Python snippet
# > pip install zhconv_rs
# Convert using the built-in rulesets:
from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"

# Convert using custom rules:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"

import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画")) # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"

🧩 API demo: https://zhconv.bamboo.workers.dev

Node.js package: npm install zhconv or yarn add zhconv

JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest

HTML snippet
<script type="module">
    // Use ES module import syntax to import functionality from the module
    // that we have compiled.
    //
    // Note that the `default` import is an initialization function which
    // will "boot" the module and make it ready to use. Currently browsers
    // don't support natively imported WebAssembly as an ES module, but
    // eventually the manual initialization won't be required!
    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod

    async function run() {
        await init();

        alert(zhconv(prompt("Text to convert to zh-hans:"), "zh-hans"));
    }

    run();
</script>

Variants and dictionaries

Unlike OpenCC, whose dictionaries are bidirectional (e.g., s2t, tw2s), zhconv-rs follows MediaWiki’s approach and provides one dictionary per target variant:

zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY
| Target | Tag | Script | Description |
| --- | --- | --- | --- |
| Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases. |
| Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases. |
| Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted. |
| Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted. |
| Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now. |
| Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted. |
| Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now. |
| Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now. |

Note: zh-TW and zh-HK are derived from zh-Hant. zh-CN is derived from zh-Hans. Currently, zh-MO shares the same dictionary as zh-HK, and zh-MY/zh-SG share the same dictionary as zh-CN, unless additional rules are provided.
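
The derivation described in the note can be pictured as a small lookup. The sketch below is only a restatement of that note in Python; the table names and the `dictionary_layers` helper are hypothetical and are not part of the zhconv-rs API.

```python
# Hypothetical tables restating the note above; not taken from zhconv-rs.
SCRIPT_BASE = {"zh-tw": "zh-hant", "zh-hk": "zh-hant", "zh-cn": "zh-hans"}
SHARED_WITH = {"zh-mo": "zh-hk", "zh-sg": "zh-cn", "zh-my": "zh-cn"}
SCRIPT_ONLY = {"zh-hant", "zh-hans"}


def dictionary_layers(tag):
    """Return the dictionaries that apply to `tag`, most general first."""
    tag = tag.lower()
    if tag not in SCRIPT_BASE and tag not in SHARED_WITH and tag not in SCRIPT_ONLY:
        raise ValueError(f"unsupported variant: {tag}")
    tag = SHARED_WITH.get(tag, tag)   # zh-MO reuses zh-HK, zh-SG/zh-MY reuse zh-CN
    base = SCRIPT_BASE.get(tag)       # regional variants build on a script-level base
    return [base, tag] if base else [tag]


print(dictionary_layers("zh-MO"))    # ['zh-hant', 'zh-hk']
print(dictionary_layers("zh-Hans"))  # ['zh-hans']
```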

Chained dictionary groups from OpenCC are flattened and merged with MediaWiki dictionaries for each target variant, then compiled into a single Aho-Corasick automaton at build time. After internal compression, the bundled dictionaries and automata occupy ~0.6 MiB (without OpenCC) or ~2.7 MiB (with OpenCC enabled).
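
To illustrate what flattening a chained dictionary group means, here is a minimal Python sketch. It assumes a two-stage chain (Simplified to Traditional, then Traditional to Taiwan phrases) with toy rules, and it uses a naive dict-based matcher rather than an Aho-Corasick automaton. The `convert` and `flatten` helpers are hypothetical and do not reflect the actual build script.

```python
def convert(text, rules):
    """Leftmost-longest replacement with a plain dict (naive scan, illustration only)."""
    longest = max(map(len, rules), default=0)
    out, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in rules:
                out.append(rules[piece])
                i += size
                break
        else:
            out.append(text[i])  # no rule starts here: copy the character
            i += 1
    return "".join(out)


def flatten(first, second):
    """Merge a two-stage chain (first, then second) into one single-pass table."""
    flat = dict(second)  # covers text that is already in the intermediate form
    for source, intermediate in first.items():
        flat[source] = convert(intermediate, second)  # pre-apply stage two
    return flat


s2t = {"鼠标": "鼠標", "软件": "軟件"}   # toy Simplified -> Traditional rules
t2tw = {"鼠標": "滑鼠", "軟件": "軟體"}  # toy Traditional -> Taiwan phrase rules
s2tw = flatten(s2t, t2tw)

text = "鼠标和软件"
print(convert(convert(text, s2t), t2tw))  # two passes: 滑鼠和軟體
print(convert(text, s2tw))                # one pass:   滑鼠和軟體
```

With the flattened table, one pass over the input yields the same result as the two chained passes for these toy rules. Real rulesets have overlapping rules that need more careful merging than this sketch attempts.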

Performance

Even with all dictionaries enabled, zhconv-rs remains faster than most alternatives. Check with cargo bench compare --features opencc:

[Figures: comparison with other crates, targeting zh-Hans; comparison with other crates, targeting zh-TW]

Conversion runs in a single pass in O(n+m) time by default, where n is the length of the input text and m is the maximum length of a source word in the dictionaries, regardless of which dictionaries are enabled. When converting wikitext that contains MediaWiki conversion rules, the time complexity may degrade to O(n*m) in the worst case, but only if the corresponding function or flag is explicitly chosen.

On a typical modern PC, prebuilt converters load in a few milliseconds with default features (~2–5 ms). Enabling the optional opencc feature increases load time, typically to 20–25 ms for Traditional Chinese targets, while Simplified Chinese targets stay under 4 ms in the benchmark below. Throughput generally ranges from 100 to 200 MB/s.

Results of cargo bench --features opencc on an AMD EPYC 7B13 (GitPod), measured with v0.3:

w/ default features
load/zh2Hant            time:   [4.6368 ms 4.6862 ms 4.7595 ms]
load/zh2Hans            time:   [2.2670 ms 2.2891 ms 2.3138 ms]
load/zh2TW              time:   [4.7115 ms 4.7543 ms 4.8001 ms]
load/zh2HK              time:   [5.4438 ms 5.5474 ms 5.6573 ms]
load/zh2MO              time:   [4.9503 ms 4.9673 ms 4.9850 ms]
load/zh2CN              time:   [3.0809 ms 3.1046 ms 3.1323 ms]
load/zh2SG              time:   [3.0543 ms 3.0637 ms 3.0737 ms]
load/zh2MY              time:   [3.0514 ms 3.0640 ms 3.0787 ms]
zh2CN wikitext basic    time:   [385.95 µs 388.53 µs 391.39 µs]
zh2TW wikitext basic    time:   [393.70 µs 395.16 µs 396.89 µs]
zh2TW wikitext extended time:   [1.5105 ms 1.5186 ms 1.5271 ms]
zh2CN 天乾物燥          time:   [46.970 ns 47.312 ns 47.721 ns]
zh2TW data54k           time:   [200.72 µs 201.54 µs 202.41 µs]
zh2CN data54k           time:   [231.55 µs 232.86 µs 234.30 µs]
zh2Hant data689k        time:   [2.0330 ms 2.0513 ms 2.0745 ms]
zh2TW data689k          time:   [1.9710 ms 1.9790 ms 1.9881 ms]
zh2Hant data3185k       time:   [15.199 ms 15.260 ms 15.332 ms]
zh2TW data3185k         time:   [15.346 ms 15.464 ms 15.629 ms]
zh2TW data55m           time:   [329.54 ms 330.53 ms 331.58 ms]
is_hans data55k         time:   [404.73 µs 407.11 µs 409.59 µs]
infer_variant data55k   time:   [1.0468 ms 1.0515 ms 1.0570 ms]
is_hans data3185k       time:   [22.442 ms 22.589 ms 22.757 ms]
infer_variant data3185k time:   [60.205 ms 60.412 ms 60.627 ms]
w/ the additional non-default `opencc` feature
load/zh2Hant            time:   [22.074 ms 22.338 ms 22.624 ms]
load/zh2Hans            time:   [2.7913 ms 2.8126 ms 2.8355 ms]
load/zh2TW              time:   [23.068 ms 23.286 ms 23.520 ms]
load/zh2HK              time:   [23.358 ms 23.630 ms 23.929 ms]
load/zh2MO              time:   [23.363 ms 23.627 ms 23.913 ms]
load/zh2CN              time:   [3.6778 ms 3.7222 ms 3.7722 ms]
load/zh2SG              time:   [3.6522 ms 3.6848 ms 3.7202 ms]
load/zh2MY              time:   [3.6642 ms 3.7079 ms 3.7545 ms]
zh2CN wikitext basic    time:   [396.17 µs 402.51 µs 409.36 µs]
zh2TW wikitext basic    time:   [442.16 µs 447.53 µs 453.27 µs]
zh2TW wikitext extended time:   [1.5795 ms 1.6007 ms 1.6233 ms]
zh2CN 天乾物燥          time:   [47.884 ns 48.878 ns 49.953 ns]
zh2TW data54k           time:   [255.25 µs 259.01 µs 262.92 µs]
zh2CN data54k           time:   [233.74 µs 236.99 µs 240.67 µs]
zh2Hant data689k        time:   [3.9696 ms 4.0005 ms 4.0327 ms]
zh2TW data689k          time:   [3.4593 ms 3.4896 ms 3.5203 ms]
zh2Hant data3185k       time:   [27.710 ms 27.955 ms 28.206 ms]
zh2TW data3185k         time:   [30.298 ms 30.858 ms 31.428 ms]
zh2TW data55m           time:   [500.95 ms 515.80 ms 531.34 ms]
is_hans data55k         time:   [461.22 µs 470.99 µs 481.20 µs]
infer_variant data55k   time:   [1.1669 ms 1.1759 ms 1.1852 ms]
is_hans data3185k       time:   [26.609 ms 26.964 ms 27.385 ms]
infer_variant data3185k time:   [74.878 ms 76.262 ms 77.818 ms]
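
As a rough cross-check of the throughput figure, the median timings above can be turned into MB/s. The input sizes below are assumptions inferred from the benchmark names (data3185k as about 3185 KB, data55m as about 55 MB); the README does not state them, so treat the results as estimates only.

```python
# Median timings (seconds) copied from the zh2TW rows of the two benchmark runs above.
# ASSUMPTION: input sizes are guessed from the benchmark names, not documented.
SIZES = {"data3185k": 3185e3, "data55m": 55e6}
MEDIANS = {
    "default": {"data3185k": 15.464e-3, "data55m": 330.53e-3},
    "opencc": {"data3185k": 30.858e-3, "data55m": 515.80e-3},
}


def throughput_mb_per_s(size_bytes, seconds):
    """Bytes processed per second, expressed in MB/s (1 MB = 10**6 bytes)."""
    return size_bytes / seconds / 1e6


for features, timings in MEDIANS.items():
    for name, seconds in timings.items():
        rate = throughput_mb_per_s(SIZES[name], seconds)
        print(f"{features:8} zh2TW {name:10} ~{rate:.0f} MB/s")
```

Under these size assumptions the large inputs land at roughly 100 to 210 MB/s, which is in line with the range quoted above.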

Limitations

Accuracy

Rule-based converters cannot capture every possible linguistic nuance. Like most others, the implementation employs a leftmost-longest matching strategy (a.k.a. forward maximum matching), which prioritizes the earliest and longest matches in the text. For example, if a ruleset contains 干 → 幹, 天干 → 天干, and 天干物燥 → 天乾物燥, the converter applies the longest rule to the input 天干物燥 and outputs 天乾物燥, since that match starts earliest and spans the most characters. This generally works well but may cause occasional mis-conversions.
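
The matching strategy can be sketched in a few lines of Python. This is a naive O(n*m) scan over a plain dict, written only to show the leftmost-longest behavior with the three rules from the example; it is not how zhconv-rs is implemented (which uses prebuilt Aho-Corasick automata), and the function name is hypothetical.

```python
def convert_leftmost_longest(text, rules):
    """At each position, apply the longest rule that matches; otherwise copy one character."""
    max_len = max(map(len, rules), default=0)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            target = rules.get(text[i:i + size])
            if target is not None:
                out.append(target)
                i += size
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


rules = {"干": "幹", "天干": "天干", "天干物燥": "天乾物燥"}
print(convert_leftmost_longest("天干物燥", rules))  # 天乾物燥 (longest rule wins)
print(convert_leftmost_longest("天干地支", rules))  # 天干地支 (天干 shields 干 from 干 -> 幹)
print(convert_leftmost_longest("干燥", rules))      # 幹燥 (mis-conversion: no rule covers 干燥)
```

The last line shows the kind of mis-conversion mentioned above: with this toy ruleset nothing longer than 干 matches, so the single-character rule fires even though 乾燥 would be the expected result.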

Wikitext support

The implementation supports most MediaWiki conversion rules, but it is not fully compliant with the original MediaWiki implementation.

Since rebuilding automata dynamically is impractical, rules embedded in the text (e.g., -{H|zh-hans:鹿;zh-hant:马}- in MediaWiki syntax) are extracted in a first pass, a temporary automaton is constructed from them, and the text is converted in a second pass. The time complexity may degrade to O(n*m) in the worst case, where n is the input text length and m is the maximum length of a source word in the dictionaries, which is equivalent to a brute-force approach.
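
The two-pass idea can be sketched as follows. This is a simplified illustration under stated assumptions: it only handles hidden global rules of the form -{H|tag:text;tag:text}-, ignores variant fallback and every other MediaWiki flag, and uses a naive dict-based matcher in place of the temporary automaton. All function names and the toy base rule are hypothetical.

```python
import re

# Matches hidden global rules such as -{H|zh-cn:...;zh-tw:...}-
GLOBAL_RULE = re.compile(r"-\{H\|(.*?)\}-", re.S)


def extract_rules(text, target):
    """First pass: collect -{H|...}- rules for `target` and strip them from the text."""
    extra = {}

    def collect(match):
        forms = {}
        for item in match.group(1).split(";"):
            tag, sep, form = item.partition(":")
            if sep:
                forms[tag.strip().lower()] = form.strip()
        wanted = forms.get(target.lower())
        if wanted is not None:
            for form in forms.values():
                if form != wanted:
                    extra[form] = wanted  # every other variant's form maps to the target's
        return ""  # an H rule adds a mapping but displays nothing

    return GLOBAL_RULE.sub(collect, text), extra


def convert(text, rules):
    """Second pass: naive leftmost-longest replacement (stand-in for the automaton)."""
    longest = max(map(len, rules), default=0)
    out, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in rules:
                out.append(rules[piece])
                i += size
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def convert_wikitext(text, base_rules, target):
    stripped, extra = extract_rules(text, target)
    return convert(stripped, {**base_rules, **extra})  # inline rules override base rules


base = {"查尔斯": "查爾斯"}  # toy stand-in for the built-in zh-TW ruleset
wikitext = "-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。"
print(convert_wikitext(wikitext, base, "zh-tw"))  # 《孤雛淚》是查爾斯·狄更斯的作品。
```

Merging the extracted rules over the base table mirrors the description above: the rules found in the first pass take effect for the whole text in the second pass.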

Credits

Rulesets/Dictionaries: MediaWiki and OpenCC.

Fast double-array Aho-Corasick automata implementation in Rust: daachorse
