zhconv as in MediaWiki, 🦀oxidized for more efficiency (with OpenCC dicts)

zhconv-rs 中文简繁及地區詞轉換 (Simplified/Traditional Chinese and regional phrase conversion)

zhconv-rs converts Chinese text among traditional/simplified scripts or regional variants (e.g. zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant), backed by rulesets from MediaWiki/Wikipedia and OpenCC.

It uses the Aho-Corasick algorithm, which runs in time linear in the combined length of the input text and the conversion rules (O(n+m)), and processes dozens of MiB of text per second.
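To make the linear-time claim concrete, here is a minimal textbook sketch of Aho-Corasick in pure Python. It is not the crate's implementation (which is written in Rust and applies leftmost-longest replacement); the names `build_automaton` and `find_all` are hypothetical, and the sketch only reports every match rather than rewriting the text.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links and output lists."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(pat)
    # Breadth-first pass: a node's failure link points at the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, pattern) for every match, in one pass over text."""
    goto, fail, out = automaton
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches

automaton = build_automaton(["干", "天干物燥", "物"])
print(sorted(find_all("天干物燥", automaton)))
```

The text is scanned once and never re-read from an earlier position; the failure links are what keep the scan linear.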

🔗 Web app: https://zhconv.pages.dev (powered by WASM)

⚙️ CLI: cargo install zhconv-cli or check releases.

🦀 Rust crate: cargo add zhconv (check docs for examples)

🐍 Python package w/ wheels via PyO3: pip install zhconv-rs or pip install zhconv-rs-opencc (with rulesets from OpenCC)

Python snippet
# > pip install zhconv_rs
# Convert with builtin rulesets:
from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
assert zhconv("霧失樓臺,月迷津渡", "zh-hans") == "雾失楼台,月迷津渡"
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"

# Convert with custom rules:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"

import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画")) # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"

JS (Webpack): npm install zhconv or yarn add zhconv (WASM, instructions)

JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest (WASM)

HTML snippet
<script type="module">
    // Use ES module import syntax to import functionality from the module
    // that we have compiled.
    //
    // Note that the `default` import is an initialization function which
    // will "boot" the module and make it ready to use. Currently browsers
    // don't support natively imported WebAssembly as an ES module, but
    // eventually the manual initialization won't be required!
    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod

    async function run() {
        await init();

        alert(zhconv(prompt("Text to convert to zh-hans:"), "zh-hans"));
    }

    run();
</script>

Supported variants

zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY
| Target | Tag | Script | Description |
| --- | --- | --- | --- |
| Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases. |
| Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases. |
| Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted. |
| Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted. |
| Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now. |
| Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted. |
| Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now. |
| Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now. |

Note: zh-TW and zh-HK are based on zh-Hant, and zh-CN is based on zh-Hans. Currently, zh-MO shares its rulesets with zh-HK, and zh-MY and zh-SG share theirs with zh-CN, unless additional rules are manually configured.
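The relationships in the note can be pictured as a small fallback table. The sketch below is only an illustration of that note: `BASE` and `ruleset_chain` are hypothetical names and do not come from the zhconv-rs API.

```python
# Hypothetical fallback table modelled on the note above; not taken from zhconv-rs.
BASE = {
    "zh-TW": "zh-Hant",
    "zh-HK": "zh-Hant",
    "zh-MO": "zh-HK",   # shares rulesets with zh-HK for now
    "zh-CN": "zh-Hans",
    "zh-SG": "zh-CN",   # shares rulesets with zh-CN for now
    "zh-MY": "zh-CN",   # shares rulesets with zh-CN for now
}

def ruleset_chain(tag):
    """Return the tag followed by the variants it builds on, most specific first."""
    chain = [tag]
    while chain[-1] in BASE:
        chain.append(BASE[chain[-1]])
    return chain

print(ruleset_chain("zh-MO"))  # ['zh-MO', 'zh-HK', 'zh-Hant']
print(ruleset_chain("zh-Hans"))  # ['zh-Hans']
```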

Performance

cargo bench on an AMD EPYC 7B13 (GitPod) with v0.3:

w/ default features
load/zh2Hant            time:   [4.6368 ms 4.6862 ms 4.7595 ms]
load/zh2Hans            time:   [2.2670 ms 2.2891 ms 2.3138 ms]
load/zh2TW              time:   [4.7115 ms 4.7543 ms 4.8001 ms]
load/zh2HK              time:   [5.4438 ms 5.5474 ms 5.6573 ms]
load/zh2MO              time:   [4.9503 ms 4.9673 ms 4.9850 ms]
load/zh2CN              time:   [3.0809 ms 3.1046 ms 3.1323 ms]
load/zh2SG              time:   [3.0543 ms 3.0637 ms 3.0737 ms]
load/zh2MY              time:   [3.0514 ms 3.0640 ms 3.0787 ms]
zh2CN wikitext basic    time:   [385.95 µs 388.53 µs 391.39 µs]
zh2TW wikitext basic    time:   [393.70 µs 395.16 µs 396.89 µs]
zh2TW wikitext extended time:   [1.5105 ms 1.5186 ms 1.5271 ms]
zh2CN 天乾物燥          time:   [46.970 ns 47.312 ns 47.721 ns]
zh2TW data54k           time:   [200.72 µs 201.54 µs 202.41 µs]
zh2CN data54k           time:   [231.55 µs 232.86 µs 234.30 µs]
zh2Hant data689k        time:   [2.0330 ms 2.0513 ms 2.0745 ms]
zh2TW data689k          time:   [1.9710 ms 1.9790 ms 1.9881 ms]
zh2Hant data3185k       time:   [15.199 ms 15.260 ms 15.332 ms]
zh2TW data3185k         time:   [15.346 ms 15.464 ms 15.629 ms]
zh2TW data55m           time:   [329.54 ms 330.53 ms 331.58 ms]
is_hans data55k         time:   [404.73 µs 407.11 µs 409.59 µs]
infer_variant data55k   time:   [1.0468 ms 1.0515 ms 1.0570 ms]
is_hans data3185k       time:   [22.442 ms 22.589 ms 22.757 ms]
infer_variant data3185k time:   [60.205 ms 60.412 ms 60.627 ms]
w/ the additional non-default `opencc` feature
load/zh2Hant            time:   [22.074 ms 22.338 ms 22.624 ms]
load/zh2Hans            time:   [2.7913 ms 2.8126 ms 2.8355 ms]
load/zh2TW              time:   [23.068 ms 23.286 ms 23.520 ms]
load/zh2HK              time:   [23.358 ms 23.630 ms 23.929 ms]
load/zh2MO              time:   [23.363 ms 23.627 ms 23.913 ms]
load/zh2CN              time:   [3.6778 ms 3.7222 ms 3.7722 ms]
load/zh2SG              time:   [3.6522 ms 3.6848 ms 3.7202 ms]
load/zh2MY              time:   [3.6642 ms 3.7079 ms 3.7545 ms]
zh2CN wikitext basic    time:   [396.17 µs 402.51 µs 409.36 µs]
zh2TW wikitext basic    time:   [442.16 µs 447.53 µs 453.27 µs]
zh2TW wikitext extended time:   [1.5795 ms 1.6007 ms 1.6233 ms]
zh2CN 天乾物燥          time:   [47.884 ns 48.878 ns 49.953 ns]
zh2TW data54k           time:   [255.25 µs 259.01 µs 262.92 µs]
zh2CN data54k           time:   [233.74 µs 236.99 µs 240.67 µs]
zh2Hant data689k        time:   [3.9696 ms 4.0005 ms 4.0327 ms]
zh2TW data689k          time:   [3.4593 ms 3.4896 ms 3.5203 ms]
zh2Hant data3185k       time:   [27.710 ms 27.955 ms 28.206 ms]
zh2TW data3185k         time:   [30.298 ms 30.858 ms 31.428 ms]
zh2TW data55m           time:   [500.95 ms 515.80 ms 531.34 ms]
is_hans data55k         time:   [461.22 µs 470.99 µs 481.20 µs]
infer_variant data55k   time:   [1.1669 ms 1.1759 ms 1.1852 ms]
is_hans data3185k       time:   [26.609 ms 26.964 ms 27.385 ms]
infer_variant data3185k time:   [74.878 ms 76.262 ms 77.818 ms]
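
As a rough reading of the figures above, the median zh2TW timings can be turned into throughput. This is a back-of-the-envelope sketch: the input sizes are an assumption inferred from the benchmark names (data55m taken as about 55 MiB, data3185k as about 3185 KiB), since the actual file sizes are not stated here.

```python
MIB = 1024 * 1024

def throughput_mib_per_s(size_bytes, seconds):
    """Convert an input size and a wall-clock time into MiB per second."""
    return size_bytes / MIB / seconds

# Assumed input sizes (not stated in the source), inferred from the names.
SIZES = {"data55m": 55 * MIB, "data3185k": 3185 * 1024}

# Median zh2TW timings in seconds, copied from the benchmark output above.
MEDIANS = {
    "default": {"data55m": 330.53e-3, "data3185k": 15.464e-3},
    "opencc": {"data55m": 515.80e-3, "data3185k": 30.858e-3},
}

for feature, timings in MEDIANS.items():
    for name, seconds in timings.items():
        rate = throughput_mib_per_s(SIZES[name], seconds)
        print(f"{feature:8s} {name:10s} {rate:6.1f} MiB/s")
```

Under these assumptions the results fall at roughly 100 to 200 MiB/s, which is at least consistent with the "dozens of MiB per second" claim.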

By default, only rulesets from MediaWiki are used. The opencc feature can be enabled with zhconv = { version = "...", features = [ "opencc" ] }. Note that, besides slowing conversion down, it adds at least several MiB to the build output.

Limitations

Accuracy

A rule-based converter cannot capture every linguistic nuance, so its accuracy is limited. In addition, the converter employs a leftmost-longest matching strategy: it prefers the match that starts earliest in the text and, among matches starting at the same position, the longest one. For instance, if a ruleset includes both 干 -> 幹 and 天干物燥 -> 天乾物燥, then for the input 天干物燥 the converter applies 天干物燥 -> 天乾物燥, because that match starts earlier than the standalone 干, which begins at a later position. This approach generally produces accurate results but may occasionally lead to incorrect conversions.
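
The leftmost-longest strategy can be sketched with a deliberately naive scan. This is an illustration only, not how zhconv-rs is implemented (it does not use Aho-Corasick and is not linear-time); `convert_leftmost_longest` is a hypothetical helper.

```python
def convert_leftmost_longest(text, rules):
    """Replace substrings of text using leftmost-longest matching.

    rules maps source phrases to their replacements.
    """
    max_len = max((len(src) for src in rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # At the current (leftmost) position, try the longest candidate first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in rules:
                out.append(rules[candidate])
                i += length
                break
        else:
            # No rule starts here: copy one character and move on.
            out.append(text[i])
            i += 1
    return "".join(out)

rules = {"干": "幹", "天干物燥": "天乾物燥"}
print(convert_leftmost_longest("天干物燥", rules))  # 天乾物燥
print(convert_leftmost_longest("干", rules))  # 幹
```

The four-character rule wins on the first input because its match starts at position 0, before the single-character rule could apply at position 1.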

Wikitext support

While the implementation supports most MediaWiki conversion rules, it is not fully compliant with the original MediaWiki implementation.

For wikitext inputs containing global conversion rules (e.g., -{H|zh-hans:鹿;zh-hant:马}- in MediaWiki syntax), the implementation's time complexity may degrade to O(n*m) in the worst case, where n is the length of the input text and m is the maximum length of source words in the ruleset. This is equivalent to a brute-force approach.
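
For readers unfamiliar with the markup, the sketch below splits an inline rule of the kind used in the Python examples earlier into its flags and per-variant texts. It is a simplified illustration, not the crate's parser: `parse_rule` is a hypothetical helper that handles only the plain `-{flags|variant:text;...}-` shape and ignores the rest of MediaWiki's conversion syntax.

```python
import re

# Optional "flags|" prefix, then the rule body, wrapped in -{ ... }-.
RULE_RE = re.compile(r"-\{(?:(?P<flags>[A-Za-z;]*)\|)?(?P<body>.*?)\}-", re.S)

def parse_rule(markup):
    """Return (flags, {variant: text}) for one simple conversion rule."""
    m = RULE_RE.fullmatch(markup)
    if m is None:
        raise ValueError("not a conversion rule")
    flags = m.group("flags") or ""
    mapping = {}
    for item in m.group("body").split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        variant, sep, text = item.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in {item!r}")
        mapping[variant.strip().lower()] = text.strip()
    return flags, mapping

flags, mapping = parse_rule("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;}-")
print(flags)            # H
print(mapping["zh-tw"])  # 孤雛淚
```

A rule carrying the H flag changes the active ruleset from that point in the text onward, which is why such inputs cannot rely solely on an automaton built ahead of time.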

Credits

Rulesets/Dictionaries: MediaWiki and OpenCC.
