
zhconv as in MediaWiki, 🦀oxidized for more efficiency (with OpenCC dicts)



zhconv-rs 中文简繁及地區詞轉換 (Chinese simplified/traditional and regional phrase conversion)

zhconv-rs converts Chinese text among traditional/simplified scripts or regional variants (e.g. zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant), backed by rulesets from MediaWiki/Wikipedia and OpenCC.

It uses the Aho-Corasick algorithm to run in time linear in the combined length of the input text and the conversion rules (O(n+m)), processing dozens of MiB of text per second.
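To illustrate where the linear bound comes from, here is a textbook Aho-Corasick sketch in pure Python. It is not the crate's implementation: the function names `build_automaton` and `find_all` are hypothetical, and the sketch only reports matches rather than performing replacements. Building the automaton visits each rule character once, and scanning visits each input character once, following failure links instead of restarting.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links (illustrative, not the crate's code)."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Breadth-first pass: each node's failure link points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix state as well.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence of any pattern."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


automaton = build_automaton(["干", "天干物燥", "物燥"])
print(find_all("天干物燥", automaton))
```

A converter would still need to choose among overlapping hits; the Accuracy section describes the leftmost-longest policy used for that.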

🔗 Web app: https://zhconv.pages.dev (powered by WASM)

⚙️ CLI: cargo install zhconv-cli or check releases.

🦀 Rust crate: cargo add zhconv (check docs for examples)

🐍 Python package w/ wheels via PyO3: pip install zhconv-rs or pip install zhconv-rs-opencc (with rulesets from OpenCC)

Python snippet
# > pip install zhconv_rs
# Convert with builtin rulesets:
from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
assert zhconv("霧失樓臺,月迷津渡", "zh-hans") == "雾失楼台,月迷津渡"
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"

# Convert with custom rules:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"

import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画")) # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"

JS (Webpack): npm install zhconv or yarn add zhconv (WASM, instructions)

JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest (WASM)

HTML snippet
<script type="module">
    // Use ES module import syntax to import functionality from the module
    // that we have compiled.
    //
    // Note that the `default` import is an initialization function which
    // will "boot" the module and make it ready to use. Currently browsers
    // don't support natively imported WebAssembly as an ES module, but
    // eventually the manual initialization won't be required!
    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod

    async function run() {
        await init();

        alert(zhconv(prompt("Text to convert to zh-hans:"), "zh-hans"));
    }

    run();
</script>

Supported variants

zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY
| Target | Tag | Script | Description |
| --- | --- | --- | --- |
| Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases. |
| Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases. |
| Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted. |
| Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted. |
| Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now. |
| Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted. |
| Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now. |
| Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now. |

Note: zh-TW and zh-HK are based on zh-Hant. zh-CN is based on zh-Hans. Currently, zh-MO shares the same rulesets with zh-HK unless additional rules are manually configured; zh-MY and zh-SG share the same rulesets with zh-CN unless additional rules are manually configured.
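The relationships in the note can be pictured as a small inheritance chain. The sketch below is only a model of that note: the `PARENT` table and `ruleset_chain` function are hypothetical names, and how the library actually composes its rulesets internally is not specified here.

```python
# Hypothetical model of the note above: each variant builds on a base.
PARENT = {
    "zh-TW": "zh-Hant",
    "zh-HK": "zh-Hant",
    "zh-MO": "zh-HK",   # same rulesets as zh-HK for now
    "zh-CN": "zh-Hans",
    "zh-SG": "zh-CN",   # same rulesets as zh-CN for now
    "zh-MY": "zh-CN",   # same rulesets as zh-CN for now
}


def ruleset_chain(variant):
    """List the variants whose rules apply, from the base script outward."""
    chain = [variant]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain[::-1]


print(ruleset_chain("zh-MO"))  # ['zh-Hant', 'zh-HK', 'zh-MO']
```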

Performance

cargo bench results for v0.3 on AMD EPYC 7B13 (GitPod):

w/ default features
load/zh2Hant            time:   [4.6368 ms 4.6862 ms 4.7595 ms]
load/zh2Hans            time:   [2.2670 ms 2.2891 ms 2.3138 ms]
load/zh2TW              time:   [4.7115 ms 4.7543 ms 4.8001 ms]
load/zh2HK              time:   [5.4438 ms 5.5474 ms 5.6573 ms]
load/zh2MO              time:   [4.9503 ms 4.9673 ms 4.9850 ms]
load/zh2CN              time:   [3.0809 ms 3.1046 ms 3.1323 ms]
load/zh2SG              time:   [3.0543 ms 3.0637 ms 3.0737 ms]
load/zh2MY              time:   [3.0514 ms 3.0640 ms 3.0787 ms]
zh2CN wikitext basic    time:   [385.95 µs 388.53 µs 391.39 µs]
zh2TW wikitext basic    time:   [393.70 µs 395.16 µs 396.89 µs]
zh2TW wikitext extended time:   [1.5105 ms 1.5186 ms 1.5271 ms]
zh2CN 天乾物燥          time:   [46.970 ns 47.312 ns 47.721 ns]
zh2TW data54k           time:   [200.72 µs 201.54 µs 202.41 µs]
zh2CN data54k           time:   [231.55 µs 232.86 µs 234.30 µs]
zh2Hant data689k        time:   [2.0330 ms 2.0513 ms 2.0745 ms]
zh2TW data689k          time:   [1.9710 ms 1.9790 ms 1.9881 ms]
zh2Hant data3185k       time:   [15.199 ms 15.260 ms 15.332 ms]
zh2TW data3185k         time:   [15.346 ms 15.464 ms 15.629 ms]
zh2TW data55m           time:   [329.54 ms 330.53 ms 331.58 ms]
is_hans data55k         time:   [404.73 µs 407.11 µs 409.59 µs]
infer_variant data55k   time:   [1.0468 ms 1.0515 ms 1.0570 ms]
is_hans data3185k       time:   [22.442 ms 22.589 ms 22.757 ms]
infer_variant data3185k time:   [60.205 ms 60.412 ms 60.627 ms]
w/ the additional non-default `opencc` feature
load/zh2Hant            time:   [22.074 ms 22.338 ms 22.624 ms]
load/zh2Hans            time:   [2.7913 ms 2.8126 ms 2.8355 ms]
load/zh2TW              time:   [23.068 ms 23.286 ms 23.520 ms]
load/zh2HK              time:   [23.358 ms 23.630 ms 23.929 ms]
load/zh2MO              time:   [23.363 ms 23.627 ms 23.913 ms]
load/zh2CN              time:   [3.6778 ms 3.7222 ms 3.7722 ms]
load/zh2SG              time:   [3.6522 ms 3.6848 ms 3.7202 ms]
load/zh2MY              time:   [3.6642 ms 3.7079 ms 3.7545 ms]
zh2CN wikitext basic    time:   [396.17 µs 402.51 µs 409.36 µs]
zh2TW wikitext basic    time:   [442.16 µs 447.53 µs 453.27 µs]
zh2TW wikitext extended time:   [1.5795 ms 1.6007 ms 1.6233 ms]
zh2CN 天乾物燥          time:   [47.884 ns 48.878 ns 49.953 ns]
zh2TW data54k           time:   [255.25 µs 259.01 µs 262.92 µs]
zh2CN data54k           time:   [233.74 µs 236.99 µs 240.67 µs]
zh2Hant data689k        time:   [3.9696 ms 4.0005 ms 4.0327 ms]
zh2TW data689k          time:   [3.4593 ms 3.4896 ms 3.5203 ms]
zh2Hant data3185k       time:   [27.710 ms 27.955 ms 28.206 ms]
zh2TW data3185k         time:   [30.298 ms 30.858 ms 31.428 ms]
zh2TW data55m           time:   [500.95 ms 515.80 ms 531.34 ms]
is_hans data55k         time:   [461.22 µs 470.99 µs 481.20 µs]
infer_variant data55k   time:   [1.1669 ms 1.1759 ms 1.1852 ms]
is_hans data3185k       time:   [26.609 ms 26.964 ms 27.385 ms]
infer_variant data3185k time:   [74.878 ms 76.262 ms 77.818 ms]

By default, only rulesets from MediaWiki are used. The opencc feature can be enabled with zhconv = { version = "...", features = [ "opencc" ] }. Note that, besides reducing performance, it adds at least several MiB to the build output.

Limitations

Accuracy

A rule-based converter cannot capture every linguistic nuance, so its accuracy is limited. In addition, the converter uses a leftmost-longest matching strategy: it prefers the earliest match in the text and, among matches starting at the same position, the longest one. For instance, if a ruleset includes both 干 -> 幹 and 天干物燥 -> 天乾物燥, the input 天干物燥 is converted to 天乾物燥, because the match for 天干物燥 starts earlier than the match for 干 at a later position. This approach generally produces accurate results but may occasionally lead to incorrect conversions.
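The matching policy can be sketched in a few lines of pure Python. This is a naive illustration rather than the library's Aho-Corasick-based implementation, and `convert_leftmost_longest` is a hypothetical name. The last call uses placeholder Latin rules to show how committing to the earliest match can shadow a longer rule that starts one character later.

```python
def convert_leftmost_longest(text, rules):
    """Replace using the earliest match; ties are broken by the longest source."""
    max_len = max(map(len, rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            src = text[i:i + length]
            if src in rules:
                out.append(rules[src])
                i += length
                break
        else:
            # No rule starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


rules = {"干": "幹", "天干物燥": "天乾物燥"}
print(convert_leftmost_longest("天干物燥", rules))  # 天乾物燥
print(convert_leftmost_longest("干活", rules))      # 幹活

# Placeholder rules: "ab" matches first, so the longer "bcd" never applies.
print(convert_leftmost_longest("abcd", {"ab": "X", "bcd": "Y"}))  # Xcd
```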

Wikitext support

While the implementation supports most MediaWiki conversion rules, it is not fully compliant with the original MediaWiki implementation.

For wikitext inputs containing global conversion rules (e.g., -{H|zh-hans:鹿;zh-hant:马}- in MediaWiki syntax), the implementation's time complexity may degrade to O(n*m) in the worst case, where n is the input text length and m is the maximum length of source words in the ruleset. This is equivalent to a brute-force approach.
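The following simplified sketch shows why inline global rules push toward brute force: the rule set changes while the text is being scanned, so the scanner tries up to m candidate lengths at each of the n positions. It is an assumption-laden illustration, not the library's code: `convert_with_global_rules` is a hypothetical name, and only the `H` flag with semicolon-separated `variant:word` pairs is handled.

```python
import re

# Matches a global rule such as -{H|zh-cn:...;zh-tw:...;}-
GLOBAL_RULE = re.compile(r"-\{H\|(.*?)\}-")


def convert_with_global_rules(text, target, base_rules):
    """Brute-force conversion that picks up -{H|...}- rules while scanning."""
    target = target.lower()
    rules = dict(base_rules)
    max_len = max(map(len, rules), default=0)
    out = []
    i = 0
    while i < len(text):
        m = GLOBAL_RULE.match(text, i)
        if m:
            pairs = {}
            for item in m.group(1).split(";"):
                if ":" in item:
                    variant, word = item.split(":", 1)
                    pairs[variant.strip().lower()] = word.strip()
            if target in pairs:
                # Every listed form now maps to the target variant's form.
                for word in pairs.values():
                    rules[word] = pairs[target]
                    max_len = max(max_len, len(word))
            i = m.end()  # the rule itself produces no output
            continue
        # Up to max_len lookups per position: O(n*m) overall.
        for length in range(min(max_len, len(text) - i), 0, -1):
            src = text[i:i + length]
            if src in rules:
                out.append(rules[src])
                i += length
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


text = "-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;}-《雾都孤儿》"
print(convert_with_global_rules(text, "zh-tw", {}))  # 《孤雛淚》
```

With an empty base ruleset, only the inline rule is applied; characters not covered by any rule pass through unchanged.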

Credits

Rulesets/Dictionaries: MediaWiki and OpenCC.

