zhconv as in MediaWiki, 🦀oxidized for more efficiency (with OpenCC dicts)

zhconv-rs: Chinese simplified/traditional and regional phrase conversion (中文简繁及地區詞轉換)

zhconv-rs converts Chinese text among traditional/simplified scripts or regional variants (e.g. zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant), backed by rulesets from MediaWiki/Wikipedia and OpenCC.

It uses the Aho-Corasick algorithm, so conversion time is linear in the combined length of the input text and the conversion rules (O(n+m)), and it processes dozens of MiB of text per second.
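To make the linear-time claim more concrete, the following is a minimal pure-Python sketch of an Aho-Corasick automaton that reports every rule source found in a text in a single left-to-right pass. It is an illustration only and not the crate's actual implementation; the names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/output tables for a set of patterns (illustrative only)."""
    goto, fail, out = [{}], [0], [[]]
    # Phase 1: insert every pattern into a trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, scanning text once."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["干", "天干物燥"])
print(find_all("天干物燥 小心火烛", automaton))
```

A converter built on such an automaton would still need to choose among overlapping hits; the sketch only shows the matching pass.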

🔗 Web app (Wasm): https://zhconv.pages.dev (w/ OpenCC dicts)

⚙️ CLI: cargo install zhconv-cli or check releases.

🦀 Rust crate: cargo add zhconv (check docs for examples)

🐍 Python package w/ wheels: pip install zhconv-rs or pip install zhconv-rs-opencc (w/ OpenCC dicts)

Deploy to Cloudflare Workers

🧩 API demo: https://zhconv.bamboo.workers.dev

Python snippet
# > pip install zhconv_rs
# Convert with builtin rulesets:
from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
assert zhconv("霧失樓臺,月迷津渡", "zh-hans") == "雾失楼台,月迷津渡"
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"

# Convert with custom rules:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"

import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画")) # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"

JS (Webpack): npm install zhconv or yarn add zhconv (Wasm, instructions)

JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest (Wasm)

HTML snippet
<script type="module">
    // Use ES module import syntax to import functionality from the module
    // that we have compiled.
    //
    // Note that the `default` import is an initialization function which
    // will "boot" the module and make it ready to use. Currently browsers
    // don't support natively imported WebAssembly as an ES module, but
    // eventually the manual initialization won't be required!
    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod

    async function run() {
        await init();

        alert(zhconv(prompt("Text to convert to zh-hans:"), "zh-hans"));
    }

    run();
</script>

Supported variants

zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY
| Target | Tag | Script | Description |
| --- | --- | --- | --- |
| Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases. |
| Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases. |
| Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted. |
| Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted. |
| Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now. |
| Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted. |
| Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now. |
| Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now. |

Note: zh-TW and zh-HK are based on zh-Hant, and zh-CN is based on zh-Hans. Currently, zh-MO shares the same rulesets as zh-HK, and zh-MY and zh-SG share the same rulesets as zh-CN, unless additional rules are manually configured.
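The relationships in the note above can be pictured as a small lookup table. The sketch below is a hypothetical model of that layering, written for illustration; the `BASED_ON` table and `ruleset_chain` function are not part of the zhconv-rs API, and the library's internal organization may differ.

```python
# Hypothetical table mirroring the note above: each target maps to the
# variant whose rulesets it builds on (None marks a script-level base).
BASED_ON = {
    "zh-Hans": None,
    "zh-Hant": None,
    "zh-CN": "zh-Hans",
    "zh-SG": "zh-CN",
    "zh-MY": "zh-CN",
    "zh-TW": "zh-Hant",
    "zh-HK": "zh-Hant",
    "zh-MO": "zh-HK",
}


def ruleset_chain(target):
    """Return the variants involved, from most generic to most specific."""
    chain = []
    while target is not None:
        chain.append(target)
        target = BASED_ON[target]
    return chain[::-1]


print(ruleset_chain("zh-MO"))
print(ruleset_chain("zh-SG"))
```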

Performance

Results of cargo bench on an AMD EPYC 7B13 (GitPod) with v0.3:

w/ default features
load/zh2Hant            time:   [4.6368 ms 4.6862 ms 4.7595 ms]
load/zh2Hans            time:   [2.2670 ms 2.2891 ms 2.3138 ms]
load/zh2TW              time:   [4.7115 ms 4.7543 ms 4.8001 ms]
load/zh2HK              time:   [5.4438 ms 5.5474 ms 5.6573 ms]
load/zh2MO              time:   [4.9503 ms 4.9673 ms 4.9850 ms]
load/zh2CN              time:   [3.0809 ms 3.1046 ms 3.1323 ms]
load/zh2SG              time:   [3.0543 ms 3.0637 ms 3.0737 ms]
load/zh2MY              time:   [3.0514 ms 3.0640 ms 3.0787 ms]
zh2CN wikitext basic    time:   [385.95 µs 388.53 µs 391.39 µs]
zh2TW wikitext basic    time:   [393.70 µs 395.16 µs 396.89 µs]
zh2TW wikitext extended time:   [1.5105 ms 1.5186 ms 1.5271 ms]
zh2CN 天乾物燥          time:   [46.970 ns 47.312 ns 47.721 ns]
zh2TW data54k           time:   [200.72 µs 201.54 µs 202.41 µs]
zh2CN data54k           time:   [231.55 µs 232.86 µs 234.30 µs]
zh2Hant data689k        time:   [2.0330 ms 2.0513 ms 2.0745 ms]
zh2TW data689k          time:   [1.9710 ms 1.9790 ms 1.9881 ms]
zh2Hant data3185k       time:   [15.199 ms 15.260 ms 15.332 ms]
zh2TW data3185k         time:   [15.346 ms 15.464 ms 15.629 ms]
zh2TW data55m           time:   [329.54 ms 330.53 ms 331.58 ms]
is_hans data55k         time:   [404.73 µs 407.11 µs 409.59 µs]
infer_variant data55k   time:   [1.0468 ms 1.0515 ms 1.0570 ms]
is_hans data3185k       time:   [22.442 ms 22.589 ms 22.757 ms]
infer_variant data3185k time:   [60.205 ms 60.412 ms 60.627 ms]
w/ the additional non-default `opencc` feature
load/zh2Hant            time:   [22.074 ms 22.338 ms 22.624 ms]
load/zh2Hans            time:   [2.7913 ms 2.8126 ms 2.8355 ms]
load/zh2TW              time:   [23.068 ms 23.286 ms 23.520 ms]
load/zh2HK              time:   [23.358 ms 23.630 ms 23.929 ms]
load/zh2MO              time:   [23.363 ms 23.627 ms 23.913 ms]
load/zh2CN              time:   [3.6778 ms 3.7222 ms 3.7722 ms]
load/zh2SG              time:   [3.6522 ms 3.6848 ms 3.7202 ms]
load/zh2MY              time:   [3.6642 ms 3.7079 ms 3.7545 ms]
zh2CN wikitext basic    time:   [396.17 µs 402.51 µs 409.36 µs]
zh2TW wikitext basic    time:   [442.16 µs 447.53 µs 453.27 µs]
zh2TW wikitext extended time:   [1.5795 ms 1.6007 ms 1.6233 ms]
zh2CN 天乾物燥          time:   [47.884 ns 48.878 ns 49.953 ns]
zh2TW data54k           time:   [255.25 µs 259.01 µs 262.92 µs]
zh2CN data54k           time:   [233.74 µs 236.99 µs 240.67 µs]
zh2Hant data689k        time:   [3.9696 ms 4.0005 ms 4.0327 ms]
zh2TW data689k          time:   [3.4593 ms 3.4896 ms 3.5203 ms]
zh2Hant data3185k       time:   [27.710 ms 27.955 ms 28.206 ms]
zh2TW data3185k         time:   [30.298 ms 30.858 ms 31.428 ms]
zh2TW data55m           time:   [500.95 ms 515.80 ms 531.34 ms]
is_hans data55k         time:   [461.22 µs 470.99 µs 481.20 µs]
infer_variant data55k   time:   [1.1669 ms 1.1759 ms 1.1852 ms]
is_hans data3185k       time:   [26.609 ms 26.964 ms 27.385 ms]
infer_variant data3185k time:   [74.878 ms 76.262 ms 77.818 ms]
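To relate these timings to throughput, a timing can be divided into the input size. The sketch below does that arithmetic for two rows of the default-features table. The input sizes are an assumption inferred from the benchmark names (data3185k as roughly 3185 KiB, data55m as roughly 55 MiB); the source does not state them, and `throughput_mib_per_s` is a hypothetical helper.

```python
def throughput_mib_per_s(size_mib, seconds):
    """Convert an input size and a wall-clock time into MiB per second."""
    return size_mib / seconds


# Median timings copied from the default-features results above.
# Sizes are ASSUMED from the benchmark names, not stated by the source.
samples = {
    "zh2TW data3185k": (3185 / 1024, 15.464e-3),
    "zh2TW data55m": (55.0, 330.53e-3),
}

for name, (size_mib, seconds) in samples.items():
    print(f"{name}: {throughput_mib_per_s(size_mib, seconds):.0f} MiB/s")
```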

Limitations

Accuracy

A rule-based converter cannot capture every linguistic nuance, so its accuracy is limited. In addition, the converter uses a leftmost-longest matching strategy: among candidate matches it prefers the one that starts earliest in the text and, among those starting at the same position, the longest one. For instance, if a ruleset includes both 干 -> 幹 and 天干物燥 -> 天乾物燥, the input 天干物燥 is converted to 天乾物燥, because the match for 天干物燥 starts earlier than the match for 干 inside it. This approach generally produces accurate results but may occasionally lead to incorrect conversions.
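The leftmost-longest strategy described above can be sketched in a few lines of pure Python. This is a deliberately naive illustration that tries every candidate length at each position; it is not how zhconv-rs is implemented, and `convert_leftmost_longest` is a hypothetical name.

```python
def convert_leftmost_longest(text, rules):
    """Replace rule sources in text, preferring the earliest, then longest, match."""
    max_len = max(map(len, rules), default=0)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            src = text[i:i + size]
            if src in rules:
                out.append(rules[src])
                i += size
                break
        else:
            # No rule starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


rules = {"干": "幹", "天干物燥": "天乾物燥"}
print(convert_leftmost_longest("天干物燥", rules))
print(convert_leftmost_longest("干部", rules))
```

In the first call the four-character rule wins because it starts at position 0; the single-character rule only applies where no longer rule starts earlier.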

Wikitext support

While the implementation supports most MediaWiki conversion rules, it is not fully compliant with the original MediaWiki implementation.
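For readers unfamiliar with the syntax, the sketch below shows roughly what resolving a flagless inline rule such as -{zh-hans:...;zh-tw:...}- involves. It is a simplified illustration, not the library's parser: it ignores nesting, leaves rules with flags (such as H|) untouched, and falls back to the first listed alternative when the target is absent, which is an assumption of this sketch rather than MediaWiki's actual fallback behavior. The name `apply_inline_rules` is hypothetical.

```python
import re

# Matches a non-nested -{ ... }- block.
INLINE_RULE = re.compile(r"-\{(.*?)\}-", re.S)


def apply_inline_rules(text, target):
    """Resolve flagless inline rules for one target variant (simplified)."""
    def pick(match):
        body = match.group(1)
        if "|" in body:
            # Rules with flags (e.g. H|) are out of scope for this sketch.
            return match.group(0)
        pairs = {}
        for item in body.split(";"):
            if ":" in item:
                variant, _, value = item.partition(":")
                pairs[variant.strip().lower()] = value.strip()
        if not pairs:
            # -{text}- without variants: keep the text as-is.
            return body
        # Assumed fallback: the first listed alternative.
        return pairs.get(target.lower(), next(iter(pairs.values())))

    return INLINE_RULE.sub(pick, text)


sample = "《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》"
print(apply_inline_rules(sample, "zh-tw"))
```

Note that the sketch only resolves the marked-up span; converting the surrounding text is a separate step.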

For wikitext inputs containing global conversion rules (e.g., -{H|zh-hans:鹿;zh-hant:马}- in MediaWiki syntax), the implementation's time complexity may degrade to O(n*m) in the worst case, where n is the length of the input text and m is the maximum length of the source words in the ruleset. This is equivalent to a brute-force approach.

Credits

Rulesets/Dictionaries: MediaWiki and OpenCC.

Download files

Source Distribution

  • zhconv_rs_opencc-0.3.3.tar.gz (6.3 MB): Source

Built Distributions

  • zhconv_rs_opencc-0.3.3-cp39-abi3-win_amd64.whl (3.5 MB): CPython 3.9+, Windows x86-64
  • zhconv_rs_opencc-0.3.3-cp39-abi3-win32.whl (3.4 MB): CPython 3.9+, Windows x86
  • zhconv_rs_opencc-0.3.3-cp39-abi3-musllinux_1_2_x86_64.whl (4.0 MB): CPython 3.9+, musllinux (musl 1.2+) x86-64
  • zhconv_rs_opencc-0.3.3-cp39-abi3-musllinux_1_2_i686.whl (3.9 MB): CPython 3.9+, musllinux (musl 1.2+) i686
  • zhconv_rs_opencc-0.3.3-cp39-abi3-musllinux_1_2_armv7l.whl (4.0 MB): CPython 3.9+, musllinux (musl 1.2+) ARMv7l
  • zhconv_rs_opencc-0.3.3-cp39-abi3-musllinux_1_2_aarch64.whl (3.9 MB): CPython 3.9+, musllinux (musl 1.2+) ARM64
  • zhconv_rs_opencc-0.3.3-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB): CPython 3.9+, manylinux (glibc 2.17+) x86-64
  • zhconv_rs_opencc-0.3.3-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (3.9 MB): CPython 3.9+, manylinux (glibc 2.17+) s390x
  • zhconv_rs_opencc-0.3.3-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (3.9 MB): CPython 3.9+, manylinux (glibc 2.17+) ppc64le
  • zhconv_rs_opencc-0.3.3-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (3.7 MB): CPython 3.9+, manylinux (glibc 2.17+) ARMv7l
  • zhconv_rs_opencc-0.3.3-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.8 MB): CPython 3.9+, manylinux (glibc 2.17+) ARM64
  • zhconv_rs_opencc-0.3.3-cp39-abi3-manylinux_2_5_i686.manylinux1_i686.whl (3.8 MB): CPython 3.9+, manylinux (glibc 2.5+) i686
  • zhconv_rs_opencc-0.3.3-cp39-abi3-macosx_11_0_arm64.whl (3.6 MB): CPython 3.9+, macOS 11.0+ ARM64
  • zhconv_rs_opencc-0.3.3-cp39-abi3-macosx_10_12_x86_64.whl (3.7 MB): CPython 3.9+, macOS 10.12+ x86-64

