zhconv as in MediaWiki, oxidized with more efficiency 🦀


zhconv-rs: conversion between Simplified and Traditional Chinese and regional phrases (中文简繁及地區詞轉換)

zhconv-rs converts Chinese text among several scripts and regional variants (e.g. zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant). It is built on top of the zhConversion.php conversion tables from MediaWiki, the same tables used on Chinese Wikipedia.

🔗 Web App: https://zhconv.pages.dev (powered by WASM)

⚙️ CLI: cargo install zhconv-cli, or check the releases.

🦀 Rust Crate: cargo add zhconv (see doc comments and cli/ for examples)

🐍 Python Package via PyO3: pip install zhconv-rs (WASM with wheels)

Python snippet
# Convert with builtin rulesets:
from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
assert zhconv("霧失樓臺,月迷津渡", "zh-hans") == "雾失楼台,月迷津渡"
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", mediawiki=True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"

# Convert with custom rules:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"

import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画")) # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"

JS (Webpack): npm install zhconv or yarn add zhconv (WASM, instructions)

JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest (WASM)

HTML snippet
<script type="module">
    // Use ES module import syntax to import functionality from the module
    // that we have compiled.
    //
    // Note that the `default` import is an initialization function which
    // will "boot" the module and make it ready to use. Currently browsers
    // don't support natively imported WebAssembly as an ES module, but
    // eventually the manual initialization won't be required!
    import init, { zhconv } from 'https://cdn.jsdelivr.net/npm/zhconv-web@latest/zhconv.js'; // specify a version tag if in prod

    async function run() {
        await init();

        alert(zhconv(prompt("Text to convert to zh-hans:"), "zh-hans"));
    }

    run();
</script>

Supported variants

zh-Hant, zh-Hans, zh-TW, zh-HK, zh-MO, zh-CN, zh-SG, zh-MY
| Target | Tag | Script | Description |
| --- | --- | --- | --- |
| Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases. |
| Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases. |
| Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted. |
| Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted. |
| Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now. |
| Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted. |
| Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now. |
| Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now. |

Note: zh-TW and zh-HK are based on zh-Hant, and zh-CN is based on zh-Hans. Currently, zh-MO shares the same conversion table with zh-HK unless additional rules / CGroups are applied; zh-MY and zh-SG share the same conversion table with zh-CN unless additional rules / CGroups are applied.
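
The relationships in the note above can be pictured as a small fallback map. This is only an illustrative sketch: the names PARENT and table_chain are hypothetical and not part of zhconv-rs, and the library may organize its tables differently.

```python
# Hypothetical map from each variant to the variant it is based on
# (or currently shares a table with), as described in the note above.
PARENT = {
    "zh-tw": "zh-hant",
    "zh-hk": "zh-hant",
    "zh-mo": "zh-hk",   # same table as zh-HK for now
    "zh-cn": "zh-hans",
    "zh-sg": "zh-cn",   # same table as zh-CN for now
    "zh-my": "zh-cn",   # same table as zh-CN for now
}


def table_chain(tag):
    """Return the variant followed by every variant it builds on."""
    tag = tag.lower()
    chain = [tag]
    while tag in PARENT:
        tag = PARENT[tag]
        chain.append(tag)
    return chain


print(table_chain("zh-MO"))    # ['zh-mo', 'zh-hk', 'zh-hant']
print(table_chain("zh-Hans"))  # ['zh-hans']
```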

Performance

cargo bench on Intel(R) Xeon(R) CPU @ 2.80GHz (GitPod), without parsing inline conversion rules:

load zh2Hant            time:   [45.442 ms 45.946 ms 46.459 ms]
load zh2Hans            time:   [8.1378 ms 8.3787 ms 8.6414 ms]
load zh2TW              time:   [60.209 ms 61.261 ms 62.407 ms]
load zh2HK              time:   [89.457 ms 90.847 ms 92.297 ms]
load zh2MO              time:   [96.670 ms 98.063 ms 99.586 ms]
load zh2CN              time:   [27.850 ms 28.520 ms 29.240 ms]
load zh2SG              time:   [28.175 ms 28.963 ms 29.796 ms]
load zh2MY              time:   [27.142 ms 27.635 ms 28.143 ms]
zh2TW data54k           time:   [546.10 us 553.14 us 561.24 us]
zh2CN data54k           time:   [504.34 us 511.22 us 518.59 us]
zh2Hant data689k        time:   [3.4375 ms 3.5182 ms 3.6013 ms]
zh2TW data689k          time:   [3.6062 ms 3.6784 ms 3.7545 ms]
zh2Hant data3185k       time:   [62.457 ms 64.257 ms 66.099 ms]
zh2TW data3185k         time:   [60.217 ms 61.348 ms 62.556 ms]
zh2TW data55m           time:   [1.0773 s 1.0872 s 1.0976 s]
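
To put the conversion timings in perspective, the sketch below turns the median zh2TW figures into throughput. It rests on an assumption that the README does not state: that the number in each dataset name is the approximate input size, with "k" meaning 1,000 bytes and "m" meaning 1,000,000 bytes. Under that reading, the zh2TW figures work out to roughly 50 to 190 MB/s.

```python
# Assumption (not stated in the benchmark output): the dataset name encodes
# the approximate input size, "k" = 1,000 bytes and "m" = 1,000,000 bytes.
# Each entry is (assumed size in bytes, median time in seconds).
BENCHMARKS = {
    "zh2TW data54k": (54_000, 553.14e-6),
    "zh2TW data689k": (689_000, 3.6784e-3),
    "zh2TW data3185k": (3_185_000, 61.348e-3),
    "zh2TW data55m": (55_000_000, 1.0872),
}


def throughput_mb_per_s(size_bytes, seconds):
    """Throughput in decimal megabytes per second."""
    return size_bytes / seconds / 1e6


THROUGHPUT = {
    name: throughput_mb_per_s(size, seconds)
    for name, (size, seconds) in BENCHMARKS.items()
}

for name, mbps in THROUGHPUT.items():
    print(f"{name}: {mbps:.1f} MB/s")
```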

Limitations

The converter is built on an Aho-Corasick automaton with the leftmost-longest matching strategy. That is, the leftmost match always takes priority, and among matches starting at the same position the longest one wins. For example, if both 干 -> 幹 and 天干物燥 -> 天乾物燥 are specified in a ruleset, 天干物燥 is converted to 天乾物燥, because the phrase rule matches at the first character, earlier than 干 could match at the second. The strategy works well most of the time, but it can occasionally produce unexpected results.
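
The matching strategy can be illustrated with a naive pure-Python stand-in. This is a minimal sketch and not the library's implementation: it scans with a dictionary instead of an Aho-Corasick automaton, and the function name convert_leftmost_longest is hypothetical.

```python
def convert_leftmost_longest(text, rules):
    """Naive leftmost-longest replacement over a dict of source -> target."""
    max_len = max((len(src) for src in rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # At position i, try the longest possible candidate first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            src = text[i:i + length]
            if src in rules:
                out.append(rules[src])
                i += length
                break
        else:
            # No rule starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


rules = {"干": "幹", "天干物燥": "天乾物燥"}
print(convert_leftmost_longest("天干物燥", rules))  # 天乾物燥
print(convert_leftmost_longest("干", rules))        # 幹
```

The same sketch also shows how surprises can arise: with the placeholder rules {"ab": "X", "bcd": "Y"}, the input "abcd" becomes "Xcd", because the earlier match on "ab" consumes the "b" that the longer rule "bcd" would have needed.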

Besides, since an automaton cannot feasibly be updated once built, the converter has to (re)build it from scratch for every ruleset. All automata for built-in rulesets (i.e. conversion tables) are built on demand and cached by default. However, the rebuild overhead becomes significant when a short text contains global conversion rules (in MediaWiki syntax, like -{H|zh-hans:鹿|zh-hant:马}-); conversion can then be even less efficient than a naïve implementation.

Credits

All data that powers the converter, including conversion tables and CGroups, comes from the MediaWiki project.

The project takes the following projects/pages as references:

TODO

  • Support Module:CGroup
  • Propagate errors properly with Anyhow and thiserror
  • Python lib
  • More examples in README
